Effectiveness of a Numeracy Intelligent Tutoring System in Kindergarten: A Conceptual Replication

Intelligent Tutoring Systems are a genre of highly adaptive software providing individualized instruction. The current study was a conceptual replication of a previous randomized control trial that incorporated the intelligent tutoring system Native Numbers, a program designed for early numeracy instruction. As a conceptual replication, we kept the method of instruction, the demographics, the number of kindergarten classrooms (n = 3), and the same numeracy and intrinsic motivation screeners as the original study. We changed the time of year of instruction, changed the control group to a wait-control group, added a maintenance assessment for the first group of participants, and included a mathematical language assessment. Analysis of withinand between-group differences using repeated measures ANOVA indicated gains of numeracy were significant only after using Native Numbers (Partial Eta Square = 0.147). Results of intrinsic motivation and mathematical language were not significant. The effect size of numeracy achievement did not reach that of the original study (Partial Eta Square = 0.622). Here, we compared the two studies, discussed plausible reasons for differences in the magnitude of effect sizes, and provided suggestions for future research.

. Given the long-term negative impact of low numeracy prior to entering the primary grades (Geary, 2015), prevention research in preschool and kindergarten is both logical and recommended to prevent the risk of MD and to help build the emerging body of research identifying characteristics of MD and MLD (Morgan et al., 2016;Nelson & Powell, 2018;Nguyen et al., 2016;Penner et al., 2019).
In the United States, the National Association of Education of Young Children (NAEYC, 2012) and the National Council of Teachers of Mathematics (NCTM, 2010) recommended the use of technology as a tool to help develop conceptual understanding of mathematics for young students; and researchers support the use of technology as one means of differentiated instruction (Deunk et al., 2018;Twyman & Sota, 2016). Reported effectiveness of technology to increase numerical skills, however, has been mixed and may be due to methodological issues, the pedagogy and content of the software, or the software's architecture (Cheung & Slavin, 2013;Harskamp, 2014;Moyer-Packenham et al., 2019;Ok et al., 2020;Twyman & Sota, 2016). Thus, evaluating the effectiveness of mathematics instructional software also requires identifying the software's architecture to help understand potential mechanisms driving academic outcomes (e.g., Larkin & Milford, 2018;Moyer-Packenham et al., 2019;Ok et al., 2020).

Intelligent Tutoring Systems
The architectural designs of educational software include, for example, models wherein a user works through a series of items in a linear fashion, models that allow users to adjust instruction based on preference, and models that adapt instruction (VanLehn, 2011). The degree to which a program adapts instruction also varies, prompting discussion among experts in the artificial intelligence for education community regarding the need for operational definitions of the term adaptive and for standards and the criteria necessary for software to qualify as being labeled adaptive (e.g., Aleven, 2015;R. S. Baker, 2016;Ma et al., 2014;Rozo & Real, 2019;Wilson & Scott, 2017; also see https://standards.ieee.org/project/ 2247_1.html for proposed standards). An intelligent tutoring system (ITS) is a distinctly different type of adaptive instructional software designed with complex models that embed individualized instruction in tandem with dynamic assessment (Ma et al., 2014;Pirolli, 2014). The term tutoring, in an ITS, describes an active, adaptive, student-centered, individualized mode of delivery (Bourdeau & Grandbastien, 2010). Ma et al. (2014) explained that an ITS: "computes inferences from student responses, constructs either a persistent multidimensional mod el of the student's psychological states (such as subject matter knowledge, learning strategies, motivations, or emotions) or locates the student's current psychological state in a multidimensional domain model... " and "uses the student modeling functions…to adapt one or more of the tutoring functions" (p. 902).
Critically, the student model is the software design element that differentiates an ITS from other educational instruc tional software (Ma et al., 2014;Pavlik et al., 2013). The student model, also referred to as a learner model (Pelánek, 2017), describes the psychological state (i.e., cognitive state) of the user (Ma et al., 2014).
One potential advantage of utilizing an ITS for instruction is that elements required within the software's architec ture inherently include practices some researchers have recognized as evidence-based: explicit instruction, adapting instruction to individual needs based on assessment, and providing feedback (e.g., Barnes et al., 2016;Clarke et al., 2015;Dennis et al., 2016;Mononen et al., 2014;Nelson & McMaster, 2019;Wang et al., 2016). That is not to say that other instructional software does not include evidence-based practices, nor suggest that inclusion or exclusion of evidence-based practices alone can predict academic outcomes (e.g., Kiru et al., 2018). Nonetheless, meta-analyses and reviews of ITSs revealed that some ITSs were more effective than small-group instruction and almost as effective as one-to-one tutoring (Kulik & Fletcher, 2016;Ma et al., 2014;Steenbergen-Hu & Cooper;VanLehn, 2011). VanLehn, 2011) included the use of an ITS designed for early numeracy instruction. In a systematic review of adaptive early numeracy software in grades pre-school through first grade, (authors, in preparation) identified one unpublished dissertation (Dias, 2016) in which the author included an early numeracy ITS: Native Numbers (www.nativebrain.com, Native Brain, 2014), an ITS designed for delivery via iPads.
The location of the Dias (2016) study was in a private school in the United States. Dias randomly assigned partici pants (n = 57) within three kindergarten classrooms into either the treatment group or the active control group. During regularly scheduled center activity time, both groups worked on supplemental numeracy activities; the treatment group used Native Numbers (Native Brain, 2014) and the active control group worked on paper-and-pencil numeracy enrichment problems with content aligned to Native Numbers' content. The students' classroom teachers monitored all students' activity and provided encouragement or responded to questions regarding technology but did not provide instruction. The student-to-teacher ratio was 1:19. At the individual level, once a student had completed Native Numbers or the enrichment activities, they received post-tests, and moved into business-as-usual mathematics center activities. Dias (2016) reported a substantially large effect size on differences in mathematics outcome scores for participants in the treatment groups compared to the active control groups: η p 2 = 0.622. Based on the need for early numeracy instruction to prevent MD (Geary, 2015;Morgan et al., 2016;Nguyen et al., 2016;Penner et al., 2019), the effectiveness of Native Numbers (Native Brain, 2014) reported by Dias (2016), the call for replication studies (e.g., the Standards of Evidence for Efficacy, Effectiveness and Scale-up Research in Prevention Science: Next Generation, Gottfredson et al., 2015), and the lack of studies incorporating ITSs for early numeracy instruction (authors, in preparation), replication of the Dias (2016) study was warranted. The structure of the remainder of the paper includes the purpose of the current conceptual replication study, descriptions and comparisons of the methods and outcomes of the original study and the current study, and discussion of how the results of both studies provide an opportunity to design future research to examine aspects of numerical cognition that are either emerging lines of research or are missing in the literature (e.g., ordinality, mathematical language, multiple representations, number rods).

Purpose of the Study
The purpose of the current study was to conduct a conceptual replication of the Dias (2016) study. Conceptual replications differ from direct replications in that specific features from the original study are intentionally manipulated to investigate theoretically important variables while maintaining critical elements of the original study (Cai et al., 2018;Coyne et al., 2016). For the current study, we maintained the participant demographics as closely as possible, the treatment instrument (i.e., Native Numbers; Native Brain, 2014), and the two outcome measures. However, we identified four theoretically relevant elements to manipulate: a. implementing the treatment in the spring b. providing the same treatment to a wait-control group c. assessing maintenance of academic gains over time d. assessing mathematical language Dias (2016) conducted the study in the fall, at the beginning of the school year. We reasoned staring a study later in the year allowed exploring the variable of maturation and opportunities for students to experience formal mathematics instruction after entering kindergarten. Changing the control group to a wait-control group allowed measuring mainte nance of any academic gains the first-treatment group may have had, comparing outcomes of two groups of participants receiving treatment at different points in time, and providing a control for the novelty effect from the use of technology.
The fourth change to the Dias (2016) study was the inclusion of a mathematical language assessment. Beyond mapping number words to symbolic or non-symbolic notations, mathematical language includes words that describe re lationships and space; for example, terms such as "greater" and "below" (Hornburg et al., 2018). Mathematical language is critical for understanding concepts of cardinality, ordinality, sets, and magnitude (Hornburg et al., 2018). Despite the emerging evidence of the importance of mathematical language, to date, few experimental studies have included and measured explicit instruction of mathematical language in early childhood; notable exceptions include studies utilizing storybooks in prekindergarten (Hojnoski et al., 2014;Purpura et al., 2017) and kindergarten (Hassinger-Das et al., 2015;Jennings et al., 1992), and a study using manipulatives, hand gestures, and verbal instruction of mathematical terms in Gottfried (1990) reported reliability and validity based on a cross-sectional study and a longitudinal study. Both studies included 7-to 9-year-old participants; although, the ages and number of participants varied for each analysis. Combining the data from both studies, Gottfried reported strong internal consistency for reading motivation (α = 0.82), math motivation (α = 0.84), and motivation for overall school experience (α = 0.82). For the cross-sectional study, two-month test-retest correlations (n = 57) were r = .73 for the reading subscale and r = .73 for the math subscale (p < .001). Correlations between the total score of all scales and researcher developed questions of students' perception of competence were r(96) = .37 and r(105) = .35, p < .001, for the 7-and 8-year-old participants in the cross-sectional and the longitudinal studies respectively. Correlations between teacher reports of student motivation across the subscales ranged from r = .18, p < .05, to r = .25, < .001; correlations by subscale were not reported. Correlations between the Young Children's Academic Intrinsic Motivation Inventory and a modified version of the Children's Academic Anxiety Inventory (Gottfried, 1982(Gottfried, , 1985 were r(96) = −.25, p < .01, for participants in the cross-sectional study and r(103) = −.20, p < .01, for 7-year-old participants in the longitudinal study (see Gottfried, 1990, p. 535).

Numeracy
Similar to Dias (2016), we assessed numeracy via the Number Sense Screener™ (Jordan et al., 2012). The screener includes 29 items, is standardized, and takes approximately 15 minutes to administer one-on-one. Each item answered correctly receives one point. The screener assesses six numerical concepts: counting, number recognition, number com parisons, non-verbal calculation, story problems (oral addition and subtraction), and single-digit number combinations. Additionally, the screener is normed for kindergarten students in the fall and spring and for first-grade students in the fall. Jordan et al. analyzed construct and predictive validity using the Woodcock-Johnson III calculation and applied problem scales (WJ, Woodcock et al., 2007). The correlation between the screener and the WJ composite scores of the subscales produced an average r = .62 for 1 st grade and r = .65 for 3 rd grade (N = 288). The fall of 1 st grade WJ composite scores correlated (r = .72) with the spring of 1 st grade WJ composite scores (n = 279) and with the spring of 3 rd grade WJ composite scores (r = .70; n = 175). Jordan et al. reported an average internal consistency of α = 0.85.

Mathematical Language
To our knowledge, a validated stand-alone kindergarten assessment of mathematical language in English does not exist. Therefore, we chose the Preschool Assessment of Mathematical Language (Purpura & Logan, 2015), a research er-developed measure currently undergoing psychometric evaluation to assess mathematical vocabulary. We obtained permission (Purpura; personal communication) to use the current version of the assessment that includes 16 items drawn from a larger battery (e.g., Purpura & Logan, 2015). The 16 items include six quantitative terms (e.g., more, less) and ten spatial terms (e.g., under, last). Students respond to pictures with oral prompts such as "Point to the box with more dots, " where more is the target vocabulary term. The assessment takes approximately 10 minutes to administer one-on-one and students receive one point for each correct item. Convergent validity has not been reported; however, this measure has shown strong internal consistency across several studies: α = 0.85 in Purpura and Logan (2015), α = 0.78 in Purpura et al. (2017), α = 0.78 in Hornburg et al. (2018), and α = 0.80 in Purpura et al. (2020).

Numeracy Instruction All Participants: Regular Numeracy Instruction
In both the Dias (2016) study and the current study, all participants received regular classroom mathematics instruction, and the treatment groups and control groups received additional supplemental mathematics instruction. Thus, the control groups for both studies were active-control groups. In the current study, all participants received regular daily mathematics instruction using Bridges Number Corner (see https://www.mathlearningcenter.org/number-corner). Dias did not indicate the specific curriculum used at the school site.

Treatment Groups: Native Numbers
For both the Dias (2016) study and the current study, participants in the treatment conditions received instruction via Native Numbers (Native Brain, 2014), an application (i.e., app) designed for use on the iPad, available for free, and downloaded via iTunes. The adaptive algorithm in Native Numbers includes measuring accuracy and, at higher levels of difficulty, response time as a measure of fluency and as an indicator within the architecture to further adapt instruction. Based on the description provided by the parent company of Native Numbers (www.nativebrain.com), the program adapts instruction in real-time, along five levels of academic performance, moving users to more difficult tasks, or returning to earlier concepts when a user experiences difficulty. Once a user has reached the third level, the next set of activities is unlocked. Once unlocked, users can continue working on accuracy and fluency (Levels 4 and 5) on their current activity, move on to the next activity, or move back and forth between any of the unlocked activities.
Native Numbers (Native Brain, 2014) provides instruction for five areas of early numeracy for quantities 1-9: number concepts (mapping the verbal quantity to a visual representation); number relations (instruction of mathematical vocabulary such as more and less); ordinality; counting (one-to-one, counting on, and counting back); and a set of activities named "Demonstrate Mastery" which requires consolidation of all previous learning. Each of these five areas is further sub-divided into five sets of activities for a total of 25 sets of activities (see Appendix A in the Supplementary Materials).
Native Numbers (Native Brain, 2014) incorporates symbolic representation (i.e., Arabic digits) and two non-symbol ic representations of discrete magnitude: (a) sets presented in arrays similar to playing cards and (b) tally marks. Additionally, Native Numbers includes one non-symbolic representation of continuous magnitude: number rods similar to Cuisenaire Rods ® . Users learn to map between quantities for each of the different representations through blocked practice and interleaved practice. Also, as part of the Demonstrate Mastery activities, some activities are designed with reversibility (e.g., Simon et al., 2016); for example, given a target quantity and a starting amount, a user must add or remove a quantity to reach the target quantity. Native Numbers, however, does not include symbolic operational signs.

Control Groups: Numeracy Enrichment Activities
Participants in the Dias (2016) study received additional numeracy instruction through paper and pencil activities aligned to Native Numbers' (Native Brain, 2014) content. Participants in the current study had daily center-based supplemental instruction for approximately 45 minutes in the afternoon. Students worked in small groups with teachers, one-on-one with teachers, or in peer-to-peer activities during this time. The numeracy enrichment activities during this center-based time varied throughout the study, based on the teachers' goals and individual students' needs, and included other instruction such as reading, writing, science, and art (see Appendix B in the Supplementary Materials for a list of the different numeracy activities used during the timeframe of the study).

Procedures
The current study was approved by the University of Texas at Austin Institutional Review Board (IRB). Teachers and parents/guardians provided written consent, and students provided verbal and written assent. Prior to beginning the assessments, we used an online random number generator to randomize the participants by their primary groups into the first-treatment group or the wait-control group. For ethical and pragmatic reasons, all students had the opportunity to use Native Numbers (Native Brain, 2014); however, analyses included only data from the participants who returned consent forms and provided assent (n = 46). The teachers and parents/guardians received outcome scores from all the assessments, and the teachers received typed narrative reports on student progress.

Baseline and Pre-Tests
In the Dias (2016) study, additional school staff helped facilitate the assessments. The teachers at the site of the current study requested that the researchers conduct all the assessments. Pre-testing in the current study occurred in either the large multipurpose rooms or smaller areas throughout the building. Unlike Dias, we included an assessment of mathematical language and combined this assessment with the Number Sense Screener™ (Jordan et al., 2012) into one sitting, administered one-on-one. After completing all the assessments, participants began instruction in their separate groups; however, due to the large number of participants, we phased in the first-treatment group over three days.
Assessment of intrinsic motivation via the Young Children's Academic Intrinsic Motivation Inventory (Gottfried, 1990) differed significantly in the current study from the Dias (2016) study. Specifically, during a pilot administering the inventory to nine kindergarten students, we found that students had difficulty understanding the directions for responding to negative statements. For example, we included three practice statements as pre-training to ensure participants understood the format and how to respond. For negative statements such as "I do not like ice-cream, " participants responded "very true" when in fact, the students meant the opposite response; they did like ice-cream. This confusion occurred across the negative response statements of the practice statements and the statements in the inventory, despite explicit modeling and feedback with the practice statements. Additionally, facilitating the pilot of the inventory took more than 30 minutes for only two of the three sub-measures.
Concerned about participant testing fatigue and reliability of responses, we removed the overall schoolwork sub measure and we removed two questions each from the reading and mathematics sub-measures (the same questions for both content areas): one indicating negative intrinsic motivation and one indicating positive intrinsic motivation. Also, due to time constraints and researcher availability, we chose to implement the Young Children's Academic Intrinsic Motivation Inventory (Gottfried, 1990) in small groups of three, rather than one-on-one. Additionally, as our question of interest was whether the use of Native Numbers (Native Brain, 2014) increased intrinsic motivation, we changed the timing of the pre-test for each group such that we administered the inventory immediately before using Native Numbers (i.e., the pre-test assessment for the wait-control group was later than the first-treatment group, occurring just prior to using Native Numbers).

Treatment
Dias (2016) reported participants in both groups (i.e., treatment and control) worked for up to 30 minutes and were allowed to quit at any time; the range of hours participants used Native Numbers (Native Brain, 2014) was from 5-12, for one to three days per week. Like the Dias (2016) study, in the current study, the number of days per week participants used Native Numbers varied and averaged to three days per week. Participants in the current study were also permitted to quit working at any time, although we encouraged participants to work for at least 10 minutes and announced when 10 minutes had elapsed. Each day we were at the site, the classroom teachers provided us names in advance for the treatment participants that they needed to work with for non-mathematics instruction (e.g., writing, reading, or group projects requiring intact groups of students). Due to the dynamic structure of this activity time, tracking the exact time participants entered or exited the room where Native Numbers' instruction occurred was difficult. Therefore, we recorded only whether a participant logged on to Native Numbers or not. The average time available for working was up to about 30 minutes, similar to that of Dias.
Also similar to Dias (2016), we designed the current study such that as each participant completed all 25 sets of activities in Native Numbers (Native Brain, 2014) to the 5 th level, we administered the post-tests individually, after which the participant transitioned to a business-as-usual control group status. Although we designed the study with a goal of a clear line of separation between when the wait-control group would switch to using Native Numbers, we explicitly planned for, and included in the IRB proposal, the possibility that some students in the first-treatment group would not reach the 5 th level on all 25 activities before the wait-control group entered into the treatment phase. We monitored the progress of the first treatment group and determined when to enter the wait-control group based on the number of days remaining in the academic year; that is, a point at which we could reasonably expect that all participants would have the same minimum number of days available in the treatment phase. As the study progressed, we set the minimum number of days at 22 (i.e., both groups had the opportunity to use Native Numbers for at least 22 days). The study's total elapsed time was 18 weeks, including holidays, field trips, professional development days, pre-testing, and post-testing.
One week before the minimum 22 days possible for the wait-control group to use Native Numbers (Native Brain, 2014) we assessed the wait-control group participants on motivation and facilitated the second assessment of the Number Sense Screener™ (Jordan et al., 2012) and the Preschool Assessment of Mathematical Language (Purpura & Logan, 2015). After completing the three assessments, similar to the first treatment group's entry, participants in the wait-control group transitioned to the treatment phase over three days. As each participant in the wait-control group reached the 5 th level on all 25 activities of Native Numbers, we administered the numeracy, mathematical language, and intrinsic motivation post-tests; then, these participants returned to business-as-usual center activities. Two weeks before the end of the school year, we began administering maintenance assessments of numeracy and mathematical language to the first treatment group participants who had completed Native Numbers a minimum of four weeks earlier. One week before the end of the school year, we administered the numeracy, mathematical language, and intrinsic motivation post-tests to the wait-control participants who had not yet finished all 25 activities (n = 6).

Fidelity
All researchers received training on administering and coding the assessments prior to the start of the study. We randomly selected 33% of the assessments for each of the three testing time points and coded fidelity of the researchers' implementation of the assessments using checklists based on the measures' scripted instructions. Fidelity was 99.8% for the numeracy and mathematical language assessments, and fidelity for the intrinsic motivation was 99.3%, with the modifications described earlier. Additionally, we double coded all assessment data during grading and when coding into the Excel spreadsheet, ensuring a research team member who did not facilitate the assessment conducted the second coding. We verified participant usage of Native Numbers (Native Brain, 2014) by cross-checking the attendance records provided by the teachers and by data collected on the teacher dashboard associated with Native Numbers.
The iPads belonged to the school site. Each participant was assigned an iPad for use throughout the day. To minimize the risk that a student could log on to Native Numbers (Native Brain, 2014) while one of the researchers was not present, we took daily screenshots of the dashboards to verify participants had not logged on. To further minimize the risk that participants might use Native Numbers when it was not their assigned treatment phase, we installed Native Numbers only on the iPads of the participants currently assigned to use Native Numbers, and we removed Native Numbers from each iPad after a participant completed all 25 activities to the 5 th level. According to the dashboards, none of the participants used Native Numbers outside of the time set for the study. Furthermore, results from a survey sent home with the consent forms indicated that none of the participants had used Native Numbers prior to starting the current study.

Analyses
Analyses included raw scores for all measures and, with the exception of a robust test of imputed data described later, were conducted using SPSS for Mac, Version 26. Syntax for all analyses are provided in Appendices E-G of the Supplementary Materials. Additionally, before describing the primary analyses, in the following section, we described the methods used to examine the scores of 13 participants, seven from the first-treatment group and six from the wait-control group, whose use of Native Numbers (Native Brain, 2014) varied compared to the other participants.
Preliminary Analysis: Difference in Days of Use -Seven participants from the first-treatment group had not completed all 25 activities to the 5 th level of Native Numbers (Native Brain, 2014) when it was time for the wait-control group to enter the treatment phase. Therefore, these participants continued to work, potentially introducing bias into the analysis. Importantly, the threat of bias was due to the difference in the number of days afforded these seven participants, not due to contamination. That is, the participants in the first-treatment group were not in both conditions at the same time.
Visual inspection of the dashboard data indicated that six participants in the wait-control group either did not complete all 25 activities (n = 2) or did not reach the 5 th level on one or more activities (n = 4). An examination of attendance records indicated that for 11 of these 13 participants, the discrepancies in the number of days using Native Numbers (Native Brain, 2014) were due to absences, participating in reading or writing assessments during the period offered to use Native Numbers, or work involving projects requiring intact groups (e.g., end-of-the-year presentations).
To assess the potential impact the additional elapsed time the seven first treatment group participants had to use Native Numbers (Native Brain, 2014) compared to the six wait-control group participants, we first inspected each participants' total number of days of use (see Table 1). Overall, the data suggested that the number of days of use was not a driving factor of changes from the pre-test to the post-test of numeracy. For example, only two of the seven participants in the first treatment group worked more than the 22 days afforded to both groups. Notably, one participant in the first-treatment group who worked 28 days, and two wait-control group participants who worked only 15 days, had the same pre-test scores and almost identical post-test scores; the two wait-control group participants' post-test scores were one point higher than the participant in the first-treatment group. To further analyze the extent to which these 13 participants varied by outcome scores, we conducted an independent sample t-test. The results indicated the means of the numeracy outcome scores for the seven participants in the first treatment group (M = 25.00, SD = 2.52) compared to the six in the wait-control group (M = 24.67, SD = 2.25), were not significantly different, t(1, 11) = 0.250, p = .81, 95% CI [-2.605, 3.272]. Given the lack of significant difference in outcome scores, we did not consider that the 13 participants presented a risk of bias based on the number of days they used Native Numbers (Native Brain, 2014).
Preliminary Analysis: Missing Data -Although the seven participants in the first treatment group did not appear to introduce bias, they were missing the third assessment, the maintenance assessment, because they did not have a minimum of four weeks remaining in the year to examine retention of academic gains. We considered two options for analyzing repeated measures for within-and between-group analyses: repeated measures mixed ANOVA (RMM ANOVA) and linear mixed model (LMM) regression (Verma, 2015). LMM's are robust to unbalanced designs containing missing data (e.g., Muth et al., 2016), whereas RMM ANOVAs are sensitive to unbalanced groups and SPSS will list-wise delete all data from subjects missing even one data point. On the other hand, while statisticians have proposed models for reporting significant effects for LMMs, not all statisticians agree on the procedures (e.g., Meteyard & Davies, 2020;Rights & Sterba, 2020). In keeping with Meteyard and Davies (2020) recommendation for reporting why a particular analysis was chosen over another, we chose to replace the missing data and run the analyses with a RMM ANOVA because (a) the number of participants in our study was small, thus risking non-convergence for all the possible decisions of random effects in a LMM, and (b) the point of the analyses was not to fit a best model for our data, per se, but to compare the differences in the magnitude of effects between the current study and that of the Dias (2016) study. We also considered two different methods of replacing the missing data of the seven participants in the first treat ment group's maintenance scores: multiple imputation and the Last Observation Carried Forward (LOCF). Generally, statisticians highly discourage using the LOCF; however, for the current study, we considered the LOCF the most parsimonious and valid procedure (e.g., Overall et al., 2009). First, the missing data were the maintenance scores for the seven first treatment group participants who did not have a full month remaining post-treatment, not data from their pre-test or post-test scores. Second, we did not need the maintenance scores to compare differences in the effect of treatment between the first group and the wait-control group after each had used Native Numbers (Native Brain, 2014). Last, we did not need the maintenance scores to compare the magnitude of differences in the effect between the treatment and control group in the Dias (2016) study to the effect between the first treatment group and the wait-control group in the current study prior to the wait-control group receiving treatment. We needed the missing data to (a) examine the first-treatment group's maintenance of gains after a minimum of one month and (b) to allow SPSS to run a RMM ANOVA without deleting the seven participants entirely.
Nonetheless, given the unfavorable view of statisticians to use the LOCF, as a robust test we ran five imputations and examined differences in the means and pooled means of the imputed data to the mean of the LOCF (see Appendix C in the Supplementary Materials for descriptive statistics). Although across the five imputations the means of the imputed data were slightly higher than the mean of the LOCF, the difference in pooled means were not significant: t(40) = 1.634, p = .110. The degree of freedom of the pooled result was corrected using an approximation suggested by van Ginkel and Kroonenberg (2014). Based on the results of the robust test and the justifications provide above, we retained the LOCF as the means for replacing the missing data on the third measure of the seven first treatment group participants when conducting the RMM ANOVA. Additionally, we removed these seven participants when examining the first treatment group's maintenance gains.
Preliminary Analysis: Wait-Control Group Pre-Test Scores -Due to the length of time from the first numeracy assessment to the second assessment, approximately six weeks, we used the second testing point as the wait-control group's pre-test score and considered the first assessment score as their baseline. Further justification for using the second assessment point as the wait-control group's pre-test was that results from a paired sample t-test indicated that the wait-control group had growth during the time between the first testing point and the second testing point, although the growth was not statistically significant: t(1, 21) = -1.47, p = .157, 95% CI [-0.738, 0.119]. Nonetheless, rather than taking an average of the wait-control group's first assessment and their second assessment, we considered their second test to be the most accurate reflection of their growth and used it as their pre-test.
Primary Analysis -We analyzed within-and between-group differences of the numeracy scores, across the three testing points, via a two-way RMM ANOVA. Two-way RMM ANOVAs are suitable for designs with two independent groups when looking at both between-group and within-group differences on a dependent variable measured repeatedly across both groups (Verma, 2015). We also conducted two separate t-tests to (a) examine the growth of the wait-control group after they used Native Numbers (Native Brain, 2014) and (b) to determine if the first-treatment group maintained any potential gains a minimum of one month after their post-test.
Initial descriptive statistics examining Q-Q Plots, Boxplots, and the Shapiro-Wilk test indicated the scores of the Preschool Assessment of Mathematical Language (Purpura & Logan, 2015) were not evenly distributed. At the pre-test, approximately 80% of the participants scored at or close to ceiling. Therefore, we conducted separate non-parametric tests for within-and between-group differences using Mann-Whitney U Tests for between-group analysis (e.g., Nachar, 2008) and Wilcoxon Sign tests (e.g., Harris & Hardin, 2013) to analyze within-group differences.
Similar to the Preschool Assessment of Mathematical Language (Purpura & Logan, 2015), the initial descriptive statistics of the Young Children's Academic Intrinsic Motivation Inventory (Gottfried, 1990) indicated a violation of assumptions of normal distribution for both reading and mathematics intrinsic motivation. Therefore, we conducted separate non-parametric tests for within-and between-group differences via a Mann-Whitney U Test for differences between groups and a Wilcoxon Sign test for differences within groups.

Results of the Current Study Compared to Dias (2016)
We included the same numeracy outcome measure and the same intrinsic motivation measure as Dias (2016): The Num ber Sense Screener™ (Jordan et al., 2012) and the Young Children's Academic Intrinsic Motivation Inventory (Gottfried, 1990). Based on the significant modifications we made when implementing the intrinsic motivation inventory, we did not compare the results from the current study to those reported by Dias (2016). Furthermore, the results from both the mathematical language assessment and the intrinsic motivation inventory indicated insignificant differences between the first-treatment group and the wait-control group from pre-test to post-test scores; therefore, we presented these findings first.

Intrinsic Motivation Results
We ran two Mann-Whitney U tests to determine if there were differences between the groups for intrinsic motivation for reading and intrinsic motivation for math. Our visual inspection indicated that although the distributions of scores for the two groups were not similar, differences between the groups' levels of intrinsic motivation for reading were not significantly different at the pre-test (U = 291.50, z = 0.607, p = .544), nor the post-test (U = 262.50, z = -0.033, p = .974). Similarly, between group differences for intrinsic motivation for math were not significantly different at the pre-test (U = 231.50, z = -7.20, p = .471) nor the post-test (U = 233.50, z = -0.691, p = .490).
We ran a series of Wilcoxon Signed Rank Tests to determine if there were significant changes from the pre-test to the post-test for reading motivation and for mathematics motivation within each group. For the first-treatment group, the median difference from the pre-test to the post-test for reading motivation was not significant (z = 1.513, p = .130); however, there was a statistically significant median increase in mathematics motivation from the pre-test to the post-test for mathematics motivation (z = 1.990, p = .047). For the wait-control group, the median differences from the pre-test to the post-test of reading motivation (z = 0.526 p = .599) and mathematics motivation (z = 1.709, p = .087) were not significant. Appendix D in the Supplementary Materials includes a table of the means, medians, and standard deviations of the intrinsic motivation analyses.

Mathematical Language Results
Initial descriptive statistics of the Preschool Assessment of Mathematical Language (Purpura & Logan, 2015) indicated that almost all the participants across both groups were at ceiling on the pre-test. Mann-Whitney U tests confirmed the median difference between the groups was not significant for either the first test, the second test, or the third test (U = 186.00, z = -1.834, p = .067; U = 281.50, z = 0.437, p = .662, U = 183.50, z = -0.119, p = .922, respectively). Similarly, the Wilcoxon Signed Rank tests also indicated that within-group differences between the first testing period and the second testing period were not significant for the first-treatment group (z = 0.000, p = 1.00. However, changes from the first test to the second test were statistically significant for the wait-control group (z = 2.289, p = .022); that is, changes from their baseline to their pre-test were significant, although the median change was less than one point (e.g., 0.59). Wilcoxon Signed Rank tests also indicated that within-group differences between the second testing period and the third testing period were not significant for the first-treatment group (z = 0.333, p = .739) or the wait-control group (z = 0.791, p = .429). Appendix D of the Supplementary Materials contains a table of the means, medians, and standard deviations of the language analyses.

Numeracy Results
Results of the RMM ANOVA indicated that the mean difference (0.723) of numeracy scores between the groups was not significant (F 1,44 = 0.918, p = .343, 95% CI [-0.798, 2.245]); however, there was a significant interaction between Group and Testing Periods (F 2, 88 = 6.432, p = .002). These results are in line with a wait-control design in which all participants receive the instruction, but at different points in time. That is, if both groups received treatment, collapsed across the testing periods, the between-group differences should not be different if the treatment was effective for both groups.
Follow up univariate analyses for each of the testing periods indicated the mean difference (0.193) between the first-treatment group's numeracy scores and the wait-control group's scores was not significant on the pre-test/baseline (F 1, 44 = 0.033, p = .857, 95% CI [-1.950, 2.336]). Similarly, the mean difference (1.176) on the third test, after removing the seven participants who did not have the third test, was not significant (F 1, 37 = 2.847, p = .100, 95% CI [-0.236, 2.589]). However, the mean difference (2.36) of the numeracy scores between the groups was significantly different on the second test, the post-test for the first-treatment group (F 1, 44 = 7.603, p = .008, η p 2 = 0.147, 95% CI [0.636, 4.091]; see Figure 1).

Scores of the Number Sense Screener™ by Group Across Time
Note. Participants were tested three times using the Number Sense Screener™ (Jordan et al., 2012). The first test (NSS1) served as the first-treatment group's pre-test and the wait-control group's baseline. The second test (NSS2) was the first-treatment group's post-test and the wait-control group's pre-test. The third test (NSS3) was the wait-control group's post-test and a maintenance test for the first-treatment group.
To determine the effect of using Native Numbers (Native Brain, 2014) for the wait-control group, we conducted a paired sample t-test from the second testing point to the third testing point, their pre-test to post-test period. Results indicated the use of Native Numbers elicited a mean increase of 2.364 (SE = 0.339) in the wait-control group's numeracy scores, t(6.973), p < .001, d = 1.48, 95% CI [1.659, 3.609]). Thus, after using Native Numbers both groups significantly increased their numeracy scores, as measured by the Number Sense Screener™ (Jordan et al., 2012).
Removing the seven participants who did not have a maintenance test, we conducted a t-test to examine if the first treatment group maintained their previous gains. The results indicated that the first-treatment group maintained their growth, and slightly increased their numeracy scores (M = 0.765, SE = 0.689), although the changes were not statistically significant, t(16) = 1.110, p = .283, 95% CI [0.696, 2.225]. See Table 2 for the descriptive statistics for the current study and Table 3 for the Dias (2016) study. This number reflects removing the seven participants from the first group who did not have the maintenance assessment. Note. NSS = The Number Sense Screener™ (Jordan et al., 2012).

Discussion
The purpose of this conceptual replication study was to examine changes in numeracy achievement, intrinsic motiva tion, and mathematics language when instruction was provided by Native Numbers (Native Brain, 2014), an ITS. As a conceptual replication, we aimed to use the same assessment instruments, grade level, number of classrooms, and participant demographics as the original study (Dias, 2016). For pragmatic reasons, we chose to conduct the study in a geographical location close to the authors. However, we manipulated five distinct aspects of our study: the time of year of the study, the inclusion of a wait-control group, the inclusion of an assessment of mathematical language, the setting within the school where participants used Native Numbers, and who facilitated and monitored the participants' assessments and use of Native Numbers. The last two changes came about through collaborative discussions with the participating teachers prior to the study's start and were included in the IRB proposal as potentially malleable components to change based on the teachers' preferences. Our first research question was whether we could replicate similar effects as Dias (2016) on numeracy outcomes (e.g., η p 2 = 0.622) given our changes in design. While we did not replicate the same effect size as Dias (2016), the effect size in the current study was moderate to high: η p 2 = 0.147. Within-group gains for the first-treatment group and the wait-control group were significant after receiving instruction via Native Numbers (Native Brain, 2014); and the first group maintained their achievement for at least one month.
From a design perspective, we matched participant demographics as closely as possible regarding the type of school (i.e., private) and socio-economic status. One plausible explanation for differences in effect sizes is that the studies occurred at different times of the year: fall versus spring. Initial pre-treatment scores differed between the Dias (2106) groups and our groups. The post-test scores for the Dias treatment group were close to the pre-test scores in our sample: M = 22.04 for Dias compared to M = 22.72 for the entire sample in the current study. Even without treatment, the participants in our entire sample had close to the same scores in the middle of February as the Dias treatment group had post-treatment in the fall. We inferred that kindergarten participants in our sample eventually had the same scores as the Dias' participants had in the fall; however, the participants in our study did not reach this level of achievement until well past the middle of the school year. We did not have the beginning of the year data for the current sample as a direct comparison to the Dias pre-test scores for the same time period, nor end of the year data for the Dias' participants; however, one implication is higher numeracy earlier in the year may allow for additional growth in numeracy for the remainder of the year.
Two additional changes in the current study from the Dias (2016) study that may explain differences in the effect sizes were who facilitated the supplemental numeracy activities and where the participants engaged in these supple mental activities. Specifically, having more students working on Native Numbers (Native Brain, 2014) in a computer lab-type setting, as in the current study, leaving a lower teacher-to-student ratio to work with the wait-control group, could have provided the wait-control group an advantage over the control group in the Dias study, as well as over the first-treatment group in the current study. The regular teacher-to-student ratio in our sample was 1:9 as opposed to the 1:19 ratio in the Dias study. With rare exception, only one of the researchers in the current study monitored all the treatment participants who were gathered in one room. Thus, the researcher-to-student ratio average was 1:27 for the treatment group, but the teacher-to-student ratio was 1:5 for the wait-control group. The teacher-to-student ratio for the Dias study participants remained 1:19 as the classroom teachers monitored both the treatment and the control groups at the same time. Despite the differences in the adult-to-participant ratios, both groups in the current study significantly increased their numeracy scores after using Native Numbers. Therefore, having more students together in a larger room did not appear to negatively impact the first treatment group's outcome scores compared to the wait-control group. Regardless, whether the differences in the magnitude of effect sizes were due to differences in the adult-to-student ratios is unknown.
Our second research question was whether we could replicate similar intrinsic motivation findings like those in the Dias (2016) study. Results of the current study indicated that the use of Native Numbers (Native Brain, 2014) did not significantly increase between-group differences for intrinsic motivation for mathematics or reading. The only significant change in motivation was for the first treatment group for mathematics. Given the small sample size and lack of between-group differences, we do not have an explanation for why the first treatment group increased their intrinsic motivation for mathematics and the wait-control group did not.
One thing to note is that, as an indicator of being intrinsically motivated to read or to engage in mathematics, the highest score possible was 33. A score of 22 would indicate that a participant was neither intrinsically motivated nor unmotivated to read or do mathematics; an 11 would indicate a participant was generally unmotivated to read or engage in mathematics. For both groups, the medians for reading and mathematics leaned towards being more intrinsically motivated; for example, the lowest median (24.50) was on the pre-test for mathematics. The significant modifications we made when implementing the motivation inventory and the difficulty the participants had interpreting how to respond to the negative questions may explain this null finding. However, Dias (2016) noted that even though the treatment group's intrinsic motivation scores increased significantly compared to the control group, intrinsic motivation scores did not predict numeracy outcomes.
Our last research question looked at mathematical language changes, a measure not included in the Dias (2016) study. We did not find differences in mathematical language achievement after using Native Numbers (Native Brain, 2014). The null results are partially explained by the fact that nearly all participants were at or near ceiling at the pre test. The Preschool Assessment of Mathematical Language (Purpura & Logan, 2015) was designed for pre-kindergarten, not kindergarten. Based on the results of our analyses, we cannot report an impact on mathematical language after using Native Numbers.

Limitations
The current study had notable limitations preventing generalization to different contexts and demographics. Important ly, both studies included small sample sizes and were conducted in private schools with predominately Caucasian participants. Additionally, unlike the teacher-to-student ratio in the Dias (2016) study (i.e., 1:19), our participants had an average teacher-to-student ratio of 1:9; a ratio unlikely to occur in public school settings in the United States. Furthermore, a researcher monitored the use of Native Numbers (Native Brain, 2014) in the current study, compared the participants' teachers in the Dias (2016) study. While the ratio of adults-to-participants provided a proof of concept that Native Numbers can be utilized in a computer lab type of setting, or with entire classrooms at one time, how the role of a researcher monitoring use of Native Numbers impacted outcomes (i.e., compared to a lab instructor or classroom teacher) is unknown and was not a research question explored in the current study. Importantly, however, the researchers did not provide additional instruction or help other than problem-solving technology issues or providing encouragement for participants to think through a problem when they were stuck.
Next, the mathematical language measure used in the current study was not designed for kindergarten. We caution readers not to use the results from our study to compare the results to other mathematical language studies using the Preschool Assessment of Mathematical Language (Purpura & Logan, 2015) with pre-kindergarten students, the age group for which the assessment was designed. Likewise, the Young Children's Academic Intrinsic Motivation Inventory (Gottfried, 1990) was not designed for kindergarten. The lack of grade-level-appropriate standardized measures for both intrinsic motivation and mathematical language in kindergarten impedes our understanding of these two roles in numerical cognition for kindergarten aged participants.
Additionally, one of the primary limitations preventing direct comparison of the current study to the Dias (2016) study was that all the participants in the Dias study completed all 25 Native Numbers' (Native Brain, 2014) activities to the highest level; the participants in the current study did not. The time of year we started the study limited the number of days available for instruction. Starting earlier in the year may have allowed all participants to finish. However, this may have confounded the variable of interest related to allowing participants in the current study more time attending formal schooling before using Native Numbers. Nonetheless, including extra days in the current study was not possible.
Because of the highly adaptive nature of ITSs, we planned in advance for the possibility that some students would finish quickly, while others would take more time. We considered several different options. For example, before entering the wait-control group into treatment status, we could have administered the post-assessments to the first treatment group participants who had not yet completed all 25 activities to the fifth level and then moved them into the business-as-usual status. Alternatively, we could have kept the participants who had finished and had received post-tests in the same room with the participants who were still working on Native Numbers (Native Brain, 2014). A third option was to shorten the study's duration by setting a firm number of days for each phase. We did not consider any of the above options as being consistent with an ecologically valid design.
First, we were concerned about a potentially demotivating impact for both those who finished early and those still working if we kept the students who had finished together in the same room. If we had kept the treatment participants together, we would have needed an additional activity to keep those who finished early occupied, introducing a confounding variable. Conversely, removing the participants who had not yet finished might have been seen as unfair if they were engaged and wanted to finish. We planned for and allowed participants to choose how long they wanted to work, not only to maintain ecological validity but also to acknowledge and honor the participants' agency.
We also chose not to limit the number of days participants used Native Numbers (Native Brain, 2014) as this would have: (a) prohibited a direct comparison of outcome scores from the current study to those of the Dias (2016) study, and (b) ignored results from other studies that included adaptive software where the authors noted that shorter lengths of time for the interventions might not have been adequate for some participants (e.g., Garduno, 2016;Salminen et al., 2015). At the design phase of this study, we explicitly chose to monitor how the first treatment group participants progressed and then decided when to enter the wait-control group based on the number of days remaining in the year. Importantly, the conceptual changes for this replication were driven by the question of whether the time of year was a potential variable for differences in effect sizes, if effect sizes were replicated. The addition of the wait-control group was an additional measure of validity for any significant outcomes between the first treatment group and the wait-control group but was not required as a replication of the Dias study.
Furthermore, we theorized that if participants in the first treatment group needed additional time, these students could potentially represent a subset of students at risk for MD, and analyzing their data could potentially inform future studies of MD. Results from our study did not indicate risk for MD. The outcome scores for the first treatment group participants who had not finished all activities to the 5 th level by the 22 nd day were not significantly different from the outcome scores of the wait-control group participants who did not finish the 25 activities to the highest level. Except for two participants in the first group who needed additional days, the discrepancies in the number of days available for working and the number of days students actually worked were due to absences, the classroom teachers pulling the students to conduct other classroom assessments, or participants working in groups with teachers for class-wide projects requiring intact groups.
While we monitored the days students used Native Numbers (Native Brain, 2014) and cross-checked the days with the attendance records and daily screen shots, we did not track the number of minutes each student used Native Numbers. This was an intentional decision on our part based partially on the logistics (e.g., the lack of researchers available for notation and IRB requirements for videotaping participants) and partially due to the dynamics of the naturalistic setting: students interacting with their peers showing their progress (e.g., Lim, 2012), the freedom to move about the room or to leave to get a drink of water, and entering the room later or leaving sooner to work with their classroom teachers on other projects. Likewise, we did not ask the developers to extract the number of minutes from the log-file data because we would have needed additional data to verify the number of minutes logged into Native Numbers equated to the minutes the students were actively engaged (e.g., Bond & Bedenlier, 2019;Fincham et al., 2019).
Beyond the limitations related to tracking minutes of active engagement, the range of the difference in the number of days the participants required to reach the highest level across all the activities is an example of the logistical chal lenges of incorporating highly individualized, adaptive instruction (e.g., Bondie et al., 2019), whether by practitioners or researchers. That is, determining how to balance instruction at the group level, while also providing instruction for students who finish well ahead, or behind, their peers, is a problem of practice. On the other hand, highly adaptive software theoretically affords the individualized instruction that students need.

Future Directions for Classroom Use and Research
While the current study contributes to the scant literature of replication studies of numerical cognition, based on the limitations described above, we outlined several possibilities for future research: the timing of and need for additional support while using technology, the generalization of concepts, and specific future numeracy research possibilities using Native Numbers (Native Brain, 2014), other ITS, or other software.
First, the use of Native Numbers (Native Brain, 2014), regardless of the time of year initiated, brought the majority of students' levels of achievement up to the highest percentiles, according to the Number Sense Screener™ (Jordan et al., 2012). Where Dias (2016) reported a between-group difference of η p 2 = 0.622 at the beginning of the school year, our between group-difference in late spring was well above a moderate effect (η p 2 = 0.147). Although none of the participants in our study would have met the definition of MD if using a <30% cut-off score, some participants were close to that cut-score at the pre-test (e.g., the 34 th percentile) and required additional time to complete all the activities. Thinking about who is at risk for MD, and that the National Assessment of Educational Progress (National Center for Education Statistics [NAEP], 2017) report indicated 60% of 4 th grade students in the United States performed at or below proficiency, the use of a cut score below the 30th percentile for determining who needs additional assistance may not capture all students needing support. Students at the edge of any cut score, or performing above but not to proficiency, are not considered at risk for MD, yet may not have a solid conceptual understanding of numeracy. The 60% of students in the United States performing at or below proficiency likely needed additional, or different, instruction well before 4 th grade, even if their performance was not below the 30 th percentile, or other percentiles used for tiered response to intervention support.
Together, the findings from the current study and the Dias (2016) study provided evidence that the use of Native Numbers (Native Brain, 2014) as a supplement to instruction can increase numeracy achievement, even for students performing at or above-average performance; although, the sample sizes of both studies do not support a definite conclusion. One question of interest is how the instruction of Native Numbers, which only includes quantities of 1-9, allowed participants to significantly increase their numeracy scores on the Number Sense Screener™ (Jordan et al., 2012) for items that assessed content not provided by Native Numbers. The Number Sense Screener™ contains 29 items; ten directly aligned with the content and range of quantities instructed by Native Numbers. At the middle of the year pre-test, 40% of the kindergarten participants performed above the 90 th percentile. At the end of the study that number increased to 74% and only one participant fell below the 80 th percentile. Tools such as Native Numbers, which provide intelligent tutoring, may offer what Fuchs, Fuchs, and Compton (2012) described as "secondary prevention" or "booster lessons" (Bryant et al., 2008). Dias (2016) reported changes from the pre-test to the post-test for the treatment group moved participants from the 68 th percentile to 91st percentile on the Number Sense Screener™. Implementing highly individualized technology brought the scores of most participants using Native Numbers, in both studies, up to the highest levels of performance. However, the highly individualized tutoring also introduced challenges related to the number of days each student needed to reach the highest level in Native Numbers. Future studies are warranted to investigate Native Numbers' pedagogy for the potential mechanisms of generalization to concepts not included in Native Numbers' instruction. Additionally, different study designs such as staggered entry or single-subject studies with multiple baselines may afford ways to navigate the logistical challenges of research incorporating adaptive software.
A second research opportunity is to analyze the log-file data produced when using technology within an experi ment (see a suggested framework by Heffernan & Heffernan, 2014). All technology can track interactions, affording fine-grained analyses of performance (e.g., Heffernan & Heffernan, 2014;Larkin & Milford, 2018;Rau et al., 2013). Native Numbers (Native Brain, 2014) has features that are well suited to the types of micro-genetic studies that log-file data can provide such as how individuals vary when acquiring flexible use of different representations of quantities and the relationships between shared underlying numeric structures (e.g., Goldwater & Schalk, 2016;Parviainen, 2019). For example, Native Numbers includes three sets of five activities using the different representations for quantity recognition, ordinality, and relational language. Studies of relational language and ordinality with young participants are underexplored areas of numerical cognition, as are studies using number rods, which are introduced as the first activity (e.g., Coles & Sinclair, 2018;Goffin & Ansari, 2016;Schalk et al., 2016;Morsanyi et al., 2018;Venenciano et al., 2012).
Additionally, Dias (2016) noted participants required between 1000-3000 tasks to reach mastery of the 25 sets of numeracy activities. In our study, that range was approximately 1200-7700. While these numbers may seem incredibly high, at face value, the numbers align to a finding in a recent approximate number system training study with adult participants conducted by Cochrane et al. (2019). Some (expert) adults required more than 20 days of training and over 8000 trails to increase approximate number system performance. Cochran et al. noted the long, slow progression of change for a novel task. In a study with prekindergarten participants using a numeracy app, Broda et al. (2019) noted that when participants completed 400 practice opportunities, their accuracy increased 80%. In another study, although not with technology, Doabler et al. (2019) examined the ratio of teacher instruction-to-student practice opportunities and noted that a ratio of 1:3 significantly increased kindergarten participants' achievement; that is for every one instance of teacher delivered explicit instruction, participants needed three opportunities to practice.
Why would the number of tasks or practice attempts matter? A commonly accepted theory in early numeracy is that acquiring mastery of the counting system takes years (e.g., Geary et al., 2019). However, how many years and how many tasks (i.e., attempts, practices) an individual needs are unknown. Like the literature on mathematical relational vocabulary in kindergarten, the literature on how much practice students need to master numerical skills is virtually si lent. Nonetheless, schools implement pacing guides and standards of practice assuming a normal progression of mastery without a clear understanding of how long a normal progression to mastery takes. McLean and Rusconi (2014) argued, "no empirically-based normative developmental trajectory for mathematics learning has been yet established, and the non-problematic range of variation for age-appropriate levels is currently untested" (p. 1). Educators, policymakers, and researchers would benefit from knowing how many tasks, and therefore, how much time or differentiation is required to reach fluency, as well as how long this information remains in long-term memory. The use of log-file data from software like Native Numbers (Native Brain, 2014) can help us answer questions like those above because of the fine-grained levels of data produced while learning (e.g., R. Baker, 2016; U.S. Department of Education, Office of Educational Technology, 2012).
However, tracking the number of tasks alone is likely insufficient to capture the process of learning, just as tracking the number of minutes the program is running is insufficient to capture learning. Despite the significant gains in the current study, not all participants had the same amount of growth. As discussed previously, we did not examine possible reasons why some participants did not finish all 25 activities to the highest level within the 22-day timeframe. Research on how kindergarten students, both at risk and not at risk for MD, present with different degrees of difficulty is necessary to help researchers and teachers understand possible correlations between general domain abilities, self-determination, resilience, and goal orientations; specifically in the domain of mathematics, but also when using technology (e.g., Higgins et al., 2019;Hohol et al., 2017;Keller & Libertus, 2015;Matusz et al., 2019;Peng et al., 2016;White et al., 2017). To understand the process of learning likely requires mixed-methods studies incorporating multiple data sources (e.g., Baccaglini-Frank et al., 2020;Fincham et al., 2019). One option is to use the screen capturing capabilities of iPads while the students are using Native Numbers (Native Brain, 2014), or other software. Audio and video data may provide information such as meta-cognitive self-talk, social dialogue, and potentially different problem solving strategies for the specific numeracy skills designed into Native Numbers. Audio and video data captured by the iPads during use also reduces the need for the number of observers and additional equipment required for an observational study.
Finally, neither our study nor the Dias (2016) study considered how using an alternative software program, whether an ITS or not, may affect differences in outcome scores, persistence, or motivation. Native Numbers (Native Brain, 2014) provides valuable lab space to investigate different types of numerical representation, different instructional methods such as blocked or distributed learning and reversibility problems (e.g., Barzagar Nazari & Ebersbach, 2018), as well as relational language and ordinality. Given the above opportunities to advance knowledge of early numeracy, future studies are warranted using either adapted versions of the current Native Numbers model, incorporating another software into the design of a study, or using another ITS.
Classroom teachers need viable, economically feasible means of providing individualized instruction. However, not all software, or other curricular resources, are designed with substantial pedagogy, and the role of the teacher is critical to the success of any curriculum, supplemental or otherwise (e.g., Nye, 2014;Wen et al., 2019). The Dias (2016) study was an initial confirmation of Native Numbers' (Native Brain, 2014) effectiveness, and before embarking on the type of research described above, we sought to replicate effectiveness. Although our effect size was not as substantial as that of the Dias study, both studies' effect sizes were moderate to extremely high for a small homogeneous sample of 73 participants combined. Nonetheless, the Dias study and our conceptual replication provided emerging evidence that using Native Numbers for supplemental instruction holds promise as a useful tool, a one-to-one tutor in the room, and as a tool for researching numerical cognition.

Funding:
The authors have no funding to report.

Acknowledgments:
The authors have no additional (i.e., non-financial) support to report.

Competing Interests:
The authors have declared that no competing interests exist.

Data Availability:
A synthetic data set is available online (Grimes et al., 2021a). Submission of actual data was prohibited by the Institutional Review Board approval for this study.

Supplementary Materials
The Supplementary Materials contain the following items (for access see Index of Supplementary Materials below): • A deidentified synthetic data set in Excel format • The following five appendices: 1. Objectives and sequence of instruction in Native Numbers 2. List of activites the active control group completed during math centers 3. Descriptive statistics and syntax of analyses of multiple imputation of missing data 4. Descriptive statistics for motivation and mathematical language measures 5. Syntax of analyses for the differences in performane of the 13 particpants who did not complete all 25 activities to the highest level and syntax for the number sense screener