Intelligent Tutoring Systems are a genre of highly adaptive software providing individualized instruction. The current study was a conceptual replication of a previous randomized controlled trial that incorporated the intelligent tutoring system Native Numbers, a program designed for early numeracy instruction. As a conceptual replication, we kept the method of instruction, the demographics, the number of kindergarten classrooms (n = 3), and the same numeracy and intrinsic motivation screeners as the original study. We changed the time of year of instruction, changed the control group to a wait-control group, added a maintenance assessment for the first group of participants, and included a mathematical language assessment. Analysis of within- and between-group differences using repeated measures ANOVA indicated that gains in numeracy were significant only after using Native Numbers (partial eta squared = 0.147). Results for intrinsic motivation and mathematical language were not significant. The effect size for numeracy achievement did not reach that of the original study (partial eta squared = 0.622). Here, we compared the two studies, discussed plausible reasons for differences in the magnitude of effect sizes, and provided suggestions for future research.

Results from syntheses and meta-analyses of research on difficulties with mathematics indicate that many students who struggle with mathematics at the end of kindergarten continue to struggle throughout their schooling (e.g.,

In the United States, the National Association for the Education of Young Children (

The architectural designs of educational software include, for example, models wherein a user works through a series of items in a linear fashion, models that allow users to adjust instruction based on preference, and models that adapt instruction (

An intelligent tutoring system (ITS) is a distinctly different type of adaptive instructional software designed with complex models that embed individualized instruction in tandem with dynamic assessment (

“computes inferences from student responses, constructs either a persistent multidimensional model of the student’s psychological states (such as subject matter knowledge, learning strategies, motivations, or emotions) or locates the student’s current psychological state in a multidimensional domain model...” and “uses the student modeling functions…to adapt one or more of the tutoring functions” (p. 902).

Critically, the student model is the software design element that differentiates an ITS from other educational instructional software (

One potential advantage of utilizing an ITS for instruction is that elements required within the software’s architecture inherently include practices some researchers have recognized as evidence-based: explicit instruction, adapting instruction to individual needs based on assessment, and providing feedback (e.g.,

Despite the potential benefits of using an ITS for mathematics instruction, none of the studies included in the published meta-analyses and syntheses of ITS (e.g.,

The location of the

Based on the need for early numeracy instruction to prevent MD (

The purpose of the current study was to conduct a conceptual replication of the

implementing the treatment in the spring

providing the same treatment to a wait-control group

assessing maintenance of academic gains over time

assessing mathematical language

The fourth change to the

With the changes to the

How do gains in students’ numeracy compare to the original study?

How do the outcomes of intrinsic motivation compare to the original study?

Did students’ outcomes on a mathematics language screener change as a result of using Native Numbers (

In the United States, children typically attend kindergarten at around age 5. However, kindergarten is not mandatory across all states. As previously described, the

Participants in the current study (

Although treatment for both studies occurred during regular center-based activity time, prevalent in kindergarten classrooms in the United States, the grouping of students into classrooms was distinctly different in the current study. Specifically, rather than having three separate kindergarten

Based on the dynamics of the center-based instruction time, the building’s design, and the classroom teachers’ preference, treatment in the current study occurred in one of the larger multi-purpose rooms, rather than in individual classrooms as in the

Similar to ^{st} through 3^{rd} grade as a downward progression of the Children’s Academic Intrinsic Motivation Inventory (

Similar to ^{st} grade and ^{rd} grade (^{st} grade WJ composite scores correlated (^{st} grade WJ composite scores (^{rd} grade WJ composite scores (

To our knowledge, a validated stand-alone kindergarten assessment of mathematical language in English does not exist. Therefore, we chose the Preschool Assessment of Mathematical Language (

In both the

For both the

Native Numbers (

Native Numbers (^{®}. Users learn to map between quantities for each of the different representations through blocked practice and interleaved practice. Also, as part of the Demonstrate Mastery activities, some activities are designed with reversibility (e.g.,

Participants in the

The current study was approved by the University of Texas at Austin Institutional Review Board (IRB). Teachers and parents/guardians provided written consent, and students provided verbal and written assent. Prior to beginning the assessments, we used an online random number generator to randomize the participants by their primary groups into the first-treatment group or the wait-control group. For ethical and pragmatic reasons, all students had the opportunity to use Native Numbers (
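As an illustration only, group-level random assignment of the kind described above can be sketched in a few lines. The study used an online random number generator; this sketch substitutes Python's `random` module, and the function name, group labels, and seed are hypothetical.

```python
import random


def assign_groups(primary_groups, seed=None):
    """Randomly split primary groups between the two study conditions."""
    rng = random.Random(seed)          # seeded so the allocation is reproducible
    shuffled = list(primary_groups)    # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = (len(shuffled) + 1) // 2    # first condition gets the extra group if odd
    return {
        "first_treatment": sorted(shuffled[:half]),
        "wait_control": sorted(shuffled[half:]),
    }


# Hypothetical primary-group labels; the real labels are not given in the text.
allocation = assign_groups(["A", "B", "C", "D"], seed=2024)
```

Recording the seed alongside the allocation makes the assignment auditable, which an online generator does not provide by default.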

In the

Assessment of intrinsic motivation via the Young Children’s Academic Intrinsic Motivation Inventory (

Concerned about participant testing fatigue and reliability of responses, we removed the overall schoolwork sub-measure and we removed two questions each from the reading and mathematics sub-measures (the same questions for both content areas): one indicating negative intrinsic motivation and one indicating positive intrinsic motivation. Also, due to time constraints and researcher availability, we chose to implement the Young Children’s Academic Intrinsic Motivation Inventory (

Also similar to ^{th} level, we administered the post-tests individually, after which the participant transitioned to a business-as-usual control group status. Although we designed the study with a goal of a clear line of separation between when the wait-control group would switch to using Native Numbers, we explicitly planned for, and included in the IRB proposal, the possibility that some students in the first-treatment group would not reach the 5^{th} level on all 25 activities before the wait-control group entered into the treatment phase. We monitored the progress of the first treatment group and determined when to enter the wait-control group based on the number of days remaining in the academic year; that is, a point at which we could reasonably expect that all participants would have the same minimum number of days available in the treatment phase. As the study progressed, we set the minimum number of days at 22 (i.e., both groups had the opportunity to use Native Numbers for at least 22 days). The study’s total elapsed time was 18 weeks, including holidays, field trips, professional development days, pre-testing, and post-testing.

One week before the minimum 22 days possible for the wait-control group to use Native Numbers (^{th} level on all 25 activities of Native Numbers, we administered the numeracy, mathematical language, and intrinsic motivation post-tests; then, these participants returned to business-as-usual center activities. Two weeks before the end of the school year, we began administering maintenance assessments of numeracy and mathematical language to the first treatment group participants who had completed Native Numbers a minimum of four weeks earlier. One week before the end of the school year, we administered the numeracy, mathematical language, and intrinsic motivation post-tests to the wait-control participants who had not yet finished all 25 activities (

All researchers received training on administering and coding the assessments prior to the start of the study. We randomly selected 33% of the assessments for each of the three testing time points and coded fidelity of the researchers’ implementation of the assessments using checklists based on the measures’ scripted instructions. Fidelity was 99.8% for the numeracy and mathematical language assessments and 99.3% for the intrinsic motivation assessment, with the modifications described earlier. Additionally, we double coded all assessment data during grading and when entering the data into the Excel spreadsheet, ensuring that a research team member who did not facilitate the assessment conducted the second coding. We verified participant usage of Native Numbers (

The iPads belonged to the school site. Each participant was assigned an iPad for use throughout the day. To minimize the risk that a student could log on to Native Numbers (^{th} level. According to the dashboards, none of the participants used Native Numbers outside of the time set for the study. Furthermore, results from a survey sent home with the consent forms indicated that none of the participants had used Native Numbers prior to starting the current study.

Analyses included raw scores for all measures and, with the exception of a robustness check of imputed data described later, were conducted using SPSS for Mac, Version 26. Syntax for all analyses is provided in Appendices E–G of the

Seven participants from the first-treatment group had not completed all 25 activities to the 5^{th} level of Native Numbers (

Visual inspection of the dashboard data indicated that six participants in the wait-control group either did not complete all 25 activities (^{th} level on one or more activities (

To assess the potential impact of the additional elapsed time that the seven first-treatment group participants had to use Native Numbers (

Group: Student | NSS^{a} Pre-Test Score | NSS Post-Test Score | NSS Change | Days at 22^{a} | Total Days | Days > 22^{b} |
---|---|---|---|---|---|---|
First: 1 | 19 | 22 | 3 | 17 | 28 | 6 |
First: 2 | 20 | 26 | 6 | 19 | 26 | 4 |
First: 3 | 17 | 23 | 6 | 21 | 22 | 0 |
First: 4 | 20 | 27 | 7 | 19 | 22 | 0 |
First: 5 | 18 | 29 | 11 | 18 | 21 | 0 |
First: 6 | 14 | 25 | 11 | 14 | 20 | 0 |
First: 7 | 22 | 23 | 1 | 12 | 17 | 0 |
Wait: 1 | 25 | 27 | 2 | 17 | 17 | 0 |
Wait: 2 | 18 | 22 | 4 | 16 | 16 | 0 |
Wait: 3 | 25 | 26 | 1 | 15 | 15 | 0 |
Wait: 4 | 19 | 23 | 4 | 15 | 15 | 0 |
Wait: 5 | 19 | 23 | 4 | 15 | 15 | 0 |
Wait: 6 | 22 | 27 | 5 | 11 | 11 | 0 |

^{th} level by the 22^{nd} day; NSS = The Number Sense Screener™ (

^{a}Days worked up through the 22^{nd} day available.

^{b}Days worked above 22.
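The two derived columns in the table above can be checked directly from the transcribed values: NSS Change is the post-test score minus the pre-test score, and Days > 22 is the number of total days beyond the 22-day minimum. The sketch below is only a consistency check on the table; the per-group mean change it computes is a descriptive illustration, not a statistic reported in the study, and all variable names are hypothetical.

```python
from statistics import mean

MIN_DAYS = 22  # minimum number of treatment days available to both groups

# (group, student, NSS pre-test, NSS post-test, total days), from the table above
ROWS = [
    ("First", 1, 19, 22, 28), ("First", 2, 20, 26, 26), ("First", 3, 17, 23, 22),
    ("First", 4, 20, 27, 22), ("First", 5, 18, 29, 21), ("First", 6, 14, 25, 20),
    ("First", 7, 22, 23, 17),
    ("Wait", 1, 25, 27, 17), ("Wait", 2, 18, 22, 16), ("Wait", 3, 25, 26, 15),
    ("Wait", 4, 19, 23, 15), ("Wait", 5, 19, 23, 15), ("Wait", 6, 22, 27, 11),
]


def derive(row):
    """Recompute the NSS Change and Days > 22 columns for one participant."""
    group, student, pre, post, total_days = row
    return {
        "group": group,
        "student": student,
        "nss_change": post - pre,
        "days_over_22": max(0, total_days - MIN_DAYS),
    }


derived = [derive(row) for row in ROWS]

# Descriptive mean change per group (illustrative only).
mean_change = {
    group: mean(d["nss_change"] for d in derived if d["group"] == group)
    for group in ("First", "Wait")
}
```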

To further analyze the extent to which these 13 participants varied by outcome scores, we conducted an independent sample

Although the seven participants in the first treatment group did not appear to introduce bias, they were missing the third assessment, the maintenance assessment, because they did not have a minimum of four weeks remaining in the year to examine retention of academic gains. We considered two options for analyzing repeated measures for within- and between-group analyses: repeated measures mixed ANOVA (RMM ANOVA) and linear mixed model (LMM) regression (

We also considered two methods of replacing the missing maintenance scores of the seven participants in the first treatment group: multiple imputation and the Last Observation Carried Forward (LOCF). Generally, statisticians strongly discourage using the LOCF; however, for the current study, we considered the LOCF the most parsimonious and valid procedure (e.g.,
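For readers unfamiliar with the LOCF, the following minimal sketch shows the general technique: a missing score is replaced with the most recent observed score for the same participant. It is a generic illustration rather than the SPSS procedure used in the study, and the function name and example scores are hypothetical.

```python
def locf(scores):
    """Fill missing (None) entries with the most recent observed value.

    Leading missing values stay None because nothing precedes them.
    """
    filled = []
    last_observed = None
    for score in scores:
        if score is not None:
            last_observed = score
        filled.append(last_observed)
    return filled


# Hypothetical participant: pre-test, post-test, and a missing maintenance score.
example = locf([19, 22, None])
```

Here the post-test score stands in for the missing maintenance score, which amounts to assuming no change after the post-test.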

Nonetheless, given statisticians’ unfavorable view of the LOCF, as a robustness check we ran five imputations and compared the means and pooled means of the imputed data with the mean of the LOCF (see Appendix C in the

Due to the length of time from the first numeracy assessment to the second assessment, approximately six weeks, we used the second testing point as the wait-control group’s pre-test score and considered the first assessment score as their baseline. Further justification for using the second assessment point as the wait-control group’s pre-test was that results from a paired sample

We analyzed within- and between-group differences of the numeracy scores, across the three testing points, via a two-way RMM ANOVA. Two-way RMM ANOVAs are suitable for designs with two independent groups when looking at both between-group and within-group differences on a dependent variable measured repeatedly across both groups (

Initial descriptive statistics examining Q-Q Plots, Boxplots, and the Shapiro-Wilk test indicated the scores of the Preschool Assessment of Mathematical Language (

Similar to the Preschool Assessment of Mathematical Language (

We included the same numeracy outcome measure and the same intrinsic motivation measure as

We ran two Mann-Whitney U tests to determine whether the groups differed in intrinsic motivation for reading and intrinsic motivation for mathematics. Visual inspection indicated that the distributions of scores for the two groups were not similar; nonetheless, the groups’ levels of intrinsic motivation for reading did not differ significantly at the pre-test (

We ran a series of Wilcoxon Signed Rank Tests to determine if there were significant changes from the pre-test to the post-test for reading motivation and for mathematics motivation within each group. For the first-treatment group, the median difference from the pre-test to the post-test for reading motivation was not significant (

Initial descriptive statistics of the Preschool Assessment of Mathematical Language (

Results of the RMM ANOVA indicated that the mean difference (0.723) of numeracy scores between the groups was not significant (F_{1, 44} = 0.918, F_{2, 88} = 6.432,

Follow-up univariate analyses for each of the testing periods indicated that the mean difference (0.193) between the first-treatment group’s numeracy scores and the wait-control group’s scores was not significant on the pre-test/baseline (F_{1, 44} = 0.033, F_{1, 37} = 2.847, F_{1, 44} = 7.603,
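Partial eta squared can be recovered from an F ratio and its degrees of freedom with the standard identity η²p = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the last statistic quoted above (7.603 with 1 and 44 degrees of freedom), assuming the subscripted values are F ratios. The result is about 0.147, which matches the effect size given in the abstract; reading the two as the same effect is an inference, not something the text states. The function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Convert an F ratio and its degrees of freedom to partial eta squared."""
    effect_term = f_value * df_effect
    return effect_term / (effect_term + df_error)


# Last univariate statistic quoted above, assumed to be F(1, 44) = 7.603.
post_test_effect = partial_eta_squared(7.603, 1, 44)
```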

To determine the effect of using Native Numbers (

Removing the seven participants who did not have a maintenance test, we conducted a

Group / Test | n | M | SD | SE |
---|---|---|---|---|
First Group Pre-Test | 24 | 22.63 | 3.62 | 0.74 |
Wait-Control Baseline | 22 | 22.82 | 3.58 | 0.76 |
First Group Post-Test | 24 | 26.00 | 2.36 | 0.48 |
Wait-Control Pre-Test | 22 | 23.64 | 3.40 | 0.73 |
First Group Maintenance | 17^{a} | 27.18^{a} | 1.94 | 0.47 |
Wait-Control Post-Test | 22 | 26.00 | 2.31 | 0.49 |

^{a}This number reflects removing the seven participants from the first group who did not have the maintenance assessment.
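As a consistency check on the table above, the last numeric column can be compared with the standard error of the mean, SE = SD / √n. This sketch assumes the four numeric columns are, in order, sample size, mean, standard deviation, and standard error; the variable names are hypothetical. Because the tabled standard deviations are rounded to two decimals, small discrepancies in the second decimal place are expected.

```python
import math

# (label, n, mean, SD, reported last column), transcribed from the table above
ROWS = [
    ("First Group Pre-Test", 24, 22.63, 3.62, 0.74),
    ("Wait-Control Baseline", 22, 22.82, 3.58, 0.76),
    ("First Group Post-Test", 24, 26.00, 2.36, 0.48),
    ("Wait-Control Pre-Test", 22, 23.64, 3.40, 0.73),
    ("First Group Maintenance", 17, 27.18, 1.94, 0.47),
    ("Wait-Control Post-Test", 22, 26.00, 2.31, 0.49),
]


def standard_error(sd, n):
    """Standard error of the mean from a standard deviation and sample size."""
    return sd / math.sqrt(n)


# Absolute gap between the recomputed and the reported value for each row.
discrepancies = {
    label: abs(standard_error(sd, n) - reported)
    for label, n, _mean, sd, reported in ROWS
}
```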

Group / Testing | n | M | SD |
---|---|---|---|
Treatment Group Pre-Test | 27 | 16.63 | 4.84 |
Control Group Pre-Test | 30 | 19.05 | 5.10 |
Treatment Group Post-Test | 27 | 22.04 | 3.87 |
Control Group Post-Test | 30 | 18.87 | 5.13 |
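To make the two sets of descriptive statistics easier to compare, the sketch below computes mean gains from the tabled means. It assumes that the second numeric column in each table is the mean and that the table immediately above reports the original study's groups; both are inferences from context. The dictionary keys are hypothetical labels, and the gains are simple differences of means, not the effect sizes reported in either study.

```python
# (earlier mean, later mean) pairs transcribed from the two tables above
MEANS = {
    "current_first_treatment": (22.63, 26.00),  # pre-test to post-test
    "current_wait_waiting": (22.82, 23.64),     # baseline to pre-test (no treatment yet)
    "current_wait_treatment": (23.64, 26.00),   # pre-test to post-test
    "original_treatment": (16.63, 22.04),       # pre-test to post-test
    "original_control": (19.05, 18.87),         # pre-test to post-test
}

# Mean gain for each group and phase, rounded to two decimals.
gains = {
    label: round(later - earlier, 2)
    for label, (earlier, later) in MEANS.items()
}
```

Under these assumptions, the original treatment group started lower and gained more than either group in the current study, while both studies' treated groups finished within a few points of each other.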

The purpose of this conceptual replication study was to examine changes in numeracy achievement, intrinsic motivation, and mathematics language when instruction was provided by Native Numbers (

Our first research question was whether we could replicate effects similar to those of

From a design perspective, we matched participant demographics as closely as possible regarding the type of school (i.e., private) and socio-economic status. One plausible explanation for differences in effect sizes is that the studies occurred at different times of the year: fall versus spring. Initial pre-treatment scores differed between the Dias (2016) groups and our groups. The post-test scores for the Dias treatment group were close to the pre-test scores in our sample:

Two additional changes in the current study from the

Our second research question was whether we could replicate intrinsic motivation findings similar to those in the

Notably, the highest possible score, indicating intrinsic motivation to read or to engage in mathematics, was 33. A score of 22 would indicate that a participant was neither intrinsically motivated nor unmotivated to read or do mathematics; an 11 would indicate a participant was generally unmotivated to read or engage in mathematics. For both groups, the medians for reading and mathematics leaned towards being more intrinsically motivated; for example, the lowest median (24.50) was on the pre-test for mathematics. The significant modifications we made when implementing the motivation inventory and the difficulty the participants had interpreting how to respond to the negative questions may explain this null finding. However,

Our last research question looked at mathematical language changes, a measure not included in the

The current study had notable limitations preventing generalization to different contexts and demographics. Importantly, both studies included small sample sizes and were conducted in private schools with predominately Caucasian participants. Additionally, unlike the teacher-to-student ratio in the

Next, the mathematical language measure used in the current study was not designed for kindergarten. We caution readers against comparing the results of our study with those of other mathematical language studies using the Preschool Assessment of Mathematical Language (

Additionally, one of the primary limitations preventing direct comparison of the current study to the

Because of the highly adaptive nature of ITSs, we planned in advance for the possibility that some students would finish quickly, while others would take more time. We considered several different options. For example, before entering the wait-control group into treatment status, we could have administered the post-assessments to the first treatment group participants who had not yet completed all 25 activities to the fifth level and then moved them into the business-as-usual status. Alternatively, we could have kept the participants who had finished and had received post-tests in the same room with the participants who were still working on Native Numbers (

First, we were concerned about a potentially demotivating impact for both those who finished early and those still working if we kept the students who had finished together in the same room. If we had kept the treatment participants together, we would have needed an additional activity to keep those who finished early occupied, introducing a confounding variable. Conversely, removing the participants who had not yet finished might have been seen as unfair if they were engaged and wanted to finish. We planned for and allowed participants to choose how long they wanted to work, not only to maintain ecological validity but also to acknowledge and honor the participants’ agency.

We also chose not to limit the number of days participants used Native Numbers (

Furthermore, we theorized that if participants in the first treatment group needed additional time, these students could potentially represent a subset of students at risk for MD, and analyzing their data could potentially inform future studies of MD. Results from our study did not indicate risk for MD. The outcome scores for the first treatment group participants who had not finished all activities to the 5^{th} level by the 22^{nd} day were not significantly different from the outcome scores of the wait-control group participants who did not finish the 25 activities to the highest level. Except for two participants in the first group who needed additional days, the discrepancies in the number of days

While we monitored the days students used Native Numbers (

Beyond the limitations related to tracking minutes of active engagement, the range of the difference in the number of days the participants required to reach the highest level across all the activities is an example of the logistical challenges of incorporating highly individualized, adaptive instruction (e.g.,

While the current study contributes to the scant literature of replication studies of numerical cognition, based on the limitations described above, we outlined several possibilities for future research: the timing of and need for additional support while using technology, the generalization of concepts, and specific future numeracy research possibilities using Native Numbers (

First, the use of Native Numbers (^{th} percentile) and required additional time to complete all the activities. Thinking about ^{th} grade students in the United States performed at or below proficiency, the use of a cut score below the 30th percentile for determining who needs additional assistance may not capture all students needing support. Students at the edge of any cut score, or performing above but not to proficiency, are not considered at risk for MD, yet may not have a solid conceptual understanding of numeracy. The 60% of students in the United States performing at or below proficiency likely needed additional, or different, instruction well before 4^{th} grade, even if their performance was not below the 30^{th} percentile, or other percentiles used for tiered response to intervention support.

Together, the findings from the current study and the ^{th} percentile. At the end of the study that number increased to 74% and only one participant fell below the 80^{th} percentile. Tools such as Native Numbers, which provide intelligent tutoring, may offer what ^{th} percentile to 91st percentile on the Number Sense Screener™. Implementing highly individualized technology brought the scores of most participants using Native Numbers, in both studies, up to the highest levels of performance. However, the highly individualized tutoring also introduced challenges related to the number of days each student needed to reach the highest level in Native Numbers. Future studies are warranted to investigate Native Numbers’ pedagogy for the potential mechanisms of generalization to concepts not included in Native Numbers’ instruction. Additionally, different study designs such as staggered entry or single-subject studies with multiple baselines may afford ways to navigate the logistical challenges of research incorporating adaptive software.

A second research opportunity is to analyze the log-file data produced when using technology within an experiment (see a suggested framework by

Additionally,

Why would the number of tasks or practice attempts matter? A commonly accepted theory in early numeracy is that acquiring mastery of the counting system takes years (e.g.,

However, tracking the number of tasks alone is likely insufficient to capture the process of learning, just as tracking the number of minutes the program is running is insufficient to capture learning. Despite the significant gains in the current study, not all participants had the same amount of growth. As discussed previously, we did not examine possible reasons why some participants did not finish all 25 activities to the highest level within the 22-day timeframe. Research on how kindergarten students, both at risk and not at risk for MD, present with different degrees of difficulty is necessary to help researchers and teachers understand possible correlations between general domain abilities, self-determination, resilience, and goal orientations; specifically in the domain of mathematics, but also when using technology (e.g.,

Finally, neither our study nor the

Classroom teachers need viable, economically feasible means of providing individualized instruction. However, not all software, or other curricular resources, are designed with substantial pedagogy, and the role of the teacher is critical to the success of any curriculum, supplemental or otherwise (e.g.,

A synthetic data set is available online (

The Supplementary Materials contain the following items (for access see

Objectives and sequence of instruction in Native Numbers

List of activities the active control group completed during math centers

Descriptive statistics and syntax of analyses of multiple imputation of missing data

Descriptive statistics for motivation and mathematical language measures

Syntax of analyses for the differences in performance of the 13 participants who did not complete all 25 activities to the highest level and syntax for the number sense screener

The authors have no funding to report.

The authors have declared that no competing interests exist.

The authors have no additional (i.e., non-financial) support to report.