Reciprocal associations between executive function and academic achievement: A conceptual replication of Schmitt et al. (2017)

The goal of the current study was to conduct a conceptual replication of the reciprocal associations between executive function (EF) and academic achievement reported in Schmitt et al. (2017). Using two independent samples (N (STAR) = 279, and N (Pathways) = 277), we examined whether the patterns of associations between EF and achievement across preschool and kindergarten reported in Schmitt et al. (2017) replicated using the same model specifications, similar EF and achievement measures, and across a similar developmental age period. Consistent with original findings, EF predicted subsequent math achievement in both samples. Specifically, in the STAR sample, EF predicted math achievement from preschool to kindergarten, and kindergarten to first grade. In the Pathways sample, EF at kindergarten predicted both math and literacy achievement in first grade. However, contrary to the original findings, we were unable to replicate the bidirectional associations between math achievement and EF in either of the replication samples. Overall, the current conceptual replication revealed that bidirectional associations between EF and academic skills might not be robust to slight differences in EF measures and number of measurement occasions, which has implications for our understanding of the development of EF and academic skills across early childhood. The present findings underscore the need for more standardization in both measurement and modeling approaches, without which the inconsistency of findings in published studies may continue across this area of research.

its support of learning and adaptation in early school settings (e.g., Blair, 2002; Zelazo, Blair, & Willoughby, 2016). Specifically, children's ability to process and manipulate information, inhibit automatic and potentially inappropriate responses to the environment, and direct their attention to appropriate tasks has been shown to be particularly useful in early learning settings (e.g., Morrison, Ponitz, & McClelland, 2010). Collectively, this research suggests that strengthening children's EF skills could enhance their math and literacy development during the early and formative years of schooling (Blair & Diamond, 2008; Blair & Razza, 2007; McClelland et al., 2007). As such, large-scale interventions have been designed to target EF skills through educational practices and comprehensive curricula programs (e.g., Tools of the Mind, Bodrova & Leong, 2001; Promoting Alternative Thinking Strategies, Kusche et al., 1994; Chicago School Readiness Program, Raver et al., 2008).

Executive Function and Academic Achievement
Recent work, however, has cast doubt on the causal links between early EF skills and children's academic development (e.g., Jacob & Parkinson, 2015). Longitudinal studies examining bidirectional links between EF skills and academic achievement, for example, have challenged the predominant unidirectional perspective of EF supporting academic skill development (e.g., Cameron, Kim, Duncan, Becker, & McClelland, 2019; Fuhs, Nesbitt, Farran, & Dong, 2014; McKinnon & Blair, 2019; Meixner, Warner, Lensing, Schiefele, & Elsner, 2019; Miller-Cotto & Byrnes, 2020; Welsh, Nix, Blair, Bierman, & Nelson, 2010). This work has leveraged the availability of multiple time-points of data to describe reciprocal relations between EF and achievement over time, testing whether these constructs co-develop or are directional in nature during the early years of schooling. In many cases, autoregressive cross-lagged panel (ARCL) models were used to test whether domain-general cognitive abilities, such as EF, prospectively predict domain-specific abilities, such as academic achievement, or the degree to which cognitive abilities and academic skills co-develop (mutually influence each other) over time (see Peng & Kievit, 2020 for review). This research is often referred to as the theory of mutualism or co-development between EF and academic skills across time.
Findings from these studies using the concept of mutualism have been mixed. For example, Welsh and colleagues (2010) found bidirectional relations between EF and numeracy skills (but not literacy skills) during preschool (M age = 4.49 years), whereas other work has shown that EF prospectively predicts math and literacy achievement from preschool to kindergarten (Fuhs et al., 2014). Additionally, studies examining reciprocal relations using more than two time-points have yielded different directional patterns across early development. For example, Schmitt, Geldhof, Purpura, Duncan, and McClelland (2017) found bidirectional relations between EF and math achievement across the preschool school year (M age = 4.70 years), and unidirectional associations from EF to math achievement across kindergarten (M age = 5.70 years). Conversely, McKinnon and Blair (2019) reported bidirectional associations between EF and math skills across kindergarten (M age = 5.75 years), as well as from kindergarten to first grade.

The Present Study
Given the inconsistencies in the literature on the role of EF and academic skills and the continued emphasis on interventions of EF to improve academic outcomes, it is important to try to replicate the findings of the co-development of these constructs. Therefore, the goal of the present study is to conduct a conceptual replication of the Schmitt et al. (2017) study that found important reciprocal relations between EF and academic achievement outcomes by leveraging data from two independent longitudinal studies. We chose to replicate Schmitt et al. (2017) for several reasons. First, the overlap in the EF and academic achievement measures used across the three samples allowed us to replicate the Schmitt et al. (2017) study using similar measures of EF and academic skills. Second, the timing of measurement (i.e., preschool through the end of kindergarten) is similar across the three samples, which will allow us to test the concept of mutualism (co-development) between EF and academic skills across a similar developmental window (i.e., 4.5 to 6.5 years old). Finally, the two independent samples included a sufficient number of EF measures to allow for similar model specifications (e.g., latent variable modeling) as Schmitt and colleagues (2017). Schmitt et al. (2017) examined longitudinal relations between EF, math, and literacy using ARCL modeling. In the original investigation, Schmitt and colleagues (2017) reported bidirectional associations between EF and math achievement, but not literacy, across the preschool year. However, EF prospectively predicted literacy achievement from the spring of preschool to the fall of kindergarten. Further, they found that EF prospectively predicted math, but not literacy achievement across the kindergarten school year.
Based on these findings, we expected to see bidirectional associations between EF and math and literacy achievement from preschool to kindergarten (STAR sample) and expected EF to prospectively predict math from kindergarten to first grade (STAR and Pathways samples).

Schmitt et al. (2017) Dataset
Data were collected on a total of 435 children in the Pacific Northwest, U.S. The study consisted of four waves of data collection; children were assessed in the fall of preschool, spring of preschool, fall of kindergarten, and spring of kindergarten. On average, children were 4.70 years old (SD = 0.30) at the beginning of the study, and 51% were male. This sample consisted of 63% White children, 19% Latino/Hispanic children, 13% multiracial children, 3% Asian/Pacific Islander children, and 2% other ethnicities. In the fall of preschool, 55% of children were enrolled in Head Start and 15% were primarily Spanish speakers. At each wave, children were assessed on a battery of EF, literacy, and math measures.
Children were recruited from schools using a convenience sampling approach, such that, schools and children that were accessible and willing to participate were included in the study. Parents of children signed a written informed consent letter agreeing for their child to participate in the study. Children were assessed in two to three sessions that lasted 10 to 15 minutes each. For more information about this sample, refer to Schmitt et al. (2017).

STAR Dataset
This project was an extension of a larger longitudinal project on trajectories of early academic development. Data were collected on a total of 278 children in a Southeastern U.S. city. The study consisted of three waves of data collection; children were assessed in preschool, kindergarten, and first grade. On average, children were 4.67 years old (SD = 0.42) at the beginning of the study, and 55% were male. This sample consisted of 60% White children, 28% Black children, 2% Asian children, and 10% multiracial children. The sample broadly represented the region in which the children were recruited. All children had no known developmental disorders.
Children were recruited from libraries, daycare centers, and local establishments. Data collection took place during laboratory visits that lasted approximately two hours. During these visits, children participated in a number of tasks that assessed cognitive and emotional development. Each child was assessed on a battery of executive function and achievement measures by a trained experimenter. Parents received monetary compensation for their time, and children selected a small toy at the completion of the visit. All procedures were approved by the university institutional review board.

Pathways Dataset
This project was an extension of a larger longitudinal project studying the effect of schooling on executive function development. A total of 367 children participated in the larger longitudinal project; however, 88 children were recruited in either first or second grade, and were therefore excluded from the current sample. Thus, the full sample for the current study consisted of 279 children attending seven elementary schools in Midwestern U.S. cities. The sample included three cohorts of children who were assessed across the fall and winter of kindergarten and first grade. On average, children were 5.38 years old (SD = 0.10) when first tested, and 47% were male. Although child-level race, ethnicity, and socioeconomic status were not collected, all children were recruited from racially and socioeconomically diverse schools. Schools included in this sample served children from a broad range of socioeconomic backgrounds based on school-wide percentages of free or reduced-price lunch (FRPL; 2% to 71.9%).
Children in this sample were recruited from schools using a convenience sampling approach, similar to Schmitt et al. (2017), such that schools and children that were accessible and willing to participate were included in the study. Parents of children signed a written informed consent letter agreeing for their child to participate in the study. Children were individually assessed in schools outside their classrooms for a 45-minute period. During these assessments, each child was assessed on a battery of executive function and achievement measures by a trained experimenter. The order and versions of assessments were counterbalanced, as there were two different orderings of assessments and two different versions of each of the assessments. Children received a bookmark with stickers at the completion of the visit. All procedures were approved by the university institutional review board.

Measures
Executive Function
A variety of children's executive function skills were assessed in each sample. Both STAR and Pathways samples included executive function measures of working memory and inhibitory control, as well as an additional measure. The additional measure in the STAR sample included a cognitive flexibility measure, and the additional measure in the Pathways sample included one global executive function measure. See Table 1 for a summary of overlapping variables across all datasets.

Schmitt et al. (2017) Executive Function
Auditory Working Memory - Children's working memory was measured using the Auditory Working Memory subtest from the Woodcock-Johnson III Tests of Cognitive Abilities (Woodcock, McGrew, & Mather, 2001). Participants were instructed to repeat back to the experimenter things and numbers in a specific order. An overall accuracy score was calculated by adding children's correct responses (each correct trial = 1 point). See Schmitt et al. (2017) for more information on this task.
Simon Says - Children's inhibitory control was measured using the Simon Says task (Carlson, 2005; Strommen, 1973). The experimenter asked children to perform an action only if the experimenter said, "Simon says"; otherwise, the child was to remain still. For more information on how this task was scored, see Schmitt et al. (2017).
Card Sort - Children's cognitive flexibility was measured using a Card Sort task similar to the traditional Dimensional Change Card Sort task (Blackwell, Cepeda, & Munakata, 2009; Frye, Zelazo, & Palfai, 1995; Zelazo, 2006). The experimenter asked children to sort colored picture cards of a dog, fish, or bird on the basis of three dimensions: color, shape, and size. See Schmitt et al. (2017) for more information on this task.

Head-Toes-Knees-Shoulders (HTKS) -
The HTKS task was used to measure all of children's executive function skills through gross motor responses: working memory, inhibitory control, and cognitive flexibility (McClelland & Cameron, 2012;McClelland et al., 2014). Children were told they were going to play a game in which they must do the opposite of what the examiner's directions say, varying from touching your head, toes, knees, or shoulders. For example, if the trained examiner said, "touch your head" children were expected to touch their toes. The task grows in difficulty across three sections of questions in which the rules change. If children responded incorrectly, they were given a score of 0. If children responded correctly, they were given a score of 2, and if children self-corrected their response, they were given a score of 1. For more information on this task, see Schmitt et al. (2017).

STAR Executive Function
Numbers Reversed - Children's working memory capacity was measured using the Numbers Reversed subtest of the Woodcock-Johnson III (Woodcock et al., 2001). Participants were instructed to listen to the experimenter recite a string of numbers (beginning with two numbers and gradually increasing) and then repeat the numbers in reverse order. An overall accuracy score was calculated by adding children's correct responses (each correct trial = 1 point).

Go/No-Go -
A computer-based Go/No-Go paradigm was used to assess children's inhibitory control and sustained attention. Children were asked to press a button each time they saw an animal, except for when they saw a dog (Lahat, Todd, Mahy, Lau, & Zelazo, 2010). There were a total of 144 trials (75% Go). A discriminability index (d' = Z(hit rate) - Z(false alarm rate)) was used to assess the participants' ability to distinguish signals from noise (Stanislaw & Todorov, 1999). Larger values of d' indicate better task performance.
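The discriminability index described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the example counts are hypothetical, and the log-linear correction for extreme rates is an assumption made here so that the z-transform stays finite; the source does not state how hit or false alarm rates of exactly 0 or 1 were handled.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Return d' = Z(hit rate) - Z(false alarm rate) from trial counts."""
    n_go = hits + misses                         # number of Go trials
    n_nogo = false_alarms + correct_rejections   # number of No-Go trials
    # Assumption (not from the source): log-linear correction, adding 0.5
    # to each count so that rates of exactly 0 or 1 remain finite.
    hit_rate = (hits + 0.5) / (n_go + 1)
    fa_rate = (false_alarms + 0.5) / (n_nogo + 1)
    z = NormalDist().inv_cdf                     # inverse standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical child on a 144-trial task with 75% Go trials (108 Go, 36 No-Go).
example = d_prime(hits=100, misses=8, false_alarms=6, correct_rejections=30)
```

A child who responds identically on Go and No-Go trials would obtain a d' of zero under this sketch, and better discrimination yields larger positive values.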

Dimensional Change Card Sort (DCCS) -
Cognitive flexibility (also known as task shifting) was measured using a computerized version of the Dimensional Change Card Sort task (Espinet, Anderson, & Zelazo, 2012). In the pre-switch block, children were asked to sort the stimuli according to their shape (15 trials). In the post-switch block, children were asked to sort the stimuli according to color (30 trials). The post-switch block was followed by a "borders" block in which children were instructed to sort stimuli on one dimension (color) if the picture had a border around it but the other dimension (shape) if the picture did not have a border (12 trials). Percent accuracy was computed for each block and weighted averages were created as follows: preschool: 33.3% pre-switch, 66.7% post-switch; kindergarten and first grade: 25% pre-switch, 50% post-switch, 25% borders. Higher scores indicated greater cognitive flexibility. Several outcome measures from this dataset were previously published; for further information regarding these measures, see Isbell, Calkins, Swingler, and Leerkes (2018), Isbell, Calkins, Cole, Swingler, and Leerkes (2019), Zeytinoglu, Leerkes, Swingler, and Calkins (2017) and Zeytinoglu, Calkins, and Leerkes (2019). However, this publication differs from the previous publications: Isbell and colleagues (2018, 2019) only used the Go/No-Go task and the WJ subtests in their work, and although Zeytinoglu and colleagues (2017, 2019) used the EF measures, they did not investigate links between EF and academic outcomes.
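As a minimal sketch of the weighted DCCS composite described above, the following Python function applies the stated block weights to percent-accuracy scores. The function name and example values are hypothetical, and the preschool weights are assumed here to be exact thirds (the reported 33.3% and 66.7%).

```python
def dccs_composite(pre_switch, post_switch, borders=None):
    """Weighted DCCS accuracy from per-block percent-correct scores (0-100)."""
    if borders is None:
        # Preschool wave: no borders block; weights of 1/3 and 2/3
        # (assumed exact thirds for the reported 33.3% / 66.7%).
        return (1 / 3) * pre_switch + (2 / 3) * post_switch
    # Kindergarten and first grade: 25% / 50% / 25% weights.
    return 0.25 * pre_switch + 0.50 * post_switch + 0.25 * borders


# Hypothetical scores for illustration only.
preschool_score = dccs_composite(90.0, 60.0)
kindergarten_score = dccs_composite(100.0, 80.0, 50.0)
```

Because the composite is a weighted mean of percentages, it stays on the same 0 to 100 scale at every wave, although the added borders block changes what the score reflects in later waves.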

Pathways Executive Function
Digit Span Backward - Children's working memory was assessed using the Digit Span-Backward subtest of the McCarthy Scales for Children's Abilities (McCarthy, 1972). Participants were read a sequence of numbers (beginning with two numbers and gradually increasing), and asked to repeat the same sequence back to the examiner in reverse order.
Go/No-Go - A Go/No-Go paradigm called the Zoo Game (Grammer, Carrasco, Gehring, & Morrison, 2014) was used to assess children's inhibitory control and sustained attention. Children were told to press a button each time they saw an animal, except for when they saw an orangutan. There were a total of 320 trials (75% Go). A discriminability index (d' = Z(hit rate) - Z(false alarm rate)) was used to assess participants' ability to distinguish signals from noise (Stanislaw & Todorov, 1999). Larger values of d' indicate better task performance.

Academic Achievement
Mathematics - The standardized Applied Problems subtest of the Woodcock-Johnson III Tests of Achievement (WJ-AP; Woodcock, McGrew, & Mather, 2001) was used to assess individual mathematical skills. The Applied Problems task assesses children on numerous early math skills such as counting, representational arithmetic, abstract arithmetic, and the ability to read a clock. Items increase in difficulty as children progress through the task, and basal and ceiling levels are determined for each student. The WJ-AP form was counterbalanced (Form A or Form B) so that children would be less likely to remember questions from the year before. This task was completed by children at all waves.

Literacy -
The standardized Letter-Word Identification subtest of the Woodcock-Johnson III Tests of Achievement (WJ-LWID; Woodcock, McGrew, & Mather, 2001) was used to assess children's literacy skills. The WJ-LWID subtest assessed children's ability to read letters and words in both expressive and receptive language. Items in this task were also ranked in order of difficulty, and basal and ceiling levels were determined for each student. This task was completed by children at all waves.

Covariates
In an effort to replicate the results from the original study as closely as possible, we also considered which covariates should be included. The original analyses (Schmitt et al., 2017) included English Language Learner (ELL) status, Head Start enrollment, and age as covariates. The STAR dataset did not have Head Start enrollment, but there were 11 children for whom English was not the primary language spoken at home. Language spoken at home did not relate to the independent or dependent variables in our analyses (p = .09-.99) and did not predict attrition in kindergarten (χ² = 140, p = .93) or grade 1 assessments (χ² = 1.03, p = .60). Based on these preliminary analyses and the small percentage of children who were ELL (4%), we did not include ELL as a covariate in our analyses. In the replication analyses, the Pathways dataset did not include ELL or Head Start enrollment. Thus, age was the only covariate common to all three datasets (STAR, Pathways, and Schmitt et al., 2017).

Analytic Approach
Similar to Schmitt et al. (2017), all analyses were conducted in Mplus (Muthén & Muthén, 1998. We used full information maximum likelihood estimation (FIML) to handle missing data to reduce potential bias in the parameter estimates (Enders & Bandalos, 2001). This permitted the inclusion of all participants with data on one or more variables. Due to the missing data and potential departures from multivariate normality, the model was estimated using a robust maximum likelihood estimator (MLR). We used ARCL models to examine longitudinal relations between EF, math, and literacy achievement. In both datasets, we first specified an initial longitudinal confirmatory factor analysis (CFA) model, controlling for participants' age at first testing, to examine the fit and factor loadings of the latent EF factors at each wave. In STAR, the EF latent factors consisted of Numbers Reversed, DCCS, and Go/No-Go. In Pathways, the EF latent factors consisted of HTKS, Digit Span Backward, and Go/No-Go. In the initial CFA models, we scaled all latent factors by fixing the latent means to zero and latent variances to one.
Next, we examined the measurement invariance of the EF construct across waves to understand if EF was measured in a consistent way across time. Consistent with Schmitt et al. (2017), we first tested weak factorial invariance (also called metric invariance) to examine the degree to which the specific EF indicators (e.g., working memory) loaded on to the EF constructs equally across time. This was tested by equating the same EF indicators' loadings across waves. Next, we tested strong factorial invariance (also called scalar invariance) to understand whether the EF construct was measured on the same interval or ratio scale across time. This was tested by equating the same EF indicators' intercepts across waves. Although there are different strategies for evaluating measurement invariance, we followed the approach used in Schmitt et al. (2017) given our goal of replicating their study. Thus, consistent with Schmitt and colleagues, the models were compared to the initial CFA model and were rejected if the Comparative Fit Index (CFI) decreased by more than .01 (Chen, 2007). If full weak or full strong factorial invariance led to a decrease in model fit, partial measurement invariance was tested by freeing at most two parameters. For a further discussion, see Little (1997) and Schmitt et al. (2017).
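The CFI-based decision rule described above can be illustrated with a short Python sketch. The function name and the CFI values in the example are hypothetical and are not taken from the fitted models; the rounding step is an implementation choice made here to avoid floating-point artifacts at the .01 boundary.

```python
def invariance_retained(cfi_baseline, cfi_constrained, threshold=0.01):
    """Apply the delta-CFI rule: reject the constrained model if CFI
    decreases by more than `threshold` relative to the baseline CFA."""
    # Round to three decimals so that a drop of exactly .01 is not
    # rejected because of floating-point representation error.
    drop = round(cfi_baseline - cfi_constrained, 3)
    return drop <= threshold


# Hypothetical CFI values for illustration only.
weak_ok = invariance_retained(cfi_baseline=0.98, cfi_constrained=0.97)
weak_rejected = invariance_retained(cfi_baseline=0.98, cfi_constrained=0.96)
```

Under this rule, a drop of .02 would lead to testing partial invariance by freeing individual loadings or intercepts and then re-checking the change in CFI.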
After establishing longitudinal measurement invariance for the EF factors, we specified a longitudinal structural equation model (SEM) with math and literacy as manifest variables. The SEM includes single-lag stability regressions and single-lag cross-construct regressions. Executive function factors, math, and literacy were all allowed to covary. All models included child age at initial assessment as a time-invariant covariate. Similar to Schmitt et al. (2017), model fit was considered adequate based on appropriate fit statistics, including a Comparative Fit Index (CFI) and Tucker Lewis Index (TLI) between 0.95 and 1.00 (Hu & Bentler, 1999; Kline, 2005), and a Root Mean Square Error of Approximation (RMSEA) less than 0.06 (Hu & Bentler, 1999).
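For readers less familiar with ARCL models, the structural part described above can be written schematically as follows. This is a sketch under stated assumptions rather than the exact specification: the symbols are introduced here for illustration, the coefficients are understood to be estimated separately for each lag, and regressing every outcome on age is an assumption about how the covariate entered the model.

```latex
\begin{aligned}
\mathrm{EF}_{t}   &= \beta_{EE}\,\mathrm{EF}_{t-1} + \beta_{EM}\,\mathrm{Math}_{t-1} + \beta_{EL}\,\mathrm{Lit}_{t-1} + \gamma_{E}\,\mathrm{Age} + \zeta_{E,t} \\
\mathrm{Math}_{t} &= \beta_{MM}\,\mathrm{Math}_{t-1} + \beta_{ME}\,\mathrm{EF}_{t-1} + \beta_{ML}\,\mathrm{Lit}_{t-1} + \gamma_{M}\,\mathrm{Age} + \zeta_{M,t} \\
\mathrm{Lit}_{t}  &= \beta_{LL}\,\mathrm{Lit}_{t-1} + \beta_{LE}\,\mathrm{EF}_{t-1} + \beta_{LM}\,\mathrm{Math}_{t-1} + \gamma_{L}\,\mathrm{Age} + \zeta_{L,t}
\end{aligned}
```

Here the same-construct coefficients (e.g., \beta_{MM}) are the stability paths, the cross-construct coefficients (e.g., \beta_{ME} and \beta_{EM}) are the cross-lagged paths, and the residuals \zeta are allowed to covary within wave. A bidirectional EF-math association corresponds to both \beta_{ME} and \beta_{EM} being nonzero at the same lag.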

Sensitivity Power Analyses
The STAR and Pathways samples are existing datasets; thus, a sensitivity power analysis was used to calculate the minimally detectable effect sizes (MDES) given the sample sizes for all statistical analyses (Cribbie, Beribisky, & Alter, 2019; Giner-Sorolla et al., 2019). This provides some context for why we see different rates of significance across the studies for given effect sizes. In the STAR dataset, with three latent variables, six observed variables, 277 participants, α = .05, and power (1-β) = .80, the sensitivity power analysis suggested that the MDES was 0.21 (Soper, 2020). In the Pathways dataset, with two latent variables, four observed variables, 279 participants, α = .05, and power (1-β) = .80, the sensitivity power analysis suggested that the MDES was 0.18 (Soper, 2020). Whereas Schmitt et al. (2017) reported an effect size as small as .11 as significant in their sample, our sensitivity power analyses suggest that neither the STAR nor Pathways datasets are powered to detect effect sizes under .18 as significant.

Results
Descriptive statistics for all three datasets are presented in Tables 2 and 3. Correlation tables for both the STAR and Pathways studies can be found in the Appendix.

Confirmatory Factor Analyses and Measurement Invariance
In both datasets, the initial CFA of the EF variables fit the data well (see Tables 4 and 5), such that all factor loadings were above 0.40 (Stevens, 1992), and all factor loadings were statistically significant for all indicators at each wave (all ps < .05). The initial tests of weak factorial invariance substantially decreased model fit in both datasets (STAR: Δ CFI = .02; Pathways: Δ CFI = .03). Thus, for each dataset, we assessed partial weak and partial strong factorial invariance across EF factors.
In the STAR dataset, freely estimating the numbers reversed factor loading for wave 1 resulted in a model that supported partial weak invariance (Δ CFI = -.00; Δ BIC = 12.99). Although freeing indicators other than numbers reversed could also result in partial weak invariance, we chose to free this indicator in part because its standardized factor loading seemed to be smaller (β = .48) than the factor loadings at the subsequent waves (β = .68 & .55) in the unconditional model (see Table 4). This was likely because there was less variability in the distribution of numbers reversed in the first wave compared to the subsequent two waves, given the difficulty of this task for some preschoolers. Moreover, the numbers reversed and DCCS intercepts were freely estimated across waves, resulting in partial strong invariance (Δ CFI = -.00; Δ BIC = 13.22). Numbers reversed was freely estimated to be consistent with the weak invariance decision. In addition to numbers reversed, we chose to freely estimate DCCS intercepts because the DCCS task at the second and third waves also included the "borders" block, whereas this block was not included in the first wave; this change likely affected the scale of the latent variable across time. In the Pathways dataset, freely estimating the Zoo Go/No-Go factor loading for wave 1 resulted in a model that supported partial weak invariance (Δ CFI = .00; Δ BIC = 4.70). Similar to the STAR sample, we chose to free this indicator because its standardized factor loading seemed to be smaller (β = .41) than the factor loadings at the subsequent waves (β = .82 & .51) in the unconditional model (see Table 4). Moreover, we freely estimated the Zoo Go/No-Go intercept across waves to be consistent with the weak invariance decision, thus resulting in partial strong invariance (Δ CFI = .01; Δ BIC = 11.74).
Thus, both the STAR and Pathways samples demonstrated partial weak and partial strong measurement invariance in the EF latent construct across time, suggesting that the EF constructs showed an acceptable level of measurement equivalence across time.

Autoregressive Cross-Lagged Models
ARCL models were tested and examined using the EF latent variables and math and literacy variables. Syntax is available at https://osf.io/5twgv/. The structural component and standardized results of the final models are presented in Figures 1 and 2. Standardized coefficients for both autoregressive and cross-lagged paths are summarized across all datasets in Table 6. For the structural component and standardized results of the original article, see Figure B.3 in Schmitt et al. (2017).

STAR ARCL
The longitudinal ARCL model fit the data well, χ² = 106.60, df = 69, CFI = .98, TLI = .96, RMSEA = .04. First, within-wave correlations demonstrated a similar pattern as mentioned above in which the first wave correlations between math, literacy, and EF were large and statistically significant. In particular, the correlation between EF and math achievement was very large (r = .87). However, the later within-wave correlations were smaller. Second, the factor stabilities for literacy and EF were all significant and above β = .49. The factor stability for math was close to zero from wave one to wave two, but then moderate and statistically significant from wave two to wave three (β = .30, SE = .14, p = .03). Third, the cross-lagged paths demonstrated that higher executive functioning predicted higher math achievement from wave one to wave two, as well as wave two to wave three. Higher executive functioning was not associated with higher literacy achievement at either wave. Furthermore, higher literacy achievement at the second wave predicted higher math achievement at the third wave, but not at the wave prior. Finally, higher math was not significantly associated with higher executive functioning at wave two or wave three.

Pathways ARCL
Results for the Pathways sample also suggested the longitudinal ARCL model fit the data well, χ² = 51.94, df = 27, CFI = .96, TLI = .93, RMSEA = .06. First, results suggest the first wave correlations between math, literacy, and EF were large and statistically significant. However, the second wave correlations were much smaller. Second, the factor stabilities for literacy and EF were above .50 for both factors, whereas the stability for math was close to zero. Third, when considering the cross-lagged paths, higher executive functioning at wave one predicted higher literacy and math achievement at wave two. Further, higher literacy achievement at wave one predicted higher math achievement at wave two. In contrast, higher math at the first wave was essentially unrelated to literacy achievement at wave two (β = -.01, SE = .11, p = .92), and not significantly related to wave two executive functioning.

Discussion
The goal of the present study was to provide a conceptual replication of the ARCL findings reported by Schmitt and colleagues (2017). The original study revealed partial measurement invariance of the EF latent variables over time, strong autoregressive paths across all constructs, and found bidirectional relations between EF and math and literacy achievement from preschool to kindergarten, and unidirectional relations from EF to math across fall and spring of kindergarten. Consistent with Schmitt et al. (2017), we found that the EF latent variables demonstrated partial measurement invariance over time across both replication datasets. Further, the autoregressive estimates for the EF latent and observed literacy variables across all datasets were moderate in strength, suggesting longitudinal construct stability. However, the stability of math achievement from preschool to kindergarten (STAR study) and from kindergarten to first grade (Pathways study) was close to zero in the ARCL models, which is a significant departure from the stability estimates reported in the original study. Additionally, across both samples, we failed to replicate the bidirectional pattern of findings between EF, math, and literacy achievement reported in the original study. Specifically, in the STAR sample, we found unidirectional associations from EF to math achievement, such that preschool EF prospectively predicted math (but not literacy) achievement at kindergarten, and kindergarten EF predicted math (but not literacy) achievement at first grade. In the Pathways sample, we replicated the observed unidirectional relations found in Schmitt et al. (2017), such that kindergarten EF prospectively predicted math and literacy achievement at the beginning of first grade. Math and literacy skills, however, did not predict the EF latent factor in either sample.
The inconsistent pattern of findings in our replication study mimics the inconsistencies that are found in the EF and academic skills literature during early childhood. In addition to the results reported by Schmitt et al. (2017), several recent studies have also found evidence of the co-development of EF and academic achievement (e.g., Cameron et al., 2019; McKinnon & Blair, 2019; Meixner et al., 2019; Miller-Cotto & Byrnes, 2020; Welsh et al., 2010). However, others (Fuhs et al., 2014; Willoughby et al., 2019), including the current replication, have demonstrated that EF prospectively predicts academic skills, which is inconsistent with the theory of mutualism between cognitive and academic skills across early development (e.g., Peng & Kievit, 2020).
There are several potential reasons for the inconsistency of findings across this body of research and in the present replication study in particular. First, the degree to which EF constructs measured across studies are capturing the same underlying skills is not clear (see Morrison & Grammer, 2016, for review). For example, although many studies adopt a tripartite model of EF, consisting of inhibitory control, working memory/updating, and cognitive flexibility/shifting (e.g., Diamond, 2013; Miyake et al., 2000), others include a broader range of EF-related constructs, such as impulsivity, inattention, and behavioral self-control (e.g., Fuhs et al., 2014), or do not include full coverage of the subcomponents considered part of the broader EF umbrella (e.g., Miller-Cotto & Byrnes, 2020; Willoughby et al., 2019). Although we relied on a set of EF measures similar to those used in Schmitt et al. (2017) for the present replication study, there was not a complete overlap in the measures used across the three samples. For example, the Schmitt et al. (2017) study used the Simon Says task, whereas the STAR and Pathways studies included two child-friendly versions of a Go/No-Go task to measure inhibitory control. Therefore, it is possible that the lack of uniformity in EF measurement approaches partially explains the inconsistent findings in the present study. This lack of uniformity is also a noted limitation of early childhood EF research more broadly (see Morrison & Grammer, 2016).
Another potential reason we were unable to replicate the bidirectional cross-lagged associations between EF and academic achievement reported in the original paper is the difference in the number and spacing of time-points across the three samples. Specifically, while children in Schmitt et al. (2017) were assessed within a narrow window during the fall and spring of their preschool and kindergarten years, children in the replication samples were tested once a year from preschool to first grade (STAR) and during kindergarten and first grade (Pathways). This difference in timing could have contributed to our inability to replicate the bidirectional cross-lagged effects from preschool to kindergarten, as children in the Schmitt et al. (2017) study were sampled across the entire school year. The more fine-grained sampling procedure in Schmitt et al. (2017) allowed them to identify changes in the relations between EF and math achievement (i.e., mutual relations at preschool, EF → math at the end of kindergarten), which they attributed to changes in the complexity of math instruction during the kindergarten school year. Our sampling approach did not allow for a direct test of this hypothesis, as neither the STAR nor the Pathways study included fall and spring testing occasions across preschool and kindergarten. These inconsistent findings suggest that directional patterns of relations between EF and academic skills might, in part, depend on the number of measurement occasions studied across early development.
Finally, differences in sample size and sample characteristics could also explain our inability to replicate the Schmitt et al. (2017) findings. It is possible that subtle effects were not detected due to a lack of statistical power in our replication studies. Specifically, our sensitivity analyses suggested that effect sizes under .21 were not detectable in the STAR dataset due to our sample size, which may have affected our interpretations of the bidirectional relations between EF and math achievement. Future research using larger replication samples is needed to understand whether our inability to replicate the bidirectional cross-lagged associations reported in the Schmitt et al. (2017) study is due to sample size restrictions.
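To illustrate how sample size bounds the smallest detectable effect, the following is a minimal sketch that uses the Fisher z approximation for a single bivariate correlation. It is not the sensitivity analysis conducted for the ARCL models; the function name `min_detectable_r` and the default alpha and power values are assumptions chosen for illustration, so the values it returns need not match the threshold reported above.

```python
from math import sqrt, tanh
from statistics import NormalDist


def min_detectable_r(n, alpha=0.05, power=0.80):
    """Approximate smallest |r| detectable with a two-sided test.

    Uses the Fisher z approximation for one bivariate correlation.
    alpha and power are illustrative defaults, not values taken
    from the study's own sensitivity analysis.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    # The standard error of Fisher's z is 1 / sqrt(n - 3).
    return tanh((z_alpha + z_power) / sqrt(n - 3))


if __name__ == "__main__":
    # 279 is the STAR sample size; the larger values are hypothetical.
    for n in (279, 500, 1000):
        print(n, round(min_detectable_r(n), 3))
```

Under this approximation the detectable effect shrinks roughly in proportion to the inverse square root of the sample size, which is why larger replication samples would be needed to detect small cross-lagged paths.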
Further, differences in sample characteristics across the three samples might have also contributed to the inconsistent replication findings. It could be that the Schmitt et al. (2017) sample included children with systematically different background characteristics. Given the importance of individual, demographic, and family-level influences during early childhood, and their associations with EF and academic outcomes (e.g., Hackman et al., 2015; Sarsour et al., 2011), there may be untested moderators or confounding variables across samples that could help explain the mechanisms involved in the co-development of EF and achievement.
Furthermore, the non-significant autoregression estimates of math achievement across the first two time-points were surprising given that the early math measures (WJ-AP) used in the current study have been extensively validated, are age normed, and show excellent test-retest reliability across development (Woodcock, McGrew, & Mather, 2001). Both the Schmitt et al. (2017) study and the current replication included a working memory measure that involved verbal numerical tasks. The use of a working memory measure that includes numerical naming may introduce a confound in these studies and may play a role in the associations between EF and math. However, the different and unequal sources of measurement error across latent and observed variables do not permit a fair comparison between the EF and math achievement variables in the current study, as latent factor variance is considered independent of residual measurement error, whereas observed variables include both true score and error variance (Bollen, 2002; Rhemtulla et al., 2020). The large correlations observed within waves indicate that there is considerable shared variance between the latent EF factor and math achievement (STAR r = .87; Pathways r = .75). It is possible that when EF and math were modeled simultaneously, the EF latent variable accounted for the variance in math at the next time-point, thus contributing to the decrease in the autoregression of math achievement. Thus, the non-significant math achievement autoregression estimate could be due to the presence of the numerical working memory measures or of the EF latent variable in the ARCL model. This possibility raises questions about the utility of modeling EF as a latent variable when examining bidirectional associations with manifest math achievement variables.
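The asymmetry between latent and observed variables can be made concrete with the standard classical test theory decomposition. The notation below is introduced here for illustration and is not taken from the original models; it assumes that measurement errors are uncorrelated with true scores and with each other.

```latex
% Observed score X, true score T, measurement error E
X = T + E, \qquad \operatorname{Cov}(T, E) = 0, \qquad
\rho_{XX'} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}

% Observed stability between occasions 1 and 2 is attenuated by unreliability
r_{X_1 X_2} = \rho_{T_1 T_2} \sqrt{\rho_{X_1 X_1'} \, \rho_{X_2 X_2'}}
```

Under these assumptions, an observed math score carries its error variance into every path estimate, whereas a latent EF factor is treated as error-free, so the two predictors do not compete on equal terms. With reliabilities close to 1 the attenuation is small, so unreliability alone would be unlikely to produce a near-zero autoregression.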
In sum, the results of the present conceptual replication were mixed. We replicated the results of the EF measurement model and the longitudinal stability estimates of EF and literacy across two independent samples. However, we could not replicate the cross-lagged pattern of findings reported in the original study. The lack of bidirectional relations between EF and math achievement in both replication samples does not lend support to the theory of mutualism between these two constructs. However, the current conceptual replication has also revealed that bidirectional associations between EF and academic skills might not be robust to slight differences in EF measurement and number of measurement occasions, which might have contributed to the mixed findings in the literature and has implications for our understanding of the development of EF and academic skills across early childhood. Although this study cannot shed light on the best way to characterize associations between EF and academic achievement across early development, these findings underscore the need for more standardization in both measurement and modeling approaches, without which the inconsistency of findings may continue across this area of research.