A Comparison of Methods for Assessing Performance on the Number Line Estimation Task

The debate about how to characterize performance on the number line estimation (NLE) task has yielded a diverse set of accuracy measures. These accuracy measures include characterizing performance by deviation from the correct score with percent absolute error (PAE), modeling the shape of responses via the logarithmic-to-linear shift, and modeling the strategy use via the cyclical power model (one and two cycle). In the present study, accuracy on a symbolic NLE task was examined using phenotypic and quantitative genetic analyses of all four measurements. Data were collected from a same-sex twin sample at ages 12 and 15 (N = 150 pairs) as part of the Western Reserve Reading and Math Project. Linear mixed-effect models were used to compare how well the four NLE accuracy measures predicted math achievement, as measured by the Woodcock Johnson-III Fluency, Calculation, and Applied Problems subtests, after cognitive ability was controlled. NLE accuracy measures were not related to Fluency or Calculation after cognitive ability was controlled, but all NLE accuracy measures were related to Applied Problems at 12 and 15 years old. Although theories about what the NLE task measures have been contested in the literature, the relationship between NLE accuracy and achievement did not differ regardless of the type of accuracy measure used. In addition, the estimates for genetic and environmental influences were proportionately similar across the NLE accuracy measures. Overall, all proposed measures of accuracy in the present sample appear appropriate for prediction of math achievement in adolescents.

remain. At the present time, two theoretical approaches attempt to explain patterns of performance on numeric estimation tasks. The first emphasizes the importance of an innate internal mental representation of magnitude which takes the form of a number line (Dehaene, 1992). A second perspective suggests that the development and application of strategies determines performance (Barth & Paladino, 2011). Depending on the theoretical orientation, the recommended approach for quantifying numeric estimation accuracy is different, thus leading to a separate but related practical question in regard to how performance on numeric estimation tasks is best represented. For example, the predicted shape of response is different depending on whether performance on the task is driven by an innate magnitude representation (predicts shift from logarithmic to linear response) or by strategy (predicts minimal error around the endpoints and midpoint). The practical disagreement about which measure to use is relevant because higher performance on the numeric magnitude estimation tasks has been associated with higher levels of mathematical achievement in elementary school children when average error was used (Ashcraft & Moore, 2012; Booth & Siegler, 2006; Fazio et al., 2014; Geary, 2011; Sasanguie, De Smedt, Defever, & Reynvoet, 2012; Sasanguie, Göbel, Moll, Smets, & Reynvoet, 2013; Siegler & Booth, 2004).
The alternate, theory-driven, measures have been used to characterize performance, but a comparison of the predictive validity of the accuracy measures on math achievement outcomes in an adolescent sample has not been conducted.
More broadly, when symbolic numbers (e.g., "8") were presented to both adults and children, regardless of modality, magnitude representation was automatically activated, even in cases where a task did not require a judgment of magnitude (Wood, Willmes, Nuerk, & Fischer, 2008). Internal magnitude representation is ordinal, with numbers cognitively organized so that smaller numbers are associated with the left side and larger numbers are associated with the right side in external space (Dehaene, Bossini, & Giraux, 1993). Therefore, subjects respond more quickly to larger numbers with their right hands and smaller numbers with their left hands. In addition, subjects respond more quickly to comparisons when numbers are farther apart from one another than when they are closer together (distance effect), indicating that people have stronger connections between numbers that are closer on a number line (Duncan & McFarland, 1980). Together, this research has provided evidence for internal magnitude representation as a "mental number line." Internal magnitude representation has been measured using a variety of techniques (Barth & Paladino, 2011; Bouwmeester & Verkoeijen, 2012; Cohen & Blanc-Goldhammer, 2011; Peeters, Degrande, Ebersbach, Verschaffel, & Luwel, 2016; Sasanguie & Reynvoet, 2013; Slusser, Santiago, & Barth, 2013). The most popular task used to measure how accurately subjects relate number to space concretely is the number line estimation (NLE) task (Siegler & Opfer, 2003). The NLE task requires subjects to either mark on a number line where a given number should be placed (number to position NLE task) or identify the number that corresponds to a spot already marked on a number line (position to number NLE task). The present study and review focus on tasks that use the number to position NLE task with symbolic endpoints (Arabic numbers), although there is also research that considers effects when the NLE task has non-symbolic endpoints such as dots.
Although variants of the procedures exist, the symbolic NLE task is arguably the most well-studied of the variants, and characteristics of the symbolic NLE task have been studied in relation to performance on other tasks as well. Over time, students become better on the NLE task, and better performance on the NLE task is associated with higher math achievement on curriculum-based tests in elementary school (Booth & Siegler, 2006; Sasanguie et al., 2012; Sasanguie et al., 2013; Siegler & Booth, 2004). The relationship is also significant after controlling for IQ, indicating that performance on the NLE task is specifically predictive of mathematics skill acquisition beyond general intelligence (Geary, 2011). Performance on the NLE task may not predict all aspects of math achievement, however; a correlation between the average error on the NLE task and accuracy on a timed arithmetic test for first through third grade subjects was not significant, but the correlation between the average error on the NLE task and the general curriculum-based math achievement task was significant (Sasanguie et al., 2013). Therefore, there may be specific aspects of the NLE task that are related to performance on more general curriculum-based tests that are not found in timed arithmetic fact tests.
As with math achievement, there are several ways to evaluate performance on the NLE task, and the most common for the NLE task is average error. In addition to the average error, the pattern of response has also been identified as a relevant aspect of the task, with some children displaying a logarithmic response pattern and others displaying a linear response pattern (Booth & Siegler, 2006; Siegler & Booth, 2004; Siegler & Opfer, 2003; Siegler, Thompson, & Opfer, 2009). Logarithmic responding is characterized by a pattern of response in which the subject affords more space to smaller numbers on the left side of the number line and then condenses the spaces between larger numbers on the right side of the number line. Linear responding is a pattern of accurate response in which the child equally spaces numbers from left to right. The majority of kindergarteners respond logarithmically to a 0-100 number line, but the majority of second grade students respond linearly (Booth & Siegler, 2006; Siegler & Booth, 2004). Response patterns appear to undergo a logarithmic-to-linear shift as particular numbers become familiar, so children can simultaneously have a linear representation in one order of magnitude (e.g., 0-100 number line) and a logarithmic representation in another order of magnitude (e.g., 0-1000 number line); this is the case for the majority of second grade students, who are less familiar with larger magnitudes (Booth & Siegler, 2006; Siegler & Opfer, 2003). By sixth grade, students are more likely to respond in a linear fashion to both the 0-100 and 0-1000 number lines (Booth & Siegler, 2006; Siegler & Opfer, 2003).
With the ability to hold multiple representations at the same time, it appears that the logarithmic-to-linear shift in performance on the NLE task may be driven by a familiarity with numbers that arises from school instruction.
Further evidence of the importance of school instruction comes from the fact that adults from cultures without formal mathematical instruction map number to space logarithmically, whereas the majority of adults from Western cultures map number to space linearly (Dehaene, Izard, Spelke, & Pica, 2008). While the logarithmic-to-linear shift appears to be a product of formal education, it does not occur at the same time for all children. Children with dyscalculia, for example, tend to show more logarithmic response patterns than children with typical development from ages 8 to 10 (Kucian et al., 2011).
Due to the developmental logarithmic-to-linear shift, linearity has been used as a performance measure on the NLE task. Mathematics achievement has been correlated with linearity based on individual fit statistics to linear functions using R² (Ashcraft & Moore, 2012; Booth & Siegler, 2006; Siegler & Booth, 2004). However, R² reflects the degree of fit to a linear function but does not account for the degree of linearity compared to logarithmicity. Therefore, a function may be low in R² due to deviation from the correct answer on a set of items, but the pattern of response may still more appropriately fit a linear function than a logarithmic function. A mixed log-linear model (MLLM) may be more appropriate for specifically identifying the degree of logarithmic responding (Anobile, Cicchini, & Burr, 2012; Cicchini, Anobile, & Burr, 2014). The degree of logarithmic responding in the MLLM fit for performance of kindergarteners (0-30), first graders (0-100), and second graders (0-1000) was found to be significantly correlated with accuracy on addition and subtraction problems (Kim & Opfer, 2017).

Number Line Accuracy Measures and Math Achievement 556
Although there is robust evidence to suggest that performance on the NLE task is better in older and more well-educated children, the connection between better performance on the NLE task and changes to internal magnitude representation is still under debate (Barth & Paladino, 2011; Bouwmeester & Verkoeijen, 2012; Cohen & Blanc-Goldhammer, 2011; Peeters et al., 2016; Sasanguie & Reynvoet, 2013; Slusser et al., 2013). The lower errors and increased linearity on the NLE task may be due to a change in the subjects' strategy use and familiarity with the subject matter rather than due to a change in the subject's internal magnitude representation. The influence of strategy in the NLE task is demonstrated through the pattern of lower errors around important markers, specifically the endpoints and midpoint (Bouwmeester & Verkoeijen, 2012). If a true change in internal magnitude representation was occurring, lower errors around midpoints and endpoints would not be predicted because internal magnitude representation is continuous, unbounded, and does not have an identifiable midpoint (Bouwmeester & Verkoeijen, 2012). When the midpoint is marked on the NLE task, accuracy is even higher at the midpoint (Peeters et al., 2016). In addition, subjects who talk more about using markers as guides for their responses perform better on the task and have higher mathematics achievement (Peeters et al., 2016).
Not only does attention to the midpoint and endpoints change performance, but also the range of numbers used for the number line changes the pattern of responding (Hurst, Leigh Monahan, Heller, & Cordes, 2014).
College-aged students who responded linearly on number lines with familiar endpoints responded logarithmically on number lines with non-familiar endpoints and also responded logarithmically to a number line with letter anchors (Hurst et al., 2014). Thus, the logarithmic-to-linear shift witnessed in early elementary school for the majority of students may be due to strategy use and familiarity with the stimuli rather than a change in internal magnitude representation.
If responses on the NLE task are driven by strategy use, specifically how subjects use the midpoint and endpoints, then performance on the task may be more accurately characterized by a function that predicts minimal errors around the markers in use, the cyclical power model (Barth & Paladino, 2011; Cohen & Blanc-Goldhammer, 2011; Slusser et al., 2013). The cyclical power model, explained in more detail below, has directly challenged the logarithmic-to-linear shift by viewing the change in performance on the NLE task as one that evolves continuously through the use of various midpoint strategies rather than categorically, from a logarithmic-to-linear response (Barth & Paladino, 2011). In addition, the use of the cyclical power model acknowledges that the NLE task is one that requires a proportional judgment (e.g., given the total length of the number line, how much space should I allot to this given number?). True internal magnitude representation is not bounded and thus not one of proportion judgments, so the use of the cyclical power model is intended not to more closely resemble changes in internal magnitude representation but to instead capture strategy use by the participant.
The specific strategy use that is predicted determines the type of cyclical power model used, with a one-cycle cyclical power model appropriate for participants using the endpoints as markers, and the two-cycle cyclical power model appropriate for participants using the endpoints and midpoints as markers.
Due to these theoretical disagreements, comparisons between the cyclical power model and the logarithmic-to-linear shift have been conducted (mostly with group level data) with mixed results. In these studies, the fit of a mixed logarithmic-linear model (MLLM) representing the logarithmic-to-linear shift is favored in some (Dackermann, Huber, Bahnmueller, Nuerk, & Moeller, 2015; Kim & Opfer, 2017; Opfer, Thompson, & Kim, 2016), while the fit of the cyclical power model is favored in others (Barth & Paladino, 2011). Although group level data has been used to compare the logarithmic-to-linear shift and the cyclical power model, the relationship between individual differences in performance based on the cyclical power model and math achievement has not been as well established. An individual differences analysis using the model fit statistic Akaike Information Criterion (AIC) to compare the preferred fit of various models on the NLE task did show that performance on the NLE task in higher grades tended to be more appropriately captured by the cyclical power model than a model capturing the logarithmic-to-linear shift, indicating that strategy may be fundamental to task performance in higher grades (Sasanguie, Verschaffel, Reynvoet, & Luwel, 2016). In a separate study, the cyclical power model was not associated with accuracy on addition or subtraction problems in younger subjects, but the fit for the MLLM, representing the logarithmic-to-linear shift, did significantly predict performance (Kim & Opfer, 2017).
Given the theoretical disagreement in the literature, it is not clear how well each method of scoring can predict performance on math achievement tests. The purpose of this study is to examine how different ways of measuring accuracy on the number line task using methods from theoretically different origins predict math achievement.

Quantitative Genetics
Behavior genetics is a tool used to investigate the origins of individual differences in traits by calculating the proportion of the variation accounted for by genetics, shared environment, and nonshared environment or error in a trait. Behavior genetics has particular utility in the present study because a direct measure of the amount of variation accounted for by genetics and different aspects of the environment on the NLE task has not been conducted. In addition, examining the NLE data in a behavior genetics framework will allow us to validate the theoretical underpinnings of each approach. Theoretically, the shared environmental factor, which would account for shared experiences in school settings, would be significant for the logarithmic-to-linear shift because schooling is linked to more linear responding on the task. Training programs have also shown success in initiating the logarithmic-to-linear shift (Kucian et al., 2011; Opfer & Siegler, 2007; Ramani & Siegler, 2008; Siegler & Ramani, 2009; Thompson & Opfer, 2008, 2016). Therefore, there is evidence that the shared environment such as attending the same school may affect the shape of responding on the NLE task. Schooling may also affect the individual's application of strategy, with older children being more likely to display patterns of responding consistent with the cyclical power model (Sasanguie et al., 2016). Genetics are expected to drive variation for all measures of the NLE task due to the cognitive processes required by the task, as extant behavioral genetic research provides ample evidence for the pervasive influence of genes on individual differences in general cognitive ability and in specific cognitive abilities (Plomin, DeFries, Knopik, & Neiderhiser, 2013). Behavioral genetic models have not previously been applied in the same sample of research participants on NLE measures reflecting both cyclical power model and logarithmic-to-linear shift approaches; therefore, we view these analyses as exploratory. Nevertheless, if the NLE measures representing the cyclical power model and the logarithmic-to-linear shift differentially predict math achievement, behavioral genetic analyses may provide important information on whether genetic, shared, and/or nonshared environmental influences mediate the differential prediction.
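To make the three-way decomposition concrete, the following is a minimal sketch that uses Falconer's classic approximations from twin correlations. This is a simpler, swapped-in technique for illustration only, not the structural equation modeling used for the estimates reported in this study. The function name `falconer_ace` and the example correlations are hypothetical and are not taken from the data.

```python
def falconer_ace(r_mz, r_dz):
    """Approximate variance components from twin correlations (Falconer's formulas).

    r_mz: correlation between monozygotic twins on a trait.
    r_dz: correlation between dizygotic twins on the same trait.
    Returns proportions of variance for additive genetics (A),
    shared environment (C), and nonshared environment plus error (E).
    """
    a2 = 2.0 * (r_mz - r_dz)   # MZ twins share twice the segregating genes of DZ twins
    c2 = 2.0 * r_dz - r_mz     # similarity not explained by genetics
    e2 = 1.0 - r_mz            # whatever makes MZ twins differ, including error
    return {"A": a2, "C": c2, "E": e2}


# Hypothetical correlations, chosen only to illustrate the arithmetic.
components = falconer_ace(r_mz=0.6, r_dz=0.4)
print(components)
```

The three components sum to 1 by construction. Unlike a fitted model, these approximations can produce negative or out-of-range values when the correlations are noisy, and they yield no confidence intervals.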

Present Study
This study is unique in using an adolescent sample to evaluate individual differences in responses on an NLE task, which not only will fill a grade and age gap in the literature but will also describe individual differences at an older age range in which linear responding may be assumed but not proven for all participants. Although the logarithmic-to-linear shift for numbers 0-1000 occurs in elementary school for most children, performance still varies in sixth grade. For example, at age 12, a logarithmic function was still the best fit function for 28% of participants (Siegler & Opfer, 2003). As further evidence that age is not a substitute for the developmental stage a child is in, some older children still display overestimation of smaller numbers while some younger children display linear responding (Bouwmeester & Verkoeijen, 2012). In fifth grade, modeling the use of the midpoint is still not universal, as only 58% of the subjects were best fit by that model (Rouder & Geary, 2014). Thus, although the transition to more sophisticated responding on this task is often studied in early to late elementary school, there is still variation in response patterns after elementary school. In addition, the present study benefits from a longitudinal design, in which stability of performance between time points can be assessed. Correlations between 5-year-olds' performance on the NLE task at 10-week interval measurements ranged from .43 to .50 for endpoints 1-10 and .37 to .56 for endpoints 1-100 (Muldoon, Towse, Simms, Perra, & Menzies, 2013); however, a longitudinal assessment of the stability of number line estimation performance in adolescents has not been conducted.
Overall, each measurement style is attempting to measure different characteristics of performance on the NLE task. However, given that we are attempting to use the number line estimation task to understand individual differences and the predictive validity for achievement, a comparison of the accuracy measures, though they are theoretically different, is called for. The present study attempts to answer the following three questions.
First, do the measures closely resemble one another? It was hypothesized that, given the theoretical differences between the measures, these accuracy measures would not be highly correlated. Second, are the accuracy measures distinct in their predictive value of different types of math achievement? It was hypothesized that NLE task performance would be most highly predictive of math achievement measures that involve complex math reasoning involving proportion judgment. Third, are the accuracy measures distinct in their genetic and environmental origins? It was hypothesized that variation in all measures would be significantly predicted by a genetic component due to the cognitive nature of the task. In addition, it was hypothesized that the shared environment component would be relevant for all measures due to past studies that have demonstrated the influences of schooling and intervention.
and 56% were female. DNA genotyping was used to determine zygosity, and in cases without genotyping consent, zygosity was established using a parental questionnaire (Goldsmith, 1991).

Measures

Number Line Estimation Task
The NLE task was administered to participants at ages 12 and 15 via a pencil and paper format with 0 and 1000 displayed at opposite ends of the number line (Opfer & Siegler, 2007). The participant was instructed to identify the location of 500 on the number line before beginning the trials, and the administrator corrected mistakes made. During the trial phase, participants were presented with a new number line anchored at the ends with 0 and 1000 and a number at the top of the page and were asked to mark on the line the appropriate location of the number. In total, 22 trials were presented sequentially in the same ascending, non-randomized order for all participants (i.e., 2, 5, 18, 34, 56, 78, 100, 122, 147, 150, 163, 179, 246, 366, 486, 606, 722, 725, 738, 754, 818, 938). Accuracy on the NLE task was characterized by four measures (explained in more detail be-
Mixed Log-Linear Model - The mixed log-linear model characterizes the shape of response between a linear and logarithmic function (Anobile et al., 2012; Cicchini et al., 2014). In Equation 2, R is a vector of the student's responses, N is a vector of the numbers given, a is a scaling parameter, and λ is the degree of logarithmic trend in the response pattern. The accuracy value targeted here is λ, which ranges from 0 (perfectly linear responding) to 1 (perfectly logarithmic responding).
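The body of Equation 2 does not appear in this text. As a hedged sketch, the function below assumes the form usually attributed to Anobile et al. (2012), R = a[(1 - λ)N + λ(U / ln U) ln N], where U is the upper endpoint of the line. The function name `mllm_predict` and the parameter name `upper` are hypothetical, and the form itself should be checked against the cited source before use.

```python
import math


def mllm_predict(n, a, lam, upper=1000.0):
    """Predicted response for target n under an assumed mixed log-linear model.

    n: the number presented (must be > 0, since ln(0) is undefined).
    a: scaling parameter.
    lam: degree of logarithmic trend, 0 (linear) to 1 (logarithmic).
    upper: upper endpoint of the number line (assumed 1000 here).
    """
    linear_part = n
    # Rescale ln(n) so that a fully logarithmic response still maps `upper` to `upper`.
    log_part = (upper / math.log(upper)) * math.log(n)
    return a * ((1.0 - lam) * linear_part + lam * log_part)


# With lam = 0 the prediction is the target itself; with lam = 1 small
# numbers are pushed far to the right, as in logarithmic responding.
for lam in (0.0, 0.5, 1.0):
    print(lam, round(mllm_predict(100, a=1.0, lam=lam), 1))
```

In practice λ and a would be estimated per participant by minimizing the squared difference between predicted and observed responses; the fitting step is omitted here.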

One-Cycle Cyclical Power Model - The one-cycle cyclical power model is a function that predicts the use of 0 and 1000 as anchors that assist in the proportion judgment (Barth & Paladino, 2011; Hollands & Dyre, 2000; Spence, 1990). The model predicts that subjects using the endpoints as anchors will have a predictable pattern of response, overestimating on one side of the midpoint and underestimating on the other side of the midpoint.
Note. N ≥ 500.
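The equations for the cyclical power models are not present in this text. The sketch below assumes the proportion-judgment form commonly used in this literature (e.g., Barth & Paladino, 2011), in which a single exponent β bends responses between reference points. The function names `one_cycle` and `two_cycle` and the parameter `upper` are hypothetical, and the exact parameterization may differ from the one the study fitted.

```python
def one_cycle(n, beta, upper=1000.0):
    """One-cycle model: only the two endpoints (0 and upper) act as anchors."""
    part = n ** beta
    rest = (upper - n) ** beta
    return upper * part / (part + rest)


def two_cycle(n, beta, upper=1000.0):
    """Two-cycle model: the midpoint is a third anchor, giving two half-range cycles."""
    half = upper / 2.0
    if n <= half:
        # Lower half: a one-cycle curve over the range 0 to half.
        return one_cycle(n, beta, upper=half)
    # Upper half: the same curve shifted to the range half to upper.
    return half + one_cycle(n - half, beta, upper=half)


# beta = 1 reproduces the target exactly; beta < 1 overestimates below an
# anchor-to-anchor midpoint and underestimates above it.
for target in (125, 250, 500, 750):
    print(target, round(one_cycle(target, 0.5), 1), round(two_cycle(target, 0.5), 1))
```

Under these assumed forms the anchors themselves are always reproduced without error, which is the pattern of minimal error around endpoints and midpoint that the strategy account predicts.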

Math Achievement
The Woodcock Johnson III (WJ-III) was administered as a measure of math achievement and included mathematics subtests Math Fluency, Calculation, and Applied Problems at ages 12 and 15 (Woodcock, McGrew, & Mather, 2001). Math Fluency is a measure of how quickly participants can solve simple math problems; participants were given three minutes to solve 160 addition, subtraction, and multiplication problems using paper and pencil. Performance was based on the number of problems solved correctly in the limited time. The Calculation subtest measures the ability to solve math problems using paper and pencil, ranging in difficulty from single digit addition and subtraction to geometry, trigonometry, and calculus problems. There was no time constraint on the Calculation subtest. The Applied Problems subtest measures the participant's ability to solve story problems in a free response format without a time constraint. The story problems were presented visually and read aloud to the subjects. The subtest required participants to understand the story problem and apply the appropriate mathematical procedure without assistance. Some items included spurious information that was to be ignored for successful completion of the problem. Problem difficulty ranged from basic arithmetic to more advanced topics in probability, and several problems tested proportional reasoning.

General Cognitive Ability
A general cognitive ability summary measure was compiled from several measures, including the Boston Naming Test (Kaplan, Goodglass, & Weintraub, 1983), Clinical Evaluation of Language Fundamentals (CELF) Word Classes subtest (Semel, Wiig, & Secord, 2003), Woodcock Reading Mastery Test-Revised (WRMT-R) Word Identification subtest (Woodcock, 1998), Wechsler Intelligence Scale for Children (WISC) Symbol Search subtest (Wechsler, 2004), and Comprehensive Test of Phonological Processing (CTOPP) Rapid Digit Naming and Rapid Letter Naming subtests (Wagner, Torgesen, & Rashotte, 1999) administered at age 12. The first, unrotated principal component was used as a summary measure for general cognitive ability, as was done in a previous publication (Lukowski et al., 2014).

Descriptive Statistics
Descriptive statistics for raw values are listed in variance, and the 15-year-old NLE measures shared 79% variance. The correlations between the time points were also significant for all of the measures (PAE: .37, λ: .44, β₁: .45, β₂: .28). Within time points, all measures of NLE predicted all three measures of math achievement except for β₂ and Fluency at 12 years old. In addition, the relationships between all 12-year-old NLE measures and 15-year-old math achievement measures were significant except for β₂ and Fluency.

Table 1 Descriptive

Linear Mixed-Effect Models
In order to account for non-independence of observations (due to the fact that the participants were each part of a twin dyad), linear mixed-effect models with random intercept and Satterthwaite correction were used (Kenny, Kashy, Cook, & Simpson, 2006). The lme4 package was used to perform the analysis in R (Bates, Mächler, Bolker, & Walker, 2015).
Equation 5. Linear mixed-effect model predicting achievement from NLE task performance and g.
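The body of Equation 5 does not appear in this text. A hedged reconstruction from the symbol definitions given in the paragraph that follows is shown here; the predictor labels $g_{ij}$ and $\mathrm{NLE}_{ij}$ are assumed notation, and the subscripts on the remaining terms follow those definitions as written.

```latex
Y_{ij} = a_0 + d_i + b_1\, g_{ij} + b_2\, \mathrm{NLE}_{ij} + e_{ij}
```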
The NLE measure is used to predict math achievement while controlling for g in Equation 5. In the equation, i represents the individual, and j represents the grouping variable (dyad). Y is math achievement, a₀ is the grand mean, dᵢ is the random intercept, b₁ is the main effect of the g factor, b₂ is the main effect of the NLE measure, and eᵢⱼ is error. The main effects of the NLE measure for the models are listed in Table 3. The main effect of the g factor was significant for all comparisons. The main effect of the NLE task performance after controlling for g was significant for all NLE measures at 12 and 15 years old for Applied Problems but not for Fluency and Calculation.
amount of variance accounted for by genetics, shared environment (factors that make siblings more similar to one another) and nonshared environment (factors that make siblings less similar to one another) and error.
OpenMx, a software package in R, was used to conduct the twin analyses in order to get estimates of heritability, shared environment, and nonshared environment/error for the accuracy values at ages 12 and 15 (Neale et al., 2016). The estimates are listed in
Note. Values in brackets represent 95% confidence intervals.

Discussion
Although all measures of the NLE task were significantly correlated both cross-sectionally and longitudinally with all three math achievement measures, the relationships between the NLE task and two measures of math achievement were no longer significant once general intelligence was included as a predictor. We have two possible explanations for why performance on the NLE task predicts performance on the Applied Problems subtest once g is controlled but does not predict performance on the Fluency or Calculation subtests.
First, it is possible that internal magnitude representation only assists in performance on tests that are developmentally challenging for the participant. For example, in children in kindergarten through second grade, NLE task performance was predictive of accuracy on simple addition and subtraction problems (Kim & Opfer, 2017).
However, in a separate study of slightly older children (first through third grade subjects), NLE task performance was not predictive of performance on a timed arithmetic fact test, but NLE task performance was predictive of general curriculum math achievement (Sasanguie et al., 2013). In the case of kindergarten children, completion of addition and subtraction problems would be complex given the age, but by adolescence, perhaps "complexity" refers to applying mathematical concepts to story problems. Magnitude representation may assist in the development of complex mathematical abilities at different ages, but the level of complexity is determined by developmental stage.
Alternatively, the significant prediction of performance on the NLE task for Applied Problems may be due to shared characteristics of the tasks such as proportional reasoning requirements and strategy use. The Applied Problems subtest requires the participants to perform proportional tasks such as identifying what a third of a quantity would be in a story problem. In addition, story problems in the Applied Problems subtest require the participants to choose relevant information before performing operations; this is a more complex task that requires some strategy for higher performance (Woodcock, McGrew, & Mather, 2001).
The phenotypic analyses also gave insight into the stability of the task from age 12 to age 15. Across the 3-year interval, task performance was relatively stable (PAE: .37, λ: .44, β₁: .45, β₂: .28). In a previous study of 5-year-olds, the correlation between performance (measured by PAE) on an NLE task across 30 weeks of measurement was .41 for 1-100 endpoints, .46 for 1-10 endpoints, and nonsignificant for 1-20 endpoints (Muldoon et al., 2013). Although the present study captures individuals during a different developmental period, and the time interval is larger, the measurement stability is similar between our study and the 1-10 as well as the 1-100 NLE task performance in Muldoon et al. (2013). It is possible that the correlations would have been higher in the adolescents if the test-retest interval had been shorter, thus indicating more stability in performance as children age into adolescence, but that was not possible to directly assess in the present study.
As in the phenotypic analyses, the behavior genetic analyses also did not show any differentiation between the measures despite theoretical differences. Genetics were hypothesized to be influential in all measures given the amount of variation that is typically predicted in cognitive variables (Polderman et al., 2015). At age 12, genetics explain a large portion of the variation in the NLE performance, but by age 15, genetics are no longer predictive. Instead, at age 15, individual differences are predicted almost entirely by the nonshared environment and measurement error, which means that the individual's environment is driving the variation rather than other more predictable parts of the environment such as school. By age 15, most subjects were able to complete the task quite well, so any individual variation that is explained may be due to individual motivation.
We also hypothesized that a significant proportion of the variation in performance on the NLE task would be due to shared environment because of the environmental influences demonstrated by previous studies (Dehaene et al., 2008; Kucian et al., 2011; Opfer & Siegler, 2007; Ramani & Siegler, 2008; Siegler & Ramani, 2009; Thompson & Opfer, 2008, 2016). Contrary to our hypothesis, shared environment was not a significant factor. This indicates that environmental effects that make twins more similar to one another (e.g., shared curriculum) are not operating on individual differences in performance on the NLE task at ages 12 and 15 despite evidence from other studies to suggest that the environment is an important component in performance. One explanation for the lack of shared environmental influences in the present sample is that shared environmental influences are reduced in cases where the environment is the same for participants (e.g., standardized curriculum) and thus cannot drive variation in the measure. In those situations, shared environmental factors such as schooling can affect the mean performance but may not lead to individual differences. Even so, behavior genetics results are in contrast to the hypothesized results for the logarithmic-to-linear shift, which would have predicted the shared environment to largely explain variation.
Overall, there do not appear to be fundamental differences between the accuracy measures on the NLE task in samples of 12 and 15 year olds. All measures are highly correlated and approximately equally predictive of math achievement. The appropriateness of the accuracy measure for a given study of adolescents thus can be determined based on pragmatic and theoretical underpinnings of the study. PAE is beneficial in that it is calculated without fitting data to a model, and thus even when subjects' responses are extreme, a result can still be obtained. However, if individual differences of progress towards linearity are being sought, then the mixed log-linear model seems to still be the most appropriate, although the fit may not be appropriate in the cases of very low performers. In addition, the similarity of the one-cycle cyclical power model with the mixed log-linear model has also been established. The high correlation between these measures indicates that they are both measuring a similar pattern of responding, but the cyclical power model may lose individual differences for an even larger subset of the lowest performers due to model fit concerns. This study provided evidence for relevant individual differences in magnitude representation for adolescents, a group whose magnitude representation has not been largely studied. The similarity of the measures despite differences in theoretical underpinnings has also been shown using both regression and behavior genetic analyses.

Participants
Data were drawn from the Western Reserve Reading and Math Project (WRRMP), a 10-wave longitudinal twin study in which same-sex twin pairs were recruited from school nominations and birth records in kindergarten or first grade. Data for the present study were drawn from the 8th and 9th measurement occasions. These waves of measurement were approximately 3.0 years apart (on average), and the participants averaged 12.2 years (SD = 1.2) in the 8th wave and 15.4 years (SD = 1.4) in the 9th wave. Participants who completed the number line task at age 12 and again at age 15 and who had scores on all four accuracy measures at each time point were included for the present study (N = 300; MZ: n = 130; DZ: n = 170). Participants were mostly White (94%), low): percent absolute error, mixed log-linear model, one-cycle cyclical power model, and two-cycle cyclical power model.
Differences in the administration procedure (emphasizing and correcting the half mark vs. not drawing attention to the half mark) and item distribution (oversampling the left side of the distribution vs. evenly sampling the distribution) have been shown to change which model (a mixed log-linear model or a mixed cyclical power model) has a better fit to the responses (Opfer et al., 2016). In the comparison of the conditions, 58.33% of the individuals in the condition consistent with the data collection procedure in the present study had response patterns that were better fit by the mixed log-linear model rather than the mixed cyclical power model. Given that the approach had almost equal best fit results for both models, the results of the present study should not be influenced heavily by the appropriateness of fit of one function over the other due to the administration procedures and sampling.
Percent absolute error - Percent absolute error (PAE) is the sum of the absolute value of errors divided by the total length of the number line times the number of trials (Equation 1). Equation 1. PAE.
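The verbal definition of PAE above can be illustrated with a minimal sketch. The function name, the default 0 to 1000 line length, and the example responses are assumptions made for illustration only; they are not the study's code or data. The value is returned as a proportion, and multiplying by 100 to express it as a percentage is likewise an assumption about convention.

```python
def percent_absolute_error(estimates, targets, line_length=1000.0):
    """Sum of absolute errors divided by (line length * number of trials).

    `line_length` defaults to 1000, assuming a 0-1000 number line.
    """
    if len(estimates) != len(targets) or len(targets) == 0:
        raise ValueError("estimates and targets must be non-empty and equal in length")
    total_abs_error = sum(abs(e - t) for e, t in zip(estimates, targets))
    return total_abs_error / (line_length * len(targets))


# Hypothetical responses from one participant on three trials.
example_targets = [100, 500, 850]
example_estimates = [120, 480, 900]
pae = percent_absolute_error(example_estimates, example_targets)
print(pae)
```

Because no model is fitted, the function returns a value for any set of responses, which is the practical advantage of PAE noted elsewhere in the paper.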

N in this equation was divided by the total length of the number line so that the fitted values represented the trend of a proportion judgment. The data were fit with one free parameter (β1) to indicate the shape of the function, with a value of 1 indicating a perfectly linear response, and values deviating further from 1 indicating greater distance from the linear response. All values of β1, shown in Equation 3, were recalculated to the absolute value of the difference between their original value and 1 so that values of 0 would represent perfectly linear responding, and values further from 0 would represent less linear responding. Equation 3. β1.
Two-Cycle Cyclical Power Model - Maturation of numerical ability would theoretically draw attention not only to the endpoints of the number line but also to its midpoint; subjects begin to use the midpoint as an anchor in the task, noting that values greater than 500 should be placed to the right of the midpoint, and values less than 500 should be placed to the left of the midpoint. The use of a midpoint strategy would lead to responses consistent with a two-cycle cyclical power model (Hollands & Dyre, 2000). The use of the midpoint creates a pattern of response in which numbers are overestimated between 0 and 250, underestimated between 250 and 500, overestimated between 500 and 750, and underestimated between 750 and 1000 as the participant judges the distance from the anchor point of their choosing or shows the opposite pattern of underestimate-overestimate-underestimate-overestimate depending on which anchors they reference. The free parameter for this model, represented in Equations 4a and 4b, is β2.
Values at 1.0 represent linear responding, and values further away from 1.0 represent greater deviations from a linear response. All values of β2 were recalculated to the absolute value of the difference between their original value and 1 so that values of 0 would represent perfectly linear responding, and values further from 0 would represent less linear responding. Equation 4a. β2. Note. N < 500. Equation 4b. β2.
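The one-cycle and two-cycle models described above can be sketched as follows. The equations themselves did not survive in this copy of the text, so the functional forms below are an assumption: they follow the commonly cited cyclical power model of proportion judgment (Hollands & Dyre, 2000), applied to the target divided by the line length. The function names are hypothetical.

```python
def one_cycle(p, beta):
    """One-cycle cyclical power model for a proportion p in [0, 1].

    Assumed form: p**beta / (p**beta + (1 - p)**beta).
    """
    num = p ** beta
    return num / (num + (1.0 - p) ** beta)


def two_cycle(p, beta):
    """Two-cycle model: the same curve applied separately to each half
    of the line, with the midpoint (0.5) acting as an extra anchor."""
    if p < 0.5:
        num = p ** beta
        return 0.5 * num / (num + (0.5 - p) ** beta)
    num = (p - 0.5) ** beta
    return 0.5 + 0.5 * num / (num + (1.0 - p) ** beta)


def linearity_deviation(beta):
    """Recoding described in the text: |beta - 1|, so 0 means linear."""
    return abs(beta - 1.0)


# Targets on a 0-1000 line are first converted to proportions.
for target in (100, 400, 600, 900):
    p = target / 1000.0
    print(target, round(1000.0 * two_cycle(p, 0.5)))
```

With an exponent below 1, this form reproduces the overestimate-underestimate-overestimate-underestimate pattern across the four quarters of the line, and with an exponent of exactly 1 both functions reduce to a linear response.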
Statistics of Performance on the NLE Task and Mathematical Achievement Measures for 12 and 15-Years-Old
Note. PAE (log) is the log-transformed percent absolute error. λ (log) is the log-transformed logarithmicity from the mixed log-linear model. |β1 - 1| is the corrected free parameter of the one-cycle cyclical power model. |β2 - 1| is the corrected free parameter of the two-cycle cyclical power model. WJ Fluency is the Woodcock-Johnson III Fluency subtest. WJ Calculation is the Woodcock-Johnson III Calculation subtest. WJ AP is the Woodcock-Johnson III Applied Problems subtest.

Table 2
Correlations Between NLE Task Accuracy Measures and Math Achievement Measures at Ages 12 and 15
Note. PAE (log) is the log-transformed percent absolute error. λ (log) is the log-transformed logarithmicity from the mixed log-linear model. |β1 - 1| is the corrected free parameter of the one-cycle cyclical power model. |β2 - 1| is the corrected free parameter of the two-cycle cyclical power model. WJ Fluency is the Woodcock-Johnson III Fluency subtest. WJ Calculation is the Woodcock-Johnson III Calculation subtest. WJ AP is the Woodcock-Johnson III Applied Problems subtest. *p < .05. **p < .01.
The debate about how to appropriately characterize performance on the NLE task has left an open question about how the theoretical stances reflected in the measurements differentially translate to prediction of math achievement. Does a method that describes the average error such as PAE predict math achievement better than a method designed to capture the logarithmic-to-linear shift such as the mixed log-linear model or a method designed to capture strategy use such as the cyclical power model (one-cycle or two-cycle)? The results of the analyses of this study provide several conclusions: 1) PAE, the mixed log-linear model, the one-cycle cyclical power model, and the two-cycle cyclical power model are highly correlated with one another; 2) the accuracy measures for each provide more predictive value for the Applied Problems subtest than the other math achievement measures when g is included as a predictor; and 3) differences in behavior genetic estimates are not noted among the accuracy measures.
First, the high correlations among the accuracy measures, especially in the 12-year-old sample, are notable given theoretical differences between the measures. Although the one-cycle cyclical power model and two-cycle cyclical power model account for performance based on strategy use, and the PAE and logarithmic-to-linear shift do not, the measures were still highly correlated. The highest correlation between the mixed log-linear model and one-cycle cyclical power model may be due to a similarity in the predicted shape of responding, in which both models capture overestimation of spaces on the lowest end of the number line. Such high correlations, especially between the mixed log-linear model and one-cycle cyclical power model, indicate that the different measures are mostly capturing the same variation.

Table 1.
Participants with values on any one NLE accuracy measure greater than 3 standard deviations from the sample mean (all on the low end of performance) were

Table 3
Coefficients of the Linear Mixed-Effect Models Predicting Achievement From NLE Task Accuracy Measures

Table 4.
The accuracy measures at age 12 were accounted for by nonshared environment/error (.51-.60) and genetics (.40-.49), and shared environment was not a contributing factor for any of the measures. In contrast, by age 15, performance on the number line was accounted for almost entirely by nonshared environment/error (.72-.96).
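For readers unfamiliar with how twin correlations map onto genetic, shared environment, and nonshared environment/error components, a rough sketch using Falconer's classic approximations is given here. This is an assumption made for illustration: the estimates reported in this paper may well come from model fitting rather than these formulas, and the function name and the example correlations are hypothetical, not values from Table 4.

```python
def falconer_ace(r_mz, r_dz):
    """Approximate variance components from twin-pair correlations.

    a2: additive genetic, c2: shared environment,
    e2: nonshared environment plus measurement error.
    These simple formulas can yield values outside [0, 1].
    """
    a2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return a2, c2, e2


# Hypothetical MZ and DZ correlations, for illustration only.
a2, c2, e2 = falconer_ace(r_mz=0.44, r_dz=0.24)
print(round(a2, 2), round(c2, 2), round(e2, 2))
```

The three components sum to 1 by construction, so a low MZ correlation pushes most of the variance into the nonshared environment/error term, which is the pattern described for age 15.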

Table 4
Correlations Between MZ Pairs and DZ Pairs and Estimates of Genetic, Shared Environment, and Non-Shared Environment and Error for NLE Accuracy Measures for Ages 12 and 15