Math Underachievement – Its Causes and Consequences
Poor numeracy may even be more harmful for an individual than poor literacy (Butterworth, Varma, & Laurillard, 2011). Considerable large-scale economic consequences of math underachievement have also been reported (OECD, 2010). Therefore, the cognitive and emotional causes of impaired numeracy are of major interest both theoretically and practically. Regarding emotional aspects, math anxiety (MA) is probably the most prominent and important variable that negatively correlates with math achievement. It has been extensively investigated since the 1960s (see Dowker, Sarkar, & Looi, 2016, for a review). A sizeable negative relationship between MA and math performance is evident in previous PISA (Programme for International Student Assessment) studies across virtually all countries involved in the program (see Lee, 2009, for PISA 2003). PISA 2012 (OECD, 2013) showed that 14% of variance in math performance was accounted for by students’ MA levels. In Poland, which is of particular importance for this study, it was 22%, one of the highest values among countries involved in the PISA programme. Nevertheless, MA levels seem to differ considerably between countries. Importantly, the same study revealed an increase in average MA levels in several countries compared to PISA 2003, although the relationship with math achievement did not change considerably. Thus, the problems associated with MA seem to be growing rather than diminishing on a global level. Therefore, current research should address the assessment of MA so that it can be measured, both online and in paper-and-pencil versions, as reliably and validly as possible. The goal of this study is to evaluate the validity and reliability of an online version of the Abbreviated Math Anxiety Scale (AMAS; Hopko, Mahadevan, Bare, & Hunt, 2003) outside the English-American culture (in a Slavic culture) and to provide norms and critical differences for practical diagnostic assessment.
Math Anxiety – A Specific Instance of Anxiety
Math anxiety refers to a wide range of negative emotional responses related to math and situations involving math (Ashcraft & Ridley, 2005). These responses vary from apprehension to dread and may appear both in academic situations involving math and in everyday life situations that require using math (e.g., financial transactions).
Importantly, MA cannot be subsumed under general anxiety or test anxiety. The evidence for its specificity comes from psychometric data both in adults (e.g., Ashcraft & Ridley, 2005) and in children (Carey, Hill, Devine, & Szűcs, 2017). Converging evidence also comes from physiological (e.g., Ashcraft, 2002) and neuroscientific studies (see Artemenko, Daroczy, & Nuerk, 2015, for discussion and some controversies), which can typically dissociate math anxiety and general anxiety.
However, MA cannot be subsumed under poor math skills either. It correlates negatively with math achievement, although only moderately (average correlation of -.27; see Ma, 1999, for a meta-analysis). In line with this moderate relationship, neurocognitive activation differences between high- and low-MA participants can be found even when math ability is matched (Pletzer, Kronbichler, Nuerk, & Kerschbaum, 2015). This confirms that MA and math skills need to be considered independent concepts and not just the cognitive and affective sides of the same coin. The relationship between MA and math achievement seems to be reciprocal: poor math skills increase MA, and MA in turn impairs math performance (Carey, Hill, Devine, & Szűcs, 2016). Therefore, it is not surprising that interventions aimed at reducing MA lead to improvement in math achievement (see Hembree, 1990, for a meta-analysis). Importantly, the relation between MA and math skills is not unitary, but may be additionally qualified by at least three factors.

First, Carey, Devine, Hill, and Szűcs (2017b) very recently proposed that this relationship may be influenced by the individual’s anxiety profile (i.e., the relationship between math anxiety and other types of anxiety) rather than by MA levels only.

Second, observed performance differences between high- and low-anxiety individuals largely depend on testing conditions; the relationship is stronger in formal testing, especially when individuals work under time pressure (Ashcraft & Faust, 1994).

Third, there is evidence for moderate genetic and non-shared environmental influences on MA (Wang et al., 2014), which might qualify the relation to math skills.
In sum, MA is a valid construct and cannot be reduced to anxiety in general or solely attributed to poor math skills. Studying this phenomenon is instructive for educational psychology and may contribute to our understanding of how anxiety affects (numerical) cognition in general (see, e.g., Suárez-Pellicioni, Núñez-Peña, & Colomé, 2016, for a review).
Gender Differences in Math Anxiety
Data on gender differences in MA are inconsistent across cultures, age groups, and studies. The majority of studies suggest that females (both adults and children) tend to have higher MA than males. Such differences (i.e., women revealing higher levels than men) have been reported in the U.S. (e.g., Hembree, 1990), the U.K. (Hunt, Clark-Carter, & Sheffield, 2011), Australia (Gyuris & Everingham, 2011), and, outside Anglo-Saxon countries, in Italy (Primi, Busdraghi, Tomasetto, Morsanyi, & Chiesi, 2014), Spain (Núñez-Peña, Suárez-Pellicioni, Guilera, & Mercadé-Carranza, 2013), and Poland (Cipora, Szczygieł, Willmes, & Nuerk, 2015a), for instance. In line with these findings, females sometimes performed worse on math tests than cognitive-capacity-matched males, and this poorer performance was attributed to higher MA (Devine, Fawcett, Szűcs, & Dowker, 2012). However, some studies report no gender differences (e.g., Baloğlu, 2003; see also Gyuris & Everingham, 2011) or even higher MA in males (see Devine et al., 2012, for a review). For instance, males scored higher in math anxiety in an Iranian sample (Vahedi & Farrokhi, 2011). These data suggest that gender differences observed in Western cultures may not be universal.
Similarly, cross-cultural variability in gender differences in MA was present in PISA 2012 (OECD, 2013). The effect size of this difference varied from d = 0.51 in favor of boys in Denmark and Switzerland (i.e., boys being characterized by lower MA than girls) to d = 0.19 in favor of girls in Qatar and Jordan (i.e., girls being characterized by lower MA). In Poland, the significant gender gap was small in size (d = 0.11 in favor of boys). Note, however, that PISA studies adolescents (15-year-olds), so the results may not be representative of adults. Indeed, a study on Polish adults (Cipora et al., 2015a) showed a much larger gender gap in MA (d = 0.61 in favor of males) than the PISA data of the same country (for possible reasons, see below).
In sum, because of the inconsistent findings across cultures, age groups, and studies, we need reliable data from different cultures to better characterize the cultural variation of MA and its gender differences. Given the strong cross-cultural differences reviewed above, these data should come from cultures other than the large Western cultures in which the vast majority of studies are conducted.
Structure of Math Anxiety
The structure of MA is a matter of scientific debate: 2-factor, 3-factor, and 6-factor models have been proposed; they are briefly outlined here:
2-factor models: Usually, MA is believed to comprise two factors: anxiety related to using math in everyday situations and anxiety related to being evaluated in math (Suinn & Edwards, 1982). Sometimes the first factor is related to anxiety in the context of learning math, while the second is more related to test anxiety. The two MA factors are highly correlated (Hopko, 2003). The two-factor model is the most popular and common model of MA.
3-factor models: A three-component model was proposed by Alexander and Martray (1989), with the factors (1) Math Test Anxiety, (2) Numerical Task Anxiety, and (3) Math Course Anxiety. Three factors also account best for the pattern of scores in the Mathematics Anxiety Scale – UK (MAS-UK), which was developed for the British population (Hunt et al., 2011). These factors were termed (1) Maths Evaluation Anxiety, (2) Everyday / Social Maths Anxiety, and (3) Maths Observation Anxiety. Sometimes abstraction anxiety (anxiety related to abstract mathematical content) is also considered to be another component of MA (Ma & Xu, 2004).
6-factor models: Analyzing several tools measuring MA, Kazelskis (1998) described six oblique factors of math anxiety: (1) Mathematics Test Anxiety, (2) Numerical Anxiety, (3) Negative Affect Toward Mathematics, (4) Worry, (5) Positive Affect Toward Mathematics, and (6) Mathematics Course Anxiety. This factor structure did not hold up in subsequent research. However, positive affect towards mathematics is sometimes considered a component of math anxiety (Bai, 2011). The MAS-R questionnaire developed by this latter author was shown to reliably measure two independent aspects of math anxiety – positive and negative. Based on a confirmatory factor analysis of the MARS30-Brief scale, Pletzer et al. (2016) conclude that a simple distinction between numerical anxiety and testing anxiety does not satisfactorily fit the data. The best fit was obtained by a model consisting of six factors: (1) Evaluation Anxiety 1 – Taking a Math Exam, (2) Evaluation Anxiety 2 – Thinking of an Upcoming Exam, (3) Learning Math Anxiety, (4) Everyday Numerical Anxiety, (5) Performance Anxiety, and (6) Social Responsibility Anxiety.
In sum, the factorial structure of math anxiety is relatively variable, and different structural proposals currently exist in the literature; however, most researchers agree that it is not a unidimensional construct.
AMAS – Abbreviated Math Anxiety Scale
One of the more recent instruments for measuring MA is the Abbreviated Math Anxiety Scale (AMAS) developed by Hopko et al. (2003). Since we are using the AMAS in this study, we outline its diagnostic properties in more detail.
Overall Characteristics
The AMAS is a 9-item questionnaire with very good psychometric properties. It takes less than 5 minutes to administer, making it very convenient as a screening tool and for research purposes. Apart from the total score, the AMAS comprises two scales: (1) anxiety related to learning math (Learning) and (2) anxiety related to being tested in math (Testing). This structure has been confirmed in several samples using both exploratory and confirmatory factor analyses (e.g., Hopko et al., 2003; Cipora et al., 2015a, 2015b).
Reliability
AMAS reliability, assessed by means of internal consistency (Cronbach’s alpha and ordinal alpha) and test-retest reliability, is mostly around or above .80. In the first study, Hopko et al. (2003) reported good internal consistencies (≥ .85) and test-retest reliability (≥ .78). These estimates were similar in Iranian (Cronbach’s alpha ≥ .75; Vahedi & Farrokhi, 2011), Italian (Cronbach’s alpha ≥ .80; Primi et al., 2014), Polish (Cronbach’s alpha ≥ .78; test-retest reliability ≥ .59; ordinal alpha ≥ .84; Cipora et al., 2015a, 2015b), German (Cronbach’s alpha for total score = .92; Dietrich, Huber, Moeller, & Klein, 2015), and Spanish (Cronbach’s alpha ≥ .87; Brown & Sifuentes, 2016) language versions. Reliability holds for adults, for high schoolers (Cronbach’s alpha ≥ .81 for Italian high schoolers, Primi et al., 2014, and ≥ .76 for Polish secondary and high schoolers, Cipora et al., 2015b), and for primary schoolers aged 8–11 (Cronbach’s alpha in an Italian sample ≥ .64; Caviola, Primi, Chiesi, & Mammarella, 2017). In a modified form, the AMAS was shown to reliably measure math anxiety in British 8- to 13-year-olds (Cronbach’s alpha > .74; ordinal alpha > .83; Carey, Hill, et al., 2017). It was also used in studies on English-speaking younger adolescents (11 years and older; however, reliability estimates for this age group were not provided; Devine et al., 2012). Adequate reliability estimates were also reported for the modified AMAS administered to Australian students (Cronbach’s alpha ≥ .83; Gyuris & Everingham, 2011; Gyuris, Everingham, & Sexton, 2012).
In sum, while AMAS reliability can be considered satisfactory for a personality questionnaire in general, there are some cultural variations and, in particular, some lower reliabilities for younger children.
Validity
In general, the AMAS has been shown to possess convergent and discriminant validity: It correlates strongly and positively with other MA measurement scales (MARS-R; .85), but only moderately with computer anxiety (.32), test anxiety (.52), and trait anxiety (.28) (Hopko et al., 2003). Factor analyses indicate sufficient construct validity of the two-factor structure both in adults (e.g., Hopko et al., 2003; Cipora et al., 2015a) and in adolescents (Carey, Hill, et al., 2017; Caviola et al., 2017; Cipora et al., 2015b). Similar indices of construct, convergent, and discriminant validity were reported for other language versions, including Farsi (e.g., a .61 correlation with a statistics anxiety measure; Vahedi & Farrokhi, 2011), Italian (e.g., a .57 correlation with test anxiety, Primi et al., 2014; a .32 correlation with the physiological anxiety subscale, Caviola et al., 2017), and Polish (e.g., a .33 correlation with trait anxiety; Cipora et al., 2015a, 2015b). In a large-scale study using item-level factor analyses, Carey, Hill, et al. (2017) showed that (slightly modified) AMAS items loaded on a separate factor among other factors representing other types of anxiety (various aspects of general anxiety and test anxiety).
The AMAS was also shown to have the expected pattern of correlations with more general differential psychology constructs, namely temperamental traits (e.g., a .48 correlation with Emotional Reactivity; no correlation with Sensory Sensitivity or Activity), as well as with attitudes towards math and other school subjects (e.g., a -.50 correlation with liking math and a .12 correlation with liking humanities; Cipora et al., 2015a, 2015b).
Importantly, the AMAS structure was shown to be gender-invariant, both for the original English version (Hopko et al., 2003) and for other language versions, including Farsi (Vahedi & Farrokhi, 2011), Italian (Caviola et al., 2017; Primi et al., 2014), and Polish (Cipora et al., 2015a).
The above results suggest that the AMAS can validly measure MA in various linguistic and cultural contexts. Therefore, the AMAS is also potentially suitable for (cross-cultural) online studies. However, with the rise of online assessment, validity has become a crucial question. In the next section, we will discuss commonalities and specificities of the online questionnaire measurement employed in this study compared to conventional assessment.
Online Questionnaire Measurement – Commonalities and Specificities Compared to Paper-and-Pencil Administration
Online psychological measurement is becoming more and more popular. The advantages and disadvantages of online studies are extensively discussed by Reips (2002; see in particular Table 1, p. 245; see also Reips, Buchanan, Krantz, & McGraw, 2015, for guidelines on online data interpretation). Online questionnaires are especially valuable for short surveys that take no more than 10–15 minutes, so that it is easy to recruit participants and the dropout rate is low (Reips, 2002).
It seems that the mode of administration does not necessarily influence the psychometric properties of measurement instruments, and that differences between administration forms observed when computerized measurement was first introduced disappear over time with growing familiarity with computers and the Internet. Close correspondence has been demonstrated between paper-and-pencil and computerized tests, although in these studies both versions were administered in the lab (see Gwaltney, Shields, & Shiffman, 2008, for a meta-analysis of therapy outcome measurement; Booth-Kewley, Edwards, & Rosenfeld, 1992, for computer anxiety; King & Miles, 1995, for socially desirable responding, equity sensitivity, and self-esteem). Similarly, strong correspondence was observed between the results of online surveys and questionnaires distributed via traditional mail, e.g., for personality measurement outcomes (Pettit, 2002) and for several aspects of physical health and well-being (Ritter, Lorig, Laurent, & Matthews, 2004). Nevertheless, these authors explicitly recommend collecting normative data for each administration form.
It seems that under some provisions, such as careful control of inclusion criteria, large samples, and post-hoc data filtering to exclude randomly responding participants, online testing can be used in psychometric studies. It also seems that in many, though possibly not all, instances, online and paper-and-pencil administration yield equivalent results.
Applications of Computerized / Online Surveys for Math Anxiety Measurement
Ashcraft and Faust (1994) administered a paper-and-pencil version of the MARS to half of their participants, while the other half received a computerized form of the same test; the only difference between the two versions was that in the computerized form, items were presented one by one. Unexpectedly, participants who received the computerized form scored significantly higher (d = 0.62) than those who received the paper-and-pencil form. The authors ascribed this difference to the fact that using a computer evoked additional anxiety in individuals who are math anxious. Nevertheless, despite the difference in means, the observed gender differences and the correlation with math problem solving efficiency were similar for both administration modes. However, this study was conducted in the early 1990s, when using computers was much less habitual than it is today – participants may have been more excited to be assessed via computer questionnaires than they would be today. Therefore, we believe it is important to reassess this limitation of online assessment validity, i.e., the claim that MA scores obtained in computerized procedures differ from those obtained in paper-and-pencil administration.
It is important to note that some studies have administered the AMAS in an online version. Jones, Childers, and Jiang (2012) administered the AMAS twice, first as a paper-and-pencil and then as an online survey. The reported test-retest reliability was very high (r = .91), providing initial evidence both for test-retest reliability and for the suitability of the AMAS for online measurement. However, validation of online measurement was not the objective of that study, and no other information was provided – thus, we do not know whether the higher scores reported by Ashcraft and Faust (1994) still hold, because high correlations are also possible when all participants score slightly higher in the online version. Ferguson, Maloney, Fugelsang, and Risko (2015) also administered the AMAS in an online format as part of a larger online battery aimed at investigating relationships between MA and spatial abilities. The study revealed an expected pattern of correlations; however, the validity of the online AMAS as compared to its paper-and-pencil version was not evaluated.
Objectives of the Study
The AMAS has previously been used in online administration mostly in English-speaking samples. Here, we evaluated the reliability and validity of the online AMAS against a different linguistic and cultural background, given the cross-cultural differences reported for offline versions of math anxiety assessment. However, as the above review showed, equivalence with paper-and-pencil administration cannot be taken for granted, so the construct validity and the norms of the online version will be compared to those of a paper-and-pencil version. Moreover, the comparison of online with paper-and-pencil assessment allows us to test whether early claims of differences in MA scores between computerized and paper-and-pencil administration (Ashcraft & Faust, 1994) still hold now that online assessment and computer use have become much more common.
Finally, for investigating any individual characteristic, both researchers and practitioners should be equipped with adequate normative data. Thus, following up on the initiative by Caviola et al. (2017), who published AMAS percentile norms for Italian primary school children, we prepared norms for the Polish AMAS based on the data we have collected to date (n = 2057), and we hope this initiative will be taken up by other researchers.
Method
Participants
A total of 615 participants (418 female, 197 male) with a mean age of 22.0 years (SD = 3.9; range 17–50) took part in the study.
Participants were recruited via email circulated through university mailing lists at several universities in various Polish cities (Kraków, Lublin, Warszawa, Toruń, Wrocław), as well as by sharing a link via social media. In the announcement, we invited individuals to participate in a study and, if possible, to share the invitation with other potentially interested individuals. The invitation stated that the whole procedure would take about 3 minutes and that the scale comprises only nine items. It also stated that the aim of the study was the validation of a questionnaire measuring a specific kind of anxiety in a Polish sample. All recipients were free to ignore the email and did not receive any compensation for their participation. All participants received the same link, thus ensuring anonymity.
Most participants were university students; however, some participants recruited via social media were employees. They represented a wide range of fields of study / occupations: psychology (n = 380), STEM (science, technology, engineering, and mathematics; n = 141; comprising engineering of varied specializations, computer science, architecture, etc.), neurobiology / neuropsychology / cognitive science (n = 54), humanities and pedagogy (n = 27; comprising pedagogy, philosophy, literature, etc.), and other (n = 11); two individuals did not answer this question. (The sample composition by gender can be obtained from the shared dataset by running the specified parts of the shared R code; see Supplementary Material.)
Materials and Procedure
The online version of the AMAS questionnaire was identical to the paper-and-pencil version in terms of content and scale format (see Cipora et al., 2015a). The only difference was that sections asking for a nickname, age, gender, and field of study were presented at the bottom of the website.
The form started with an instruction stating that the participant would see statements related to learning math and asking her or him to mark the level of anxiety each one evokes or would evoke. Responses were given on a 5-point Likert scale, with the extremes labelled mild anxiety and strong anxiety.
A Google Documents form with the default graphic theme was used. All items were marked as obligatory, and the entire questionnaire fit on one page. After clicking the link, participants were redirected to the form. The whole procedure lasted approximately 3 minutes.
The theoretical range of sum scores across all nine items (AMAS total score) was 9–45, whereas for the Learning (five items) and Testing (four items) scales it was 5–25 and 4–20, respectively.
Data Analysis
In the present work, we examined the reliability and construct validity of the AMAS questionnaire and compared the results of this online study with previous findings for paper-and-pencil administration, to evaluate commonalities and differences between the two administration forms.
To estimate AMAS reliability, we computed Cronbach’s alpha, one of the most popular reliability measures. Nevertheless, the suitability of this reliability measure has recently been strongly criticized. Critics point out that it is computed from Pearson correlation coefficients among items, which may be biased when the raw data are not continuous and the number of distinct item values is small, as is the case for Likert-type responses. Therefore, Cronbach’s alpha may underestimate reliability, and this problem can be even more severe for scales comprising only a small number of items, like the AMAS (Yang & Green, 2011). Another source of problems may be non-normal distributions of both true scores and error scores (Sheng & Sheng, 2012). Ordinal alpha, computed from polychoric correlation coefficients, is an alternative that addresses these issues. Polychoric correlations take into account the fact that the observed (i.e., Likert-type) data are discrete manifestations of a continuous latent construct (Zumbo, Gadermann, & Zeisser, 2007).
We investigated the AMAS factor structure by means of confirmatory factor analysis; namely, we examined whether the factor structure of the AMAS observed in the previous study could be replicated in an independent dataset collected with a different administration form. Additionally, we conducted an exploratory factor analysis (Appendix A) with oblique rotation, because the theoretical expectation is that different aspects of MA are inherently correlated.
Lastly, we compared the AMAS results obtained via online administration with the data obtained in a previous study (Cipora et al., 2015a) based on paper-and-pencil administration of the AMAS. For reliability estimates and correlations, we also calculated 95% confidence intervals to compare the estimates obtained via online and paper-and-pencil administration.
Analyses were conducted in R (R Core Team, 2017) and SPSS 24. The confirmatory factor analysis was conducted in AMOS 24 software (Arbuckle, 2016).
Data Availability
The raw data, along with the R script and SPSS syntax used for the analyses, can be found in the Supplementary Material.
Results
Descriptive Statistics
The average AMAS total score was 20.96 (SD = 6.22), below the theoretical scale midpoint of 27. The distribution deviated significantly from normality as assessed with the Shapiro-Wilk test, W(615) = 0.98, p < .001. Skewness was 0.43 (SE = 0.10) and kurtosis was 0.24 (SE = 0.20). Skewness divided by its standard error falls outside the ±2 range, so the distribution may be considered skewed; with respect to kurtosis, however, there was no significant departure from normality. The average score for the Learning scale was 7.99 (SD = 3.10), relatively close to the theoretical scale minimum of 5. A subgroup of 171 participants (27.8% of the sample) scored the scale minimum, and 331 participants (53.8%) scored 7 or less. The distribution deviated significantly from normality, W(615) = 0.86, p < .001. Skewness was 1.23 (SE = 0.10) and kurtosis was 1.63 (SE = 0.20); both skewness and kurtosis divided by their respective standard errors fell outside the ±2 range, indicating strong deviation from normality. For the Testing scale, the average score was 12.97 (SD = 3.94), above the theoretical scale midpoint of 12. This distribution also deviated significantly from normality, W(615) = 0.97, p < .001. Skewness was 0.17 (SE = 0.10) and kurtosis was 0.85 (SE = 0.20). Skewness divided by its SE falls within the ±2 range, so there is no major deviation from normality for this parameter, although this is not the case for the kurtosis estimate. Distributions of the total score as well as of both scale scores are depicted in Figure 1.
Figure 1
Descriptive statistics for all items individually are reported in Table 1. Running the reliability analysis in the supplementary R code yields a table with the proportion of each response alternative for each item; another part of the code generates a plot of the frequencies of the response alternatives per item. The Pearson correlation between the two scales was .55. The correlation between the Learning scale and the total score was .85, and the correlation between the Testing scale and the total score was .91.
For the total score as well as for both subscales, there were no differences in variance between genders (Levene’s tests, all ps > .139). Female participants scored higher than males: the average total score was 21.62 (SD = 6.33) for females and 19.54 (SD = 5.73) for males, t(613) = 3.91, p < .001, d = 0.34. Noteworthy, when the analysis was conducted separately within each field of study, the difference was present only in the STEM (i.e., science, technology, engineering, and mathematics) group (d = 0.53). More detailed statistics can be obtained by running the supplementary R code.
In the Learning scale, the mean score for females was 8.17 (SD = 3.22), and for males it was 7.59 (SD = 2.80), t(613) = 2.16, p = .031, d = 0.19. In the Testing scale, females scored 13.45 (SD = 3.95) whereas males scored 11.95 (SD = 3.73), t(613) = 4.48, p < .001, d = 0.39.
Reliability
Cronbach’s Alpha
Cronbach’s alpha reliability estimates were .84, .72, and .85 for the AMAS total, Learning, and Testing scales, respectively. The coefficients did not increase after exclusion of any item. The average inter-item correlation was .37, .34, and .59 for the total score, Learning, and Testing scales, respectively. Item characteristics are summarized in Table 1.
Table 1

| Item | Item content | Scale | M | SD | Item-total r (total)^a | Item-total r (Learning)^a | Item-total r (Testing)^a | SMC Learning (CFA) | SMC Testing (CFA) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Using tables | Learning | 1.51 | 0.88 | .38 | .36 | .32 | .19 | |
| 2 | Test one day before | Testing | 3.03 | 1.23 | .71 | .51 | .75 | | .75 |
| 3 | Watching teacher’s work | Learning | 1.67 | 0.96 | .57 | .57 | .45 | .52 | |
| 4 | Math exam | Testing | 3.64 | 1.15 | .61 | .39 | .70 | | .61 |
| 5 | Homework | Testing | 2.65 | 1.18 | .64 | .52 | .59 | | .53^b |
| 6 | Attending lecture | Learning | 1.64 | 0.92 | .48 | .53 | .36 | .50 | |
| 7 | Other student explaining math | Learning | 1.60 | 0.88 | .43 | .49 | .32 | .39 | |
| 8 | Pop quiz | Testing | 3.65 | 1.17 | .65 | .42 | .73 | | .68 |
| 9 | New chapter | Learning | 1.57 | 0.86 | .52 | .46 | .45 | .44 | |

^a For the total score and for the scale a given item is assigned to, corrected item-total correlations are provided (i.e., correlations with the scale score excluding the given item); for the other scale, bivariate correlations between the item and the scale score are presented. ^b Allocated to the scale with the higher loading.
Ordinal Alpha
In order to estimate ordinal alpha, we used the method proposed by Gadermann, Guhn, and Zumbo (2012). Ordinal alpha was .88, .80, and .88 for the total score, Learning, and Testing scales, respectively. Reliabilities did not increase after exclusion of any item.
Construct Validity: AMAS Factor Structure – Confirmatory Factor Analysis
In the second step of the analysis, we performed a confirmatory factor analysis on our data. We examined the fit of the model that had attained the best fit for our paper-and-pencil data (see Cipora et al., 2015a, 2015b). The model reflects the structure of the AMAS postulated by Hopko et al. (2003), with the only exception being item no. 5 (Homework), which was modeled to load on both latent variables. Nevertheless, the model comprising a single loading for this item also reached acceptable fit to the data.
The multivariate normality assumption was violated (multivariate kurtosis = 26.73; critical ratio = 23.55). Therefore, as for the paper-and-pencil data, we used the asymptotically distribution-free (ADF) method. This method is suited for modeling Likert-type data and may be considered equivalent to estimation based on polychoric correlations. For the same reason, CMIN/DF measures of model fit are not reported, since they are sensitive to violations of the normality assumption (Bedyńska & Książek, 2012). The model, together with standardized path coefficients, is presented in Figure 2. All parameters differed significantly from 0, which means that all factor loadings were significant. Noteworthy, the same model reached acceptable fit for female and male participants separately.
Figure 2
Apart from Item 1 (Using tables) and Item 5 (Homework), all path coefficients were at an acceptable level of > .60. Squared multiple correlations for each item are presented in Table 1; apart from Item 1, all of them are close to or above .40. The fit of the model was good (RMSEA = .079, 90% confidence interval .065–.093; AGFI = .895).
In the next step, we investigated the invariance of the factor structure across genders. As AMOS 24 does not allow multigroup analysis when the ADF method is used, we conducted another analysis using the maximum likelihood method. We estimated two nested models: an unconstrained model and a model with measurement weights constrained to be equal across genders. A formal comparison of χ² values revealed that the unconstrained model fit the data better than the constrained one (p = .001). On the other hand, using the χ² statistic to evaluate overall model fit is problematic (i.e., with large sample sizes the χ² statistic almost always rejects a model; Jöreskog & Sörbom, 1993), and using it for model comparison raises the same problems (Schermelleh-Engel, Moosbrugger, & Müller, 2003). The overall fit estimated by means of RMSEA was .056 for the unconstrained and .057 for the constrained model, with confidence intervals of .046–.067 and .047–.067, respectively.
Format Commonalities and Differences: Online Versus Paper-and-Pencil Administration
In order to further evaluate the online version of the AMAS questionnaire, we compared the results obtained for the AMAS online with a previous Polish AMAS study, in which data had been collected by paper-and-pencil administration (Cipora et al., 2015a). The comparison is presented in Table 2.
Table 2

| Measure | Scale | Online study | Paper-and-pencil | Comments |
|---|---|---|---|---|
| M (SD) | Total | 20.96 (6.22) | 21.93 (6.63) | t(1470) = 2.85, p = .005, d = 0.15; small effect size |
| | Learning | 7.99 (3.10) | 8.32 (3.67) | t(1470) = 1.90, p = .057, d = 0.10; small effect size |
| | Testing | 12.97 (3.94) | 13.61 (4.01) | t(1470) = 3.02, p = .003, d = 0.16; small effect size |
| Gender differences in total score | | significant, d = 0.34 | significant, d = 0.61 | Higher scores in females; larger gender difference in the paper-and-pencil setting |
| Correlation between scales | | .55 (.49–.60) | .49 (.44–.54) | Confidence intervals largely overlap; minimally higher correlation for online administration |
| Correlation Learning–total | | .85 (.83–.87) | .85 (.83–.87) | |
| Correlation Testing–total | | .91 (.90–.92) | .88 (.86–.89) | |
| Reliability – alpha (95% CI) | Total | .84 (.82–.86) | .85 (.83–.86) | Confidence intervals largely overlap; lower reliability for the Learning scale in online administration |
| | Learning | .72 (.68–.75) | .78 (.76–.80) | |
| | Testing | .85 (.83–.87) | .84 (.82–.86) | |
| Reliability – ordinal alpha | Total | .88 | .88 | Virtually identical estimates |
| | Learning | .80 | .84 | |
| | Testing | .88 | .87 | |
| Factor structure (exploratory FA, oblique rotation) | | 58.3% of variance explained by two-factor solution | 61.6% of variance explained by two-factor solution | Factor structure reflects original scales |
| Confirmatory FA | | RMSEA = .079; AGFI = .895 | RMSEA = .075; AGFI = .905 | The same structural model reaches acceptable fit with the observed data |
Mean scores differed between administration forms; however, the effect sizes of these differences can be considered small. In the online study, we also observed smaller gender differences than in the paper-and-pencil study.
Correlations between the scales, and between each scale and the total score, were very similar across administration forms, and their confidence intervals largely overlap. The same holds for all reliability estimates we used.
Norms
Having administered the AMAS in three separate studies (this contribution; Cipora et al., 2015a, 2015b; one additional study was not published, but its results were presented as a conference poster available at http://doi.org/10.17605/OSF.IO/QB768), we had data from 2057 individuals of varied age and education. Thus, it was possible to establish AMAS norms for different age groups and administration methods. Normative data were prepared for (1) secondary schoolers (grades 7–9); (2) high schoolers (grades 10–12); (3) adults tested with paper-and-pencil questionnaires; and (4) adults tested online. For each group, norms were prepared separately for female and male participants. All tables are presented in Appendix B.
Percentile Norms
Percentile norms indicate the percentage of participants who score below or at a given score (i.e., the percentage of participants scoring below the given raw score plus 50% of the participants obtaining that raw score, according to the proposal by Crawford, Garthwaite, and Slick, 2009). To make our norms comparable to those reported by Caviola et al. (2017), in the first step we prepared percentile norms (Table B1).
Standard Norms
Additionally, we prepared standard c-norms (M = 5, SD = 2), since a standard norm is required to compare the performance of an individual participant using inferential statistical procedures (cf. Willmes, 2010). The probability transform was used to approximate normal distributions for the originally skewed raw data, and smoothing of the raw score distribution was used to mimic a continuous variable. In order to obtain the z-quantile score for each raw score, half of the frequency of that score is subtracted from the cumulative frequency for that score; this percentile rank is then taken as the corresponding quantile of a standard normal variable (cf. Gulliksen, 1987). C-scores for adolescents are presented in Table B2 and those for adults in Table B3.
Standard norms allow for direct comparisons of a participant’s scores on the two scales (Willmes, 2010). Namely, one can check whether a difference in standard scores between the scales is unlikely (at a given alpha level) to originate purely from measurement error (the so-called reliability aspect). Furthermore, it is also possible to check whether a difference in standardized scores is unlikely (at a given alpha level) to occur in the reference population (the so-called diagnostic validity aspect).
The critical differences for both the reliability and the validity aspect should be calculated on so-called τ-standardized c-scores, which take into account possible differences in the reliabilities of the scales to be compared when evaluating differences in true performance level in a given reference group. These scores are reported in Tables B2 and B3 for adolescents and adults, respectively.
Table B4 presents critical values for differences between τ-standardized c-scores at various alpha levels. For diagnostic validity, it seems that a more liberal approach should be adopted, as it may be more important not to commit a type-II error (i.e., overlooking a possible true difference in performance).
Example
Here we present a step-by-step example of how to use the norms we provide. Participant A is an adult male who was administered the AMAS in the online form. His total raw score was 26; his raw scores on the Learning and Testing scales were 6 and 20, respectively. Based on Table B1, the percentile score corresponding to his total raw score is 80 (i.e., 80% of the reference population do not score higher). The percentile scores for the Learning and Testing scales are 30 and 95, respectively. His c-scores are 7 for the total score, 5 for the Learning scale, and an exceptionally high 11 for the Testing scale.
In the next step, one might test whether there is a difference in the true standardized scores obtained by Participant A on the Learning and Testing scales, i.e., whether this difference of 6 c-scores is unlikely to originate from measurement error alone (the reliability aspect). One first looks up the τ-standardized c-scores corresponding to Participant A’s raw scores in Table B3: 5.0 for the Learning scale and 11.71 for the Testing scale. Thus, the difference in τ-standardized c-scores equals 6.71. Subsequently, the obtained difference is compared with the critical values from Table B4. For the reliability aspect, the critical difference at a type-I error level of .05 equals 3.33. The difference observed for Participant A exceeds this value; thus, one may say with 95% confidence that the difference does not originate purely from measurement error.
One might additionally ask whether such a difference is unlikely in the reference population (the diagnostic validity aspect). Again, one compares the difference in τ-standardized c-scores (i.e., 6.71) with the respective critical value from Table B4. For the validity aspect, the critical difference at the 10% level in the corresponding group (adult males, online administration) equals 5.94. Thus, in our case, one may say that a difference this large or larger has a probability of less than 10% in the reference population.
Discussion
Overview
In this study, we investigated the usefulness of the AMAS as a tool for online measurement of math anxiety. The results from a large Polish sample provide further support for the validity of the math anxiety construct as well as the quality of the AMAS as a measurement instrument.
Such measurement method invariance across countries and cultures should not be assumed without empirical support. The vast majority of studies on similarities and differences between administration methods available to English-speaking readers were conducted within Anglo-Saxon cultures, including the data on AMAS measurement invariance obtained to date. Middle European countries, despite overall similarities, differ in several aspects from Anglo-Saxon cultures (e.g., Lee, 2009; OECD, 2013). Furthermore, one should also keep in mind that in Middle and Eastern European countries, personal computers and especially affordable fast Internet connections became available to the average user a few years later than in the U.S.
One may argue that general cultural differences between Polish and Anglo-Saxon cultures are rather small, and that measurement method invariance should therefore not be questioned even if we take into account possible differences in the popularity of personal computers. However, some characteristics of math anxiety in Polish adolescents, as shown in the PISA 2012 study (OECD, 2013), point to specifics that legitimize such studies. Namely, the observed relationship between MA and math performance is one of the strongest across all countries and economies involved in the PISA program. Furthermore, the gender gap was considerably below the PISA average. Because of these remarkable cultural specificities, it is far from clear that MA results from Western cultures generalize to other cultures, here a Slavic one. Importantly, the PISA study provides data on adolescents; much less is known about commonalities and differences in math anxiety in the adult population. We observed that the AMAS remained invariant across administration methods in Poland as well, with regard to construct validity (reflected by confirmatory factor analysis) and reliability estimates. This is good news for international assessment and cross-cultural research, because it seems that we are assessing largely the same construct, no matter whether we do so online or via paper-and-pencil administration.
Online Versus PaperandPencil Measurement
Ashcraft and Faust (1994) claimed that the higher MA scores observed with computerized administration were due to the fact that math-anxious individuals do not feel comfortable with computers. We hypothesized that, although this may have been relevant in the early 1990s, this claim is no longer valid. Nowadays, computers are much more ubiquitous and user-friendly, so they most likely do not induce additional anxiety, especially during typical computer usage (e.g., filling out an online survey, which does not differ from typical website activities). A comparison with a previous study (Cipora et al., 2015a) showed a small difference in AMAS scores depending on administration method. Notably, the difference we observed was in the opposite direction (i.e., participants administered the online form scored lower rather than higher on MA) to the effect reported by Ashcraft and Faust (1994). Nevertheless, the corresponding effect sizes were very small and, according to Cohen’s (1988) recommendations, should not be considered practically meaningful. Therefore, based on the results of the present study, we conclude that the AMAS can be administered online for practical screening and similar purposes. However, if researchers look for small differences between groups (cultures, age groups, genders, etc.), they should not administer the AMAS online to one group and via paper-and-pencil to the other.
AMAS Structure
The two-factor structure of the AMAS proposed by its authors was present in Polish participants as well. The only difference is that the item referring to homework can be assigned to both scales. This makes sense, because homework has a learning aspect (math is practiced at home), but it can also have a testing aspect, e.g., when participants anticipate that the teacher will later check their work.
The overall stability of the factorial structure can be considered further evidence of AMAS validity. However, a stable two-factor structure of the AMAS does not imply a two-component structure of the construct of interest; it is still possible that several aspects of MA are not covered by AMAS items. For instance, Pletzer et al. (2016) showed that MA consists of more than two components, e.g., anxiety related to being tested and anxiety related to the anticipation of an upcoming math test. This is of particular importance when it comes to interventions aimed at reducing particular aspects of MA (see Pletzer et al., 2016).
However, the AMAS is much shorter than the instrument used by Pletzer et al. (2016). Some potential aspects of math anxiety, such as social responsibility, are not covered in the AMAS; others may be covered only sparsely (e.g., by single items). Therefore, seemingly inconsistent factor structures can be reconciled if one acknowledges that the structure depends on the instrument and the exact questions asked in it. This discrepancy calls for a theoretical debate about which items should be included in math anxiety assessment. Currently, it appears that despite its usefulness for screening and research purposes, the usefulness of the AMAS for individual diagnosis, especially for differential intervention planning, should be tested in future studies.
Gender Differences in Math Anxiety
Compared to the paper-and-pencil study by Cipora et al. (2015a), gender differences in AMAS scores were considerably smaller. Interestingly, we found that gender differences depended on the field-of-study category and were in fact present only in the STEM group, but not in the other field-of-study categories for which we could perform such calculations (i.e., psychology and the neurobiology / neuropsychology / cognitive science category). This is also in line with the results of the paper-and-pencil study (Cipora et al., 2015a), in which no gender differences were observed when only psychology students were considered. However, an individual’s selection of a field of study is, among other factors, guided by attitudes, interests, stereotypes, and math anxiety (Ashcraft & Ridley, 2005). Therefore, this issue requires further attention as far as gender differences in MA are concerned. One must also keep in mind that online assessment may involve a selection bias, such that individuals who are not representative of the general population (and of their respective genders) decide to participate. Unfortunately, in the current study setup, we were not able to obtain data on the number of people who clicked the link but then decided not to submit their answers.
In general, our results suggest that one must be very cautious regarding claims about gender differences in MA. In particular, they provide some hints about the reasons for inconsistent findings on gender differences in math anxiety: these inconsistencies may, at least to some extent, originate from differences in the fields of study represented in the samples tested. Furthermore, one must keep in mind that the size of gender differences varies largely between countries and cultures, as also shown in the PISA study.
Results of both exploratory (Appendix A) and confirmatory factor analyses suggest that the factor structure does not differ substantially between genders; that is, apart from gender differences in average math anxiety, the internal structure of the construct remains unchanged across genders.
Limitations of the Presented Study
First of all, our sample was very homogeneous in terms of educational background. Like many previous studies (and psychology studies in general), we tested mostly young adult university students, which limits extrapolation of our results to the general population. On the other hand, the scores obtained by our participants cover a large part of the AMAS theoretical score range, so reliability estimates and item characteristics are not affected by low systematic variability in the data.
Nevertheless, it would be worthwhile to test more varied samples, including people differing in their highest level of education. In particular, it would be useful to test participants who are not well familiarized with psychological measurement tools in general.
Another important limitation of the present study is that it did not include any other measure, which precludes investigating convergent and discriminant validity. It seems relatively unlikely that two administration forms that are invariant regarding factor structure, reliabilities, mean scores, and variances would differ in their correlations with external measures; nevertheless, this should be addressed in future studies.
Importantly, our study did not involve any math-related activity. The participants were only asked to recall from memory how anxious they feel or would feel when confronted with several math-related situations. As thoroughly discussed by Bieg, Goetz, Wolter, and Hall (2015), such trait-like anxiety ratings can differ from those obtained after anxiety-evoking exposure: trait-like ratings are usually overestimates. Therefore, studies in which participants were or were not exposed to math, and were assessed for their MA before or after the exposure, cannot be considered equivalent a priori. This also requires future investigation.
General Conclusions
The results of our study provide further evidence for the usefulness of the AMAS as a math anxiety measure. Cross-cultural invariance, together with measurement method invariance, strongly suggests the validity and generalizability of the MA construct. Interestingly, average AMAS scores reported from different countries also seem to be relatively similar (Dykeman, 2017). Furthermore, the combined results of numerous studies suggest that the AMAS validly reflects the MA construct, although further research is still needed to examine the fine-grained structure of MA. The AMAS also appears useful as an MA measurement instrument in varied settings, i.e., paper-and-pencil, computerized, and online. Therefore, it may serve as an additional measure in various fields of numerical cognition research, since there is a growing body of evidence that MA is involved in several aspects of human number processing. In order to make practical use of what we know about math anxiety, it is important that practitioners be equipped with normative data so that they can make more informed decisions about individuals potentially at risk of high math anxiety. We hope that future researchers will also make the extra effort to prepare adequate AMAS norms for further populations.