Norms and Validation of the Online and Paper-and-Pencil Versions of the Abbreviated Math Anxiety Scale (AMAS) for Polish Adolescents and Adults

Cipora, Willmes, Szwarc, & Nuerk

The Abbreviated Math Anxiety Scale (AMAS) is one of the most popular instruments measuring math anxiety (MA). It has been validated across several linguistic and cultural contexts. In this study, we investigated the extent of administration method invariance of the AMAS by comparing results (average scores, reliabilities, factorial structure) obtained online with those from paper-and-pencil. We administered the online version of the AMAS to Polish students. Results indicate that psychometric properties of the AMAS do not differ between online and paper-and-pencil administration. Additionally, average scores of the AMAS did not differ considerably between administration forms, contrary to previous results showing that computerized measurement of MA leads to higher scores. Therefore, our results provide evidence for the usefulness of the AMAS as a reliable and valid MA measurement tool for online research and online screening purposes across cultures and also large similarity between administration forms outside an American-English linguistic and cultural context. Finally, we provide percentile and standard norms for the AMAS for adolescents and adults (in the latter case for both online and paper-and-pencil administration) as well as critical differences for the comparison of both subscales in an individual participant for practical diagnostic purposes.

3-factor models: A three-component model was proposed by Alexander and Martray (1989) with the factors (1) Math Test Anxiety, (2) Numerical Task Anxiety, and (3) Math Course Anxiety. Three factors also account best for the pattern of scores in the Mathematics Anxiety Scale-UK (MAS-UK), which was developed for the British population (Hunt et al., 2011). These factors were termed (1) Maths Evaluation Anxiety, (2) Everyday/Social Maths Anxiety, and (3) Maths Observation Anxiety. Sometimes abstraction anxiety (anxiety related to abstract mathematical content) is also considered another component of MA (Ma & Xu, 2004).
6-factor models: Analyzing several tools measuring MA, Kazelskis (1998) described six oblique factors of math anxiety: (1) Mathematics Test Anxiety, (2) Numerical Anxiety, (3) Negative Affect Toward Mathematics, (4) Worry, (5) Positive Affect Toward Mathematics, and (6) Mathematics Course Anxiety. This factor structure did not hold up in subsequent research. However, positive affect towards mathematics is sometimes considered a component of math anxiety (Bai, 2011). The MAS-R questionnaire developed by this latter author was shown to reliably measure two independent aspects of math anxiety: positive and negative. Based on a confirmatory factor analysis of the MARS30-Brief scale, Pletzer et al. (2016) concluded that a simple distinction between numerical anxiety and testing anxiety does not satisfactorily fit the data. The best fit was obtained by a model consisting of six factors: (1) Evaluation Anxiety 1 (proper): Taking a Math Exam, (2) Evaluation Anxiety 2: Thinking of an Upcoming Exam, (3) Learning Math Anxiety, (4) Everyday Numerical Anxiety, (5) Performance Anxiety, and (6) Social Responsibility Anxiety.
In sum, the factorial structure of math anxiety is relatively variable, and different structural proposals currently exist in the literature; however, most researchers agree that it is not a unidimensional construct.

AMAS: Abbreviated Math Anxiety Scale
One of the more recent instruments for measuring MA is the Abbreviated Math Anxiety Scale (AMAS) developed by Hopko et al. (2003). Since we use the AMAS in this study, we outline its diagnostic properties in more detail.

Overall Characteristics
The AMAS is a 9-item questionnaire characterized by very good psychometric properties. It takes less than 5 minutes to administer, making it very convenient as a screening tool and for research purposes. Apart from the total score, the AMAS comprises two scales: (1) anxiety related to learning math (Learning) and (2) anxiety related to being tested in math (Testing). This structure was confirmed in several samples using both exploratory and confirmatory factor analyses (e.g., Hopko et al., 2003; Cipora et al., 2015a, 2015b).
In sum, while AMAS reliability can be considered satisfactory for a personality questionnaire in general, there are some cultural variations and, in particular, somewhat lower reliabilities for younger children.
The AMAS was also shown to have the expected pattern of correlations with more general differential psychology constructs, namely temperamental traits (e.g., a .48 correlation with Emotional Reactivity; no correlation with Sensory Sensitivity or Activity), as well as with attitudes towards math and other school subjects (e.g., a -.50 correlation with liking math and a .12 correlation with liking humanities; Cipora et al., 2015a, 2015b).
Importantly, the AMAS structure was shown to be gender-invariant, both for the original English version (Hopko et al., 2003) and for other language versions, including Farsi (Vahedi & Farrokhi, 2011), Italian (Caviola et al., 2017; Primi et al., 2014), and Polish (Cipora et al., 2015a).
The above results suggest that the AMAS can validly measure MA in various linguistic and cultural contexts.
Therefore, the AMAS is also potentially suitable for (cross-cultural) online studies. However, with the rise of online assessment, validity has become a crucial question. In the next section, we discuss commonalities and specificities of the online questionnaire measurement employed in this study compared to conventional assessment.

Online Versus Paper-and-Pencil Administration
Online psychological measurement is becoming more and more popular. Advantages and disadvantages of online studies are extensively discussed by Reips (2002; see in particular their Table 1, p. 245; see also Reips, Buchanan, Krantz, & McGraw, 2015, for guidelines on online data interpretation).

Objectives of the Study
The AMAS has previously been used in online administration mostly in English-speaking samples. Here, we evaluated the reliability and validity of the AMAS online version against a different linguistic and cultural background, given the cross-cultural differences reported for offline versions of math anxiety assessment. As the above review showed, equivalence with paper-and-pencil administration cannot be taken for granted; therefore, the construct validity and the norms of the online version are compared to the paper-and-pencil version.
Moreover, comparing online with paper-and-pencil assessment allows us to test whether early claims of differences in MA scores between computer and paper-and-pencil administration (Ashcraft & Faust, 1994) still hold now that online assessment and computer use have become much more common.
Finally, for investigating any individual characteristics, both researchers and practitioners should be equipped with adequate normative data. Thus, following up on the initiative of Caviola et al. (2017), who published AMAS percentile norms for Italian primary school children, we prepared norms for the Polish AMAS based on the data we have collected to date (n = 2057), and we hope this initiative will be taken up by other researchers.

Method

Participants
A total of 615 participants (418 female and 197 male; mean age 22.0 years, SD = 3.9, range 17-50) took part in the study.
Participants were recruited via email circulated through university mailing lists at several universities in various Polish cities (Kraków, Lublin, Warszawa, Toruń, Wrocław), as well as by sharing a link via social media. In the announcement, we invited individuals to participate in the study and, if possible, to share the invitation with other potentially interested individuals. The invitation stated that the whole procedure would take about 3 minutes and that the scale comprises only nine items. It also stated that the aim of the study was to validate a questionnaire measuring a specific kind of anxiety in a Polish sample. All recipients were free to ignore the email and did not receive any compensation for participation. All participants received the same link, ensuring anonymity of the data.
Most participants were university students; however, some participants recruited via social media were employees. They represented a varied range of fields of study / occupations: psychology (n = 380); STEM (science, technology, engineering, and mathematics; n = 141; comprising engineering of varied specializations, computer science, architecture, etc.); neurobiology / neuropsychology / cognitive science (n = 54); humanities and pedagogy (n = 27; comprising pedagogy, philosophy, literature, etc.); other (n = 11); two individuals did not answer this question. (In the shared dataset, one can access the study sample composition by gender by running the specified parts of the shared R code; see Supplementary Material.)

Materials and Procedure
The online version of the AMAS questionnaire was identical to the paper-and-pencil version in terms of content and scale format (see Cipora et al., 2015a). The only difference was that sections for a nickname, age, gender, and field of study were presented at the bottom of the website. The form started with an instruction stating that the participant would see some statements related to learning math and asking her or him to mark the level of anxiety each statement evokes or would evoke. Responses were given on a 5-point Likert scale whose extremes were labelled "mild anxiety" and "strong anxiety".
The Google Documents form was used with the default graphic layout. All items were marked as obligatory, and the entire questionnaire fit on one page. After clicking the link, participants were redirected to the Google Documents form. The whole procedure lasted approximately 3 minutes.
The theoretical range of sum scores across all nine items (AMAS total score) was 9-45, whereas for the Learning and Testing scales it was 5-25 and 4-20, respectively.
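As a concrete illustration of the scoring just described, the sum scores can be computed as follows. This is a minimal Python sketch; note that the item-to-scale assignment shown (items 1, 3, 6, 7, 9 on Learning; items 2, 4, 5, 8 on Testing) is the commonly reported AMAS structure and is an assumption here, not taken from this article.

```python
# Minimal AMAS scoring sketch. Responses are 1-5 Likert ratings for items 1-9.
LEARNING_ITEMS = [1, 3, 6, 7, 9]  # assumed item assignment (5 items, range 5-25)
TESTING_ITEMS = [2, 4, 5, 8]      # assumed item assignment (4 items, range 4-20)

def score_amas(responses):
    """responses: dict mapping item number (1-9) to a rating in 1..5."""
    if sorted(responses) != list(range(1, 10)):
        raise ValueError("expected answers for all nine items")
    if not all(1 <= r <= 5 for r in responses.values()):
        raise ValueError("ratings must be on the 1-5 Likert scale")
    learning = sum(responses[i] for i in LEARNING_ITEMS)
    testing = sum(responses[i] for i in TESTING_ITEMS)
    return {"learning": learning, "testing": testing, "total": learning + testing}

# A respondent answering 3 to every item scores 15 + 12 = 27, the scale midpoint.
print(score_amas({i: 3 for i in range(1, 10)}))
```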

Data Analysis
In the present work, we examined the reliability and construct validity of the AMAS questionnaire. We compared the AMAS results obtained in this online study with previous findings for paper-and-pencil administration to evaluate commonalities and differences between the two administration forms.
To estimate AMAS reliability, we computed Cronbach's alpha, one of the most popular reliability measures. Nevertheless, the adequacy of this reliability measure has recently been strongly criticized. Critics point out that it is computed using Pearson correlation coefficients among items, which may be biased when the raw data are not continuous and the number of different item-scale values is small, as is the case for Likert-type responses. Therefore, Cronbach's alpha may underestimate reliability, and this problem can be even more severe for scales comprising only a small number of items, like the AMAS (Yang & Green, 2011). Another source of problems may originate from non-normal distributions of both true scores and error scores (Sheng & Sheng, 2012). Ordinal alpha, computed using the polychoric correlation coefficient, is an alternative that addresses these issues. Polychoric correlations take into account the fact that the observed data (i.e., Likert-type data) are discrete manifestations of a continuous latent construct (Zumbo, Gadermann, & Zeisser, 2007).
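For reference, the conventional Cronbach's alpha discussed above can be computed directly from item-level data using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). This is a minimal sketch in Python; ordinal alpha additionally requires estimating polychoric correlations, which is beyond the scope of this illustration.

```python
import statistics

def cronbach_alpha(items):
    """Conventional Cronbach's alpha from item-level data.

    items: list of k lists, each holding one item's scores across n respondents.
    """
    k = len(items)
    n = len(items[0])
    # Sample variance of each item, and of the respondents' total (sum) scores.
    item_vars = [statistics.variance(item) for item in items]
    totals = [sum(item[j] for item in items) for j in range(n)]
    total_var = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Three perfectly parallel items yield alpha of (approximately) 1.
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]))
```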
We investigated the AMAS factor structure by means of confirmatory factor analysis. Namely, we examined whether the factor structure of the AMAS observed in the previous study could be replicated in an independent dataset collected with a different administration form. Additionally, we conducted an exploratory factor analysis (Appendix A) with oblique rotation, because the theoretical expectation is that different aspects of MA are inherently correlated.
Lastly, we compared the AMAS results obtained via online administration with the data obtained in a previous study (Cipora et al., 2015a) based on paper-and-pencil administration of the AMAS. For reliability estimates and correlations, we also calculated 95% confidence intervals to compare the values obtained via online and paper-and-pencil administration.
Analyses were conducted in R (R Core Team, 2017) and SPSS 24. The confirmatory factor analysis was conducted in AMOS 24 (Arbuckle, 2016).

Results

Descriptive Statistics
The average AMAS total score was 20.96 (SD = 6.22), slightly below the theoretical scale midpoint of 27. The distribution deviated significantly from normality as assessed with the Shapiro-Wilk test, W(615) = 0.98, p < .001. Skewness was 0.43 (SE = 0.10) and kurtosis was -0.24 (SE = 0.20). Skewness divided by its standard error falls outside the ±2 range, so the distribution may be considered skewed; with respect to kurtosis, however, there was no significant departure from normality. The average score for the Learning scale was 7.59. Skewness divided by its SE falls within the ±2 range, so there is no major deviation from normality for this parameter, although this is not the case for the kurtosis estimate. Distributions of the total score as well as of both scales are depicted in Figure 1. Descriptive statistics for all items individually are reported in Table 1. By conducting the reliability analysis with the supplementary R code, one can access a table containing the proportion of each response alternative for each item. For the total score as well as for both subscales, there were no differences in variance between genders (Levene's tests, all ps > .139). Female participants scored higher than males: the average total score for females was 21.62 (SD = 6.33) and for males it was 19.54 (SD = 5.73), t(613) = -3.91, p < .001, d = 0.34.
Notably, when the analysis was conducted separately within each field of study, gender differences were present in the STEM (science, technology, engineering, and mathematics) group only (d = 0.53). More detailed statistics can be obtained by running the supplementary R code.
In the Learning scale, the mean score for females was 8.17 (SD = 3.22), and for males it was 7.59 (SD = 2.80).

Table 1 notes: (a) For the correlation with the total score and with the scale a given item is assigned to, corrected item-total correlations are provided (i.e., correlations with the scale score excluding the given item); for correlations with the other scale, bivariate correlations between the item and the scale score are presented. (b) Allocated to the scale with the higher loading.

Ordinal Alpha
To estimate ordinal alpha, we used the method proposed by Gadermann, Guhn, and Zumbo (2012).
Ordinal alpha was .88, .80, and .88 for the total score, Learning, and Testing scales, respectively. Reliabilities did not increase after excluding any item.

Construct Validity: AMAS Factor Structure -Confirmatory Factor Analysis
In the second step of the analysis, we performed a confirmatory factor analysis on our data. We examined the fit of the model that had attained the best fit for our paper-and-pencil data (see Cipora et al., 2015a, 2015b). The model reflects the structure of the AMAS postulated by Hopko et al. (2003), with the only exception being item no. 5 (Homework), which was modeled to contribute to both latent variables. Nevertheless, the model comprising a single loading for this item also reached acceptable fit to the data.
The multivariate normality assumption was violated (multivariate kurtosis = 26.73; critical ratio = 23.55). Therefore, as for the paper-and-pencil data, we used the asymptotically distribution-free (ADF) method. This method is suited for modeling Likert-type data and may be considered equivalent to estimation based on polychoric correlations. For the same reason, CMIN/DF measures of model fit are not reported, since they are sensitive to violations of the normality assumption (Bedyńska & Książek, 2012). The model, together with standardized path coefficients, is presented in Figure 2. All parameters differed significantly from 0, which means that all factor loadings were significant. Notably, the same model reached acceptable fit for the female and male participant groups separately. In the next step, we investigated invariance of the factor structure across genders. As AMOS 24 does not allow multi-group analysis when the ADF method is used, we conducted another analysis using the maximum likelihood method. We estimated two nested models: an unconstrained model and a model with the same measurement weights across genders. Formal comparison of χ2 values revealed that the unconstrained model fit the data better than the constrained one (p = .001). On the other hand, using the χ2 statistic to evaluate overall model fit is problematic (i.e., with large sample sizes the χ2 statistic almost always rejects a model; Jöreskog & Sörbom, 1993), and using it for model comparison raises the same problems (Schermelleh-Engel, Moosbrugger, & Müller, 2003). Overall fit estimated by means of RMSEA was .056 for the unconstrained and .057 for the constrained model, with confidence intervals of .046-.067 and .047-.067, respectively.

Format Commonalities and Differences: Online Versus Paper-and-Pencil Administration
To further evaluate the online version of the AMAS questionnaire, we compared the results obtained for the AMAS online with a previous Polish AMAS study in which data had been collected by paper-and-pencil administration (Cipora et al., 2015a). The comparison is presented in Table 2. Mean scores differed between administration forms; however, the effect sizes of these differences can be considered small. In the online study, we observed smaller gender differences than in the paper-and-pencil study.
Correlations between the scales and between each scale and the total score were very similar, and their confidence intervals largely overlap. The same is true for all reliability estimates we used.

Norms
Having administered the AMAS in three separate studies (this contribution; Cipora et al., 2015a, 2015b; the latter study was not published, but its results were presented as a conference poster, available at http://doi.org/10.17605/OSF.IO/QB768), we had data from 2057 individuals of varied age and education. Thus, it was possible to establish AMAS norms for different age groups and administration methods. Normative data were prepared for (1) secondary schoolers (grades 7-9); (2) high schoolers (grades 10-12); (3) adults tested with paper-and-pencil questionnaires; and (4) adults tested online. For each group, norms were prepared for female and male participants separately. All tables are presented in Appendix B.

Percentile Norms
Percentile norms indicate the percentage of participants who score below and up to a given score (i.e., the percentage of participants scoring below a given raw score plus 50% of the participants obtaining that raw score, following the proposal by Crawford, Garthwaite, and Slick, 2009). To make our norms comparable to those reported by Caviola et al. (2017), in the first step we prepared percentile norms (Table B1).
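The percentile-rank definition just given (percentage scoring below a raw score, plus half of those obtaining it) can be sketched as a small Python function; the sample data in the usage line are invented for illustration only.

```python
def percentile_rank(raw_scores, score):
    """Percentage of the normative sample scoring below `score`,
    plus half of those obtaining exactly `score` (mid-point convention)."""
    below = sum(s < score for s in raw_scores)
    at = sum(s == score for s in raw_scores)
    return 100 * (below + 0.5 * at) / len(raw_scores)

# One respondent below 12 and two at 12, out of five: (1 + 0.5 * 2) / 5 = 40.0
print(percentile_rank([10, 12, 12, 15, 20], 12))
```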

Standard Norms
Additionally, we prepared standard c-norms (M = 5; SD = 2), since a standard norm is required to compare the performance of an individual participant using inferential statistical procedures (cf. Willmes, 2010).
The probability transform was used to approximate normal distributions for the originally skewed raw data, and smoothing of the raw score distribution was used to mimic a continuous variable. To obtain the z-quantile score for each raw score, half of the frequency of that score is subtracted from its cumulative frequency; the resulting percentile rank is taken as the corresponding quantile of a standard normal variable (cf. Gulliksen, 1987). C-scores for adolescents are presented in Table B2 and those for adults in Table B3.
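The transform described above can be sketched as follows, assuming the normative raw scores are available as a list. This is an illustration only: the published norms in Tables B2 and B3 additionally involve smoothing of the raw score distribution, which is omitted here.

```python
from statistics import NormalDist

def c_score(raw_scores, score):
    """Standard c-score (M = 5, SD = 2) for `score` via the probability transform."""
    n = len(raw_scores)
    cum = sum(s <= score for s in raw_scores)   # cumulative frequency up to score
    freq = sum(s == score for s in raw_scores)  # frequency at exactly this score
    p = (cum - 0.5 * freq) / n                  # mid-point percentile rank in (0, 1)
    z = NormalDist().inv_cdf(p)                 # corresponding standard normal quantile
    return round(5 + 2 * z)                     # rescale to M = 5, SD = 2

# The median raw score of a sample maps onto the c-scale mean of 5.
print(c_score(list(range(1, 100)), 50))
```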
Standard norms allow for direct comparisons between a participant's scores on both scales (Willmes, 2010). Namely, one can check whether a difference in standard scores between the scales is unlikely (at a given alpha level) to originate purely from measurement error (the so-called reliability aspect). Furthermore, it is also possible to check whether a difference in standardized scores is unlikely (at a given alpha level) to occur in the reference population (the so-called diagnostic validity aspect).
Critical differences for both the reliability and the validity aspect should be calculated on so-called τ-standardized c-scores, which take into consideration possible differences in the reliabilities of the scales to be compared with regard to differences in true performance level in a given reference group. These scores are reported in Tables B2 and B3 for adolescents and adults, respectively.
Critical values for differences between τ-standardized c-scores at various alpha levels are presented in Table B4. For the diagnostic validity aspect, a more liberal approach may be adopted, as it may be more important not to commit a type-II error (i.e., overlooking a possible true difference in performance).

Example
Here we present a step-by-step example of how to use the norms we provide. Participant A is an adult male and was administered the AMAS in the online form. His total raw score was 26; his raw scores on the Learning and Testing scales were 6 and 20, respectively. Based on Table B1, the percentile score corresponding to his total raw score is 80 (i.e., 80% of the reference population do not score higher). Percentile scores for the Learning and Testing scales are 30 and 95, respectively. His c-scores are 7 for the total score, 5 for the Learning scale, and an exceptionally high 11 for the Testing scale.
In the next step, one might test whether there is a difference in true standardized scores obtained by participant A on the Learning and Testing scales, that is, whether this difference of 6 c-score points is unlikely to originate from measurement error alone (the reliability aspect). One first looks up the τ-standardized c-scores corresponding to participant A's raw scores in Table B3: for the Learning scale it is 5.0 and for the Testing scale it is 11.71, so the difference in τ-standardized c-scores equals 6.71. This difference is then compared with the critical values from Table B4. For the reliability aspect, the critical difference at a type-I error level of .05 equals 3.33. The difference observed for participant A exceeds this value; thus, one may say with 95% confidence that the difference does not originate purely from measurement error.
One might additionally ask whether such a difference is unlikely in the reference population (the diagnostic validity aspect). Again, one compares the difference in τ-standardized c-scores (i.e., 6.71) with the respective critical value from Table B4. For the validity aspect, the critical difference at the 10% level in the corresponding group (adult males, online administration) equals 5.94. Thus, in our case one may say that a score difference this large or larger has a probability of less than 10% in the reference population.
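The two comparisons in this worked example reduce to a simple threshold check; the τ-standardized c-scores (5.0 and 11.71) and critical values (3.33 and 5.94) below are the ones quoted above from Tables B3 and B4.

```python
def exceeds_critical(tau_c_a, tau_c_b, critical):
    """True if the absolute difference between two tau-standardized
    c-scores exceeds the given critical value from the norm tables."""
    return abs(tau_c_a - tau_c_b) > critical

# Participant A: Learning 5.0 vs. Testing 11.71, i.e., a difference of 6.71.
print(exceeds_critical(5.0, 11.71, 3.33))  # reliability aspect, alpha = .05
print(exceeds_critical(5.0, 11.71, 5.94))  # diagnostic validity aspect, 10%
```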

Overview
In this study, we investigated the usefulness of the AMAS as a tool for online measurement of math anxiety.
The results from a large Polish sample provide further support for the validity of the math anxiety construct as well as the quality of the AMAS as a measurement instrument.
Such measurement method invariance across countries and cultures should not be assumed without empirical support. The vast majority of studies on similarities and differences between administration methods available to English-speaking readers were conducted within Anglo-Saxon cultures, including the data on AMAS measurement invariance obtained to date. Middle European countries, despite overall similarities, differ in several aspects from Anglo-Saxon cultures (e.g., Lee, 2009; OECD, 2013). Furthermore, one should also keep in mind that in Middle and Eastern European countries, personal computers and especially affordable fast Internet connections became available to the average user a few years later than in the U.S.
One may argue that general cultural differences between Polish and Anglo-Saxon cultures are rather small and that measurement method invariance should therefore not be questioned, even taking into account possible differences in the popularity of personal computers. However, some characteristics of math anxiety in Polish adolescents, as shown in the PISA 2012 study (OECD, 2013), point to specifics that legitimize such studies. Namely, the observed relationship between MA and math performance is among the strongest across participating countries. There is also an ongoing theoretical debate about which items should be included in math anxiety assessment. Currently, it appears that despite its usefulness for screening and research purposes, the usefulness of the AMAS for individual diagnosis, especially for differential intervention planning, should be tested in future studies.

Gender Differences in Math Anxiety
Compared to the paper-and-pencil study by Cipora et al. (2015a), gender differences in AMAS scores were considerably smaller. Interestingly, gender differences depended on the field-of-study category and were in fact present only in the STEM group, but not in the other field-of-study categories for which we could perform such calculations (i.e., the psychology and "neuropsychology, neurobiology, and cognitive science" categories). This is in line with the results of the paper-and-pencil study (Cipora et al., 2015a), where no gender differences were observed when only psychology students were considered. However, an individual's selection of a field of study is, among other factors, guided by one's attitudes, interests, stereotypes, and math anxiety (Ashcraft & Ridley, 2005). Therefore, this issue requires further attention as far as gender differences in MA are concerned. One must also keep in mind that there may be a selection bias in online assessment, such that individuals who are not representative of the general population (and their respective genders) decide to participate. Unfortunately, with the current study setup, we were not able to obtain data on the number of individuals who clicked the link but then decided not to submit their answers.
In general, our results suggest that one must be very cautious regarding claims about gender differences in MA. In particular, they provide some hints about reasons for inconsistent results regarding gender differences in math anxiety. These inconsistencies may, at least to some extent, originate from the fact that samples differ with respect to participants' current field of study. Furthermore, one must also keep in mind that the size of gender differences varies considerably between countries and cultures, as also shown in the PISA study.
Results of both the exploratory (Appendix A) and confirmatory factor analyses suggest that the factor structure does not differ substantially between genders: apart from gender differences in average math anxiety, the internal structure of the construct remains unchanged across genders.

Limitations of the Presented Study
First of all, the sample of our study was very homogeneous in terms of educational background. Like many previous studies (and psychology studies in general), we tested mostly young adult university students, which limits the extrapolation of our results to the general population. On the other hand, the scores obtained by our participants cover a large part of the AMAS theoretical score range, so reliability estimates and item characteristics are not affected by low systematic variability in the data.
Nevertheless, it would be worthwhile to test more varied samples, including people with varied highest levels of education. In particular, it would be useful to test participants who are not well familiarized with psychological measurement tools in general.
Another important limitation of the presented study is that it did not include any other measure, which precluded investigating convergent and discriminant validity. It seems relatively unlikely that two administration forms that are invariant with regard to factor structure, reliabilities, mean scores, and variances would differ in their correlations with external measures. Nevertheless, this should be addressed in future studies.
Importantly, our study did not involve any math-related activity. Participants were only asked to recall from memory how anxious they feel or would feel when confronted with several math-related situations. As thoroughly discussed by Bieg, Goetz, Wolter, and Hall (2015), such trait-like anxiety ratings can differ from those obtained after anxiety-evoking exposure: trait-like ratings are usually overestimates. Therefore, studies in which participants were or were not exposed to math, and were assessed for their MA before or after the exposure, cannot be considered equivalent a priori. This also requires future investigation.

General Conclusions
The results of our study provide further evidence for the usefulness of the AMAS as a math anxiety measure.
Cross-cultural invariance, together with measurement method invariance, strongly suggests the validity and generalizability of the MA construct. Interestingly, average AMAS scores reported from different countries also seem to be relatively similar (Dykeman, 2017). Furthermore, the combined results of numerous studies suggest that the AMAS validly reflects the MA construct, although further research is still needed to examine the fine-grained structure of MA. The AMAS also appears useful as an MA measurement instrument in varied settings, i.e., paper-and-pencil, computerized, and online. Therefore, the AMAS may serve as an additional measure in various fields of numerical cognition, since there is a growing body of evidence for the involvement of MA in several aspects of human number processing. To make practical use of what we know about math anxiety, it is important that practitioners are equipped with normative data so that they can make more informed decisions about individuals who are potentially at risk of high math anxiety. We hope that future researchers will also make the extra effort to prepare adequate AMAS norms for other populations.

Figure 1. Distributions of AMAS online scores: total score (Panel A; theoretical range from 9 to 45), Learning scale (Panel B; theoretical range from 5 to 25), and Testing scale (Panel C; theoretical range from 4 to 20).
By using another part of the code, one may generate a plot presenting the frequencies of the response alternatives for each item. The Pearson correlation between the two scales was .55. The correlation between the Learning scale and the total score was .85, and the correlation between the Testing scale and the total score was .91.

Figure 2. Confirmatory factor analysis of the AMAS online data. Indices of model fit are presented in the main text. Standardized coefficients are presented along the respective paths. The internal structure of the AMAS online data is highly similar to the paper-and-pencil data. Variables labelled e1, e2, etc. denote the respective error terms.

Table 2
Comparison Between AMAS Online and Paper-and-Pencil Administration

Table B4
Critical Differences Between Scores at Different Alpha Levels