
Non-normal data: Is ANOVA still a valid option?

María J. Blanca1, Rafael Alarcón1, Jaume Arnau2, Roser Bono2 and Rebecca Bendayan1,3
1 Universidad de Málaga, 2 Universidad de Barcelona and 3 MRC Unit for Lifelong Health and Ageing, University College London

Psicothema 2017, Vol. 29, No. 4, 552-557. doi: 10.7334/psicothema2016.383
ISSN 0214-9915 CODEN PSOTEG. Copyright © 2017 Psicothema. www.psicothema.com
Received: December 14, 2016 • Accepted: June 20, 2017
Corresponding author: María J. Blanca, Facultad de Psicología, Universidad de Málaga, 29071 Málaga (Spain). E-mail: blamen@uma.es

Abstract

Background: The robustness of F-test to non-normality has been studied from the 1930s through to the present day. However, this extensive body of research has yielded contradictory results, there being evidence both for and against its robustness. This study provides a systematic examination of F-test robustness to violations of normality in terms of Type I error, considering a wide variety of distributions commonly found in the health and social sciences. Method: We conducted a Monte Carlo simulation study involving a design with three groups and several known and unknown distributions. The manipulated variables were: equal and unequal group sample sizes; group sample size and total sample size; coefficient of sample size variation; shape of the distribution and equal or unequal shapes of the group distributions; and pairing of group size with the degree of contamination in the distribution. Results: The results showed that in terms of Type I error the F-test was robust in 100% of the cases studied, independently of the manipulated conditions.

Keywords: F-test, ANOVA, robustness, skewness, kurtosis.

One-way analysis of variance (ANOVA) or F-test is one of the most common statistical techniques in educational and psychological research (Keselman et al., 1998; Kieffer, Reese, & Thompson, 2001). The F-test assumes that the outcome variable is normally and independently distributed with equal variances among groups. However, real data are often not normally distributed and variances are not always equal. With regard to normality, Micceri (1989) analyzed 440 distributions from ability and psychometric measures and found that most of them were contaminated, including different types of tail weight (uniform to double exponential) and different classes of asymmetry. Blanca, Arnau, López-Montiel, Bono, and Bendayan (2013) analyzed 693 real datasets from psychological variables and found that 80% of them presented values of skewness and kurtosis ranging between -1.25 and 1.25, with extreme departures from the normal distribution being infrequent. These results were consistent with other studies with real data (e.g., Harvey & Siddique, 2000; Kobayashi, 2005; Van Der Linder, 2006).

The effect of non-normality on F-test robustness has, since the 1930s, been extensively studied under a wide variety of conditions. As our aim is to examine the independent effect of non-normality, the literature review focuses on studies that assumed variance homogeneity. Monte Carlo studies have considered unknown and known distributions such as mixed non-normal, lognormal, Poisson, exponential, uniform, chi-square, double exponential, Student’s t, binomial, gamma, Cauchy, and beta (Black, Ard, Smith, & Schibik, 2010; Büning, 1997; Clinch & Keselman, 1982; Feir-Walsh & Thoothaker, 1974; Gamage & Weerahandi, 1998; Lix, Keselman, & Keselman, 1996; Patrick, 2007; Schmider, Ziegler, Danay, Beyer, & Bühner, 2010).

One of the first studies on this topic was carried out by Pearson (1931), who found that F-test was valid provided that the deviation from normality was not extreme and the number of degrees of freedom apportioned to the residual variation was not too small. Norton (1951, cited in Lindquist, 1953) analyzed the effect of distribution shape on robustness (considering either that the distributions had the same shape in all the groups or a different shape in each group) and found that, in general, F-test was quite robust, the effect being negligible. Likewise, Tiku (1964) stated that distributions with skewness values in a different direction had a greater effect than did those with values in the same direction unless the degrees of freedom for error were fairly large.
However, Glass, Peckham, and Sanders (1972) summarized these early studies and concluded that the procedure was affected by kurtosis, whereas skewness had very little effect. Conversely, Harwell, Rubinstein, Hayes, and Olds (1992), using meta-analytic techniques, found that skewness had more effect than kurtosis. A subsequent meta-analytic study by Lix et al. (1996) concluded that Type I error performance did not appear to be affected by non-normality. These inconsistencies may be attributable to the fact that a standard criterion has not been used to assess robustness, thus leading to different interpretations of the Type I error rate. The use of a single and standard criterion such as that proposed by Bradley (1978) would be helpful in this context. According to Bradley’s (1978) liberal criterion a statistical test is considered robust if the empirical Type I error rate is between .025 and .075 for a nominal alpha level of .05. In fact, had Bradley’s criterion of robustness been adopted in the abovementioned studies, many of their results would have been interpreted differently, leading to different conclusions. Furthermore, when this criterion is considered, more recent studies provide empirical evidence for the robustness of F-test under non-normality with homogeneity of variances (Black et al., 2010; Clinch & Keselman, 1982; Feir-Walsh & Thoothaker, 1974; Gamage & Weerahandi, 1998; Kanji, 1976; Lantz, 2013; Patrick, 2007; Schmider et al., 2010; Zijlstra, 2004). 
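Bradley's liberal criterion can be expressed as a simple interval check. The sketch below assumes the usual general form of the criterion, 0.5α to 1.5α, which reproduces the .025 to .075 bounds quoted above for α = .05; the function names are illustrative and not taken from the paper.

```python
def bradley_liberal_bounds(alpha=0.05):
    """Lower and upper bounds of Bradley's liberal criterion (0.5*alpha, 1.5*alpha)."""
    return 0.5 * alpha, 1.5 * alpha


def is_robust(empirical_rate, alpha=0.05):
    """True if an empirical Type I error rate lies within Bradley's liberal bounds."""
    lower, upper = bradley_liberal_bounds(alpha)
    return lower <= empirical_rate <= upper


if __name__ == "__main__":
    # .0360 and .0556 are the smallest and largest rates visible in Table 2.
    for rate in (0.0360, 0.0556, 0.0200, 0.0800):
        print(rate, is_robust(rate))
```

Under this reading, a rate of .0360 counts as robust even though it is visibly conservative, which illustrates how much the verdict depends on the criterion chosen.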
Based on most early studies, many classical handbooks on research methods in education and psychology draw the following conclusions: Moderate departures from normality are of little concern in the fixed-effects analysis of variance (Montgomery, 1991); violations of normality do not constitute a serious problem, unless the violations are especially severe (Keppel, 1982); F-test is robust to moderate departures from normality when sample sizes are reasonably large and are equal (Winer, Brown, & Michels, 1991); and researchers do not need to be concerned about moderate departures from normality provided that the populations are homogeneous in form (Kirk, 2013). To summarize, F-test is robust to departures from normality when: a) the departure is moderate; b) the populations have the same distributional shape; and c) the sample sizes are large and equal. However, these conclusions are broad and ambiguous, and they are not helpful when it comes to deciding whether or not F-test can be used. The main problem is that expressions such as “moderate”, “severe” and “reasonably large sample size” are subject to different interpretations and, consequently, they do not constitute a standard guideline that helps applied researchers decide whether they can trust their F-test results under non-normality. Given this situation, the main goal of the present study is to provide a systematic examination of F-test robustness, in terms of Type I error, to violations of normality under homogeneity of variance, using a standard criterion such as that proposed by Bradley (1978). Specifically, we aim to answer the following questions: Is F-test robust to slight and moderate departures from normality? Is it robust to severe departures from normality? Is it sensitive to differences in shape among the groups? Does its robustness depend on the sample sizes? Is its robustness associated with equal or unequal sample sizes?
To this end, we designed a Monte Carlo simulation study to examine the effect of a wide variety of distributions commonly found in the health and social sciences on the robustness of F-test. Distributions with a slight and moderate degree of contamination (Blanca et al., 2013) were simulated by generating distributions with values of skewness and kurtosis ranging between -1 and 1. Distributions with a severe degree of contamination (Micceri, 1989) were represented by exponential, double exponential, and chi-square with 8 degrees of freedom. In both cases, a wide range of sample sizes were considered with balanced and unbalanced designs and with equal and unequal distributions in groups. With unequal sample size and unequal shape in the groups, the pairing of group sample size with the degree of contamination in the distribution was also investigated.

Method

Instruments

We conducted a Monte Carlo simulation study with non-normal data using SAS 9.4 (SAS Institute, 2013). Non-normal distributions were generated using the procedure proposed by Fleishman (1978), which uses a polynomial transformation to generate data with specific values of skewness and kurtosis.

Procedure

In order to examine the effect of non-normality on F-test robustness, a one-way design with 3 groups and homogeneity of variance was considered. The group effect was set to zero in the population model. The following variables were manipulated:

1. Equal and unequal group sample sizes. Unbalanced designs are more common than balanced designs in studies involving one-way and factorial ANOVA (Golinski & Cribbie, 2009; Keselman et al., 1998). Both were considered in order to extend our results to different research situations.

2. Group sample size and total sample size. A wide range of group sample sizes were considered, enabling us to study small, medium, and large sample sizes.
With balanced designs, the group sizes were set to 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, and 100, with total sample size ranging from 15 to 300. With unbalanced designs, group sizes were set between 5 and 160, with a mean group size of between 10 and 100 and total sample size ranging from 15 to 300.

3. Coefficient of sample size variation (Δn), which represents the amount of inequality in group sizes. This was computed by dividing the standard deviation of the group sample size by its mean. Different degrees of variation were considered and were grouped as low, medium, and high. A low Δn was fixed at approximately 0.16 (0.141-0.178), a medium coefficient at 0.33 (0.316-0.334), and a high value at 0.50 (0.491-0.521). Keselman et al. (1998) showed that the ratio of the largest to the smallest group size was greater than 3 in 43.5% of cases. With Δn = 0.16 this ratio was equal to 1.5, with Δn = 0.33 it was equal to either 2.3 or 2.5, and with Δn = 0.50 it ranged from 3.3 to 5.7.

4. Shape of the distribution and equal and unequal shape in the groups. Twenty-two distributions were investigated, involving several degrees of deviation from normality and with both equal and unequal shape in the groups. For equal shape and slight and moderate departures from normality, the distributions had values of skewness (γ1) and kurtosis (γ2) ranging between -1 and 1, these values being representative of real data (Blanca et al., 2013). The values of γ1 and γ2 are presented in Table 2 (distributions 1-12). For severe departures from normality, distributions had values of γ1 and γ2 corresponding to the double exponential, chi-square with 8 degrees of freedom, and exponential distributions (Table 2, distributions 13-15). For unequal shape, the values of γ1 and γ2 of each group are presented in Table 3.
Distributions 16-21 correspond to slight and moderate departures from normality and distribution 22 to severe departure.

5. Pairing of group size with degree of contamination in the distribution. This condition was included with unequal shape and unequal sample size. The pairing was positive when the largest group size was associated with the greater contamination, and vice versa. The pairing was negative when the largest group size was associated with the smallest contamination, and vice versa. The specific conditions with unequal sample size are shown in Table 1.

Ten thousand replications of the 1308 conditions resulting from the combination of the above variables were performed at a significance level of .05. This number of replications was chosen to ensure reliable results (Bendayan, Arnau, Blanca, & Bono, 2014; Robey & Barcikowski, 1992).

Data analysis

Empirical Type I error rates associated with F-test were analyzed for each condition according to Bradley’s robustness criterion (1978).

Results

Tables 2 and 3 show descriptive statistics for the Type I error rate across conditions for equal and unequal shapes. Although the tables do not include all available information (due to article length limitations), the maximum and minimum values are sufficient for assessing robustness. Full tables are available upon request from the corresponding author.

All empirical Type I error rates were within the bounds of Bradley’s criterion. The results show that F-test is robust for 3 groups in 100% of cases, regardless of the degree of deviation from a normal distribution, sample size, balanced or unbalanced cells, and equal or unequal distribution in the groups.

Discussion

We aimed to provide a systematic examination of F-test robustness to violations of normality under homogeneity of variance, applying Bradley’s (1978) criterion.
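A single cell of the simulation design described in the Method can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' SAS program: it draws directly from the named distributions with Python's standard library instead of using Fleishman's transformation, it runs 2,000 replications by default instead of 10,000, and it relies on the closed-form upper-tail probability of the F distribution that holds only when the numerator has 2 degrees of freedom, that is, for exactly three groups. All function and dictionary names are hypothetical.

```python
import random


def f_statistic(groups):
    """Return (F, df1, df2) for a one-way fixed-effects ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df1, df2 = k - 1, n_total - k
    return (ss_between / df1) / (ss_within / df2), df1, df2


def p_value_three_groups(f, df2):
    """Upper-tail p-value of F(2, df2); this closed form is valid only for df1 = 2."""
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)


# Assumption: direct stdlib draws stand in for the paper's generated distributions.
SAMPLERS = {
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
    # The difference of two independent exponentials is double exponential (Laplace).
    "double_exponential": lambda rng: rng.expovariate(1.0) - rng.expovariate(1.0),
    # Chi-square with 8 df is a gamma distribution with shape 4 and scale 2.
    "chi_square_8": lambda rng: rng.gammavariate(4.0, 2.0),
}


def type_i_error_rate(sampler, sizes, reps=2000, alpha=0.05, seed=1):
    """Proportion of replications rejecting a true null (same population in every group)."""
    if len(sizes) != 3:
        raise ValueError("the closed-form p-value used here requires exactly 3 groups")
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        groups = [[sampler(rng) for _ in range(n)] for n in sizes]
        f, _, df2 = f_statistic(groups)
        if p_value_three_groups(f, df2) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for name, sampler in SAMPLERS.items():
        rate = type_i_error_rate(sampler, sizes=(5, 8, 17))
        print(name, rate, 0.025 <= rate <= 0.075)
```

Because every group is drawn from the same population, the null hypothesis is true by construction and the rejection rate estimates the Type I error. The group sizes (5, 8, 17) are one of the unbalanced conditions listed in Table 1.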
Specifically, we sought to answer the following question: Is F-test robust, in terms of Type I error, to slight, moderate, and severe departures from normality, with various sample sizes (equal or unequal sample size) and with same or different shapes in the groups? The answer to this question is a resounding yes, since F-test controlled Type I error to within the bounds of Bradley’s criterion. Specifically, the results show that F-test remains robust with 3 groups when distributions have values of skewness and kurtosis ranging between -1 and 1, as well as with data showing a greater departure from normality, such as the exponential, double exponential, and chi-squared (8) distributions. This applies even when sample sizes are very small (i.e., n = 5) and quite different in the groups, and also when the group distributions differ significantly. In addition, the test’s robustness is independent of the pairing of group size with the degree of contamination in the distribution. Our results support the idea that the discrepancies between studies on the effect of non-normality may be primarily attributed to differences in the robustness criterion adopted, rather than to the degree of contamination of the distributions. These findings highlight the need to establish a standard criterion of robustness to clarify the potential implications when performing Monte Carlo studies.
The present analysis made use of Bradley’s criterion, which has been argued to be one of the most suitable criteria for examining the robustness of statistical tests (Keselman, Algina, Kowalchuk, & Wolfinger, 1999).

Table 1. Specific conditions studied under non-normality for unequal shape in the groups as a function of total sample size (N), mean group size (N/J), coefficient of sample size variation (Δn), and pairing of group size with the degree of distribution contamination: (+) the largest group size is associated with the greater contamination and vice versa, and (-) the largest group size is associated with the smallest contamination and vice versa.

| N | N/J | Δn | n, pairing (+) | n, pairing (-) |
|---|---|---|---|---|
| 30 | 10 | 0.16 | 8, 10, 12 | 12, 10, 8 |
| 30 | 10 | 0.33 | 6, 10, 14 | 14, 10, 6 |
| 30 | 10 | 0.50 | 5, 8, 17 | 17, 8, 5 |
| 45 | 15 | 0.16 | 12, 15, 18 | 18, 15, 12 |
| 45 | 15 | 0.33 | 9, 15, 21 | 21, 15, 9 |
| 45 | 15 | 0.50 | 6, 15, 24 | 24, 15, 6 |
| 60 | 20 | 0.16 | 16, 20, 24 | 24, 20, 16 |
| 60 | 20 | 0.33 | 12, 20, 28 | 28, 20, 12 |
| 60 | 20 | 0.50 | 8, 20, 32 | 32, 20, 8 |
| 75 | 25 | 0.16 | 20, 25, 30 | 30, 25, 20 |
| 75 | 25 | 0.33 | 15, 25, 35 | 35, 25, 15 |
| 75 | 25 | 0.50 | 10, 25, 40 | 40, 25, 10 |
| 90 | 30 | 0.16 | 24, 30, 36 | 36, 30, 24 |
| 90 | 30 | 0.33 | 18, 30, 42 | 42, 30, 18 |
| 90 | 30 | 0.50 | 12, 30, 48 | 48, 30, 12 |
| 120 | 40 | 0.16 | 32, 40, 48 | 48, 40, 32 |
| 120 | 40 | 0.33 | 24, 40, 56 | 56, 40, 24 |
| 120 | 40 | 0.50 | 16, 40, 64 | 64, 40, 16 |
| 150 | 50 | 0.16 | 40, 50, 60 | 60, 50, 40 |
| 150 | 50 | 0.33 | 30, 50, 70 | 70, 50, 30 |
| 150 | 50 | 0.50 | 20, 50, 80 | 80, 50, 20 |
| 180 | 60 | 0.16 | 48, 60, 72 | 72, 60, 48 |
| 180 | 60 | 0.33 | 36, 60, 84 | 84, 60, 36 |
| 180 | 60 | 0.50 | 24, 60, 96 | 96, 60, 24 |
| 210 | 70 | 0.16 | 56, 70, 84 | 84, 70, 56 |
| 210 | 70 | 0.33 | 42, 70, 98 | 98, 70, 42 |
| 210 | 70 | 0.50 | 28, 70, 112 | 112, 70, 28 |
| 240 | 80 | 0.16 | 64, 80, 96 | 96, 80, 64 |
| 240 | 80 | 0.33 | 48, 80, 112 | 112, 80, 48 |
| 240 | 80 | 0.50 | 32, 80, 128 | 128, 80, 32 |
| 270 | 90 | 0.16 | 72, 90, 108 | 108, 90, 72 |
| 270 | 90 | 0.33 | 54, 90, 126 | 126, 90, 54 |
| 270 | 90 | 0.50 | 36, 90, 144 | 144, 90, 36 |
| 300 | 100 | 0.16 | 80, 100, 120 | 120, 100, 80 |
| 300 | 100 | 0.33 | 60, 100, 140 | 140, 100, 60 |
| 300 | 100 | 0.50 | 40, 100, 160 | 160, 100, 40 |
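The coefficient of sample size variation reported in Table 1 can be recomputed from the group sizes. The sketch below assumes the population standard deviation (divisor J rather than J - 1), an inference from the fact that it approximately reproduces the tabled values of 0.16, 0.33 and 0.50; the paper itself only says "standard deviation". Function names are illustrative.

```python
from statistics import mean, pstdev


def size_variation(sizes):
    """Coefficient of sample size variation: SD of the group sizes over their mean.

    Assumption: population SD (pstdev), chosen because it matches Table 1.
    """
    return pstdev(sizes) / mean(sizes)


def size_ratio(sizes):
    """Ratio of the largest to the smallest group size."""
    return max(sizes) / min(sizes)


if __name__ == "__main__":
    # The three N = 30 conditions from Table 1.
    for sizes in ([8, 10, 12], [6, 10, 14], [5, 8, 17]):
        print(sizes, round(size_variation(sizes), 3), round(size_ratio(sizes), 2))
```

With the sample standard deviation, the sizes 8, 10, 12 would give 0.20 rather than roughly 0.16, which is why the population form is assumed here.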
In this respect, our results are consistent with previous studies whose Type I error rates were within the bounds of Bradley’s criterion under certain departures from normality (Black et al., 2010; Clinch & Keselman, 1982; Feir-Walsh & Thoothaker, 1974; Gamage & Weerahandi, 1998; Kanji, 1976; Lantz, 2013; Lix et al., 1996; Patrick, 2007; Schmider et al., 2010; Zijlstra, 2004). By contrast, however, our results do not concur, at least for the conditions studied here, with those classical handbooks which conclude that F-test is only robust if the departure from normality is moderate (Keppel, 1982; Montgomery, 1991), the populations have the same distributional shape (Kirk, 2013), and the sample sizes are large and equal (Winer et al., 1991).

Our findings are useful for applied research since they show that, in terms of Type I error, F-test remains a valid statistical procedure under non-normality in a variety of conditions. Data transformation or nonparametric analysis is often recommended when data are not normally distributed. However, data transformations offer no additional benefits over the good control of Type I error achieved by F-test. Furthermore, it is usually difficult to determine which transformation is appropriate for a set of data, and a given transformation may not be applicable when groups differ in shape. In addition, results are often difficult to interpret when data transformations are adopted. There are also disadvantages to using non-parametric procedures such as the Kruskal-Wallis test. This test converts quantitative continuous data into rank-ordered data, with a consequent loss of information. Moreover, the null hypothesis associated with the Kruskal-Wallis test differs from that of F-test, unless the distribution of groups has exactly the same shape (see Maxwell & Delaney, 2004). Given these limitations, there is no reason to prefer the Kruskal-Wallis test under the conditions studied in the present paper.
Only with equal shape in the groups might the Kruskal-Wallis test be preferable, given its power advantage over F-test under specific distributions (Büning, 1997; Lantz, 2013). However, other studies suggest that F-test is robust, in terms of power, to violations of normality under certain conditions (Ferreira, Rocha, & Mequelino, 2012; Kanji, 1976; Schmider et al., 2010), even with very small sample size (n = 3; Khan & Rayner, 2003). In light of these inconsistencies, future research should explore the power of F-test when the normality assumption is not met. At all events, we encourage researchers to analyze the distribution underlying their data (e.g., coefficients of skewness and kurtosis in each group, goodness of fit tests, and normality graphs) and to estimate a priori the sample size needed to achieve the desired power.

Table 2. Descriptive statistics of Type I error for F-test with equal shape for each combination of skewness (γ1) and kurtosis (γ2) across all conditions.

| Distribution | γ1 | γ2 | n | Min | Max | Mdn | M | SD |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.4 | = | .0434 | .0541 | .0491 | .0493 | .0029 |
| 1 | 0 | 0.4 | ≠ | .0445 | .0556 | .0497 | .0496 | .0022 |
| 2 | 0 | 0.8 | = | .0444 | .0534 | .0474 | .0479 | .0023 |
| 2 | 0 | 0.8 | ≠ | .0458 | .0527 | .0484 | .0487 | .0016 |
| 3 | 0 | -0.8 | = | .0468 | .0512 | .0490 | .0491 | .0014 |
| 3 | 0 | -0.8 | ≠ | .0426 | .0532 | .0486 | .0487 | .0024 |
| 4 | 0.4 | 0 | = | .0360 | .0499 | .0469 | .0457 | .0044 |
| 4 | 0.4 | 0 | ≠ | .0392 | .0534 | .0477 | .0472 | .0032 |
| 5 | 0.8 | 0 | = | .0422 | .0528 | .0477 | .0476 | .0029 |
| 5 | 0.8 | 0 | ≠ | .0433 | .0553 | .0491 | .0491 | .0030 |
| 6 | -0.8 | 0 | = | .0427 | | | | |
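Table 2 indexes each distribution by its skewness (γ1) and kurtosis (γ2), which is how Fleishman's (1978) polynomial transformation is parameterized. The sketch below is a hedged pure-Python illustration of that idea, not the authors' SAS code. It assumes the standard form Y = -c + bZ + cZ² + dZ³ with Z standard normal and γ2 as excess kurtosis, and it solves the three moment equations with a simple Newton iteration started from the normal case (b = 1, c = d = 0). The solver, its tolerances and all function names are illustrative choices.

```python
import random


def fleishman_moments(b, c, d):
    """Variance, skewness and excess kurtosis implied by Y = -c + b*Z + c*Z^2 + d*Z^3.

    The skewness and kurtosis expressions assume the variance equals 1.
    """
    var = b**2 + 6*b*d + 2*c**2 + 15*d**2
    skew = 2*c*(b**2 + 24*b*d + 105*d**2 + 2)
    kurt = 24*(b*d + c**2*(1 + b**2 + 28*b*d)
               + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2))
    return var, skew, kurt


def solve3(a, r):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [ri] for row, ri in zip(a, r)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, 3):
            factor = m[i][col] / m[col][col]
            for j in range(col, 4):
                m[i][j] -= factor * m[col][j]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))) / m[i][i]
    return x


def fleishman_coefficients(skew, kurt, tol=1e-10, max_iter=100):
    """Find (b, c, d) giving unit variance and the requested skewness and kurtosis."""
    target = (1.0, skew, kurt)

    def residual(v):
        return [m - t for m, t in zip(fleishman_moments(*v), target)]

    x, h = [1.0, 0.0, 0.0], 1e-7
    for _ in range(max_iter):
        r = residual(x)
        if max(abs(v) for v in r) < tol:
            return tuple(x)
        jac = [[0.0] * 3 for _ in range(3)]
        for j in range(3):  # central-difference Jacobian, one column per unknown
            hi, lo = x[:], x[:]
            hi[j] += h
            lo[j] -= h
            r_hi, r_lo = residual(hi), residual(lo)
            for i in range(3):
                jac[i][j] = (r_hi[i] - r_lo[i]) / (2 * h)
        step = solve3(jac, [-v for v in r])
        x = [xi + si for xi, si in zip(x, step)]
    raise ValueError("no convergence; the pair may lie outside Fleishman's feasible region")


def fleishman_sample(n, skew, kurt, rng):
    """Draw n values with mean 0, variance 1 and the requested skewness and kurtosis."""
    b, c, d = fleishman_coefficients(skew, kurt)
    return [-c + b*z + c*z*z + d*z**3 for z in (rng.gauss(0.0, 1.0) for _ in range(n))]


if __name__ == "__main__":
    # (γ1, γ2) pairs of distributions 1, 2 and 4 in Table 2.
    for g1, g2 in ((0.0, 0.4), (0.0, 0.8), (0.4, 0.0)):
        print(g1, g2, fleishman_coefficients(g1, g2))
```

Not every (γ1, γ2) pair is attainable with a third-order polynomial, and the system can have more than one solution. Starting from the normal case picks the solution closest to the identity transformation, and the function raises an error instead of returning unconverged coefficients.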
