[The use of the non-parametric Wilcoxon-Mann-Whitney test in the analysis of medical studies]
Corinna Kühnast 1, Markus Neuhäuser 1,2
1 Institute for Medical Informatics, Biometry and Epidemiology, University of Duisburg-Essen, Essen, Germany
2 Department of Mathematics and Technique, RheinAhrCampus, Koblenz University of Applied Sciences, Remagen, Germany
Abstract
Background: In biomedical studies, the assumption that the data are normally distributed is often not tenable. Although more suitable alternative tests are available, parametric tests are very frequently used for data analysis in such studies.
Methods: We examined studies from five medical journals that contained the unpaired t-test and/or the non-parametric Wilcoxon-Mann-Whitney test. The aim was to identify associations between the choice of a parametric or non-parametric test and other characteristics of a study, such as journal type, sample size, randomization, or sponsoring.
Results: The non-parametric Wilcoxon-Mann-Whitney test was used in 30% of the studies. In a multivariable logistic regression, the variables journal type, test object, scale of measurement, and statistical software were significant. The Wilcoxon-Mann-Whitney test was used particularly often when the data were not continuous, when the journal had a high impact factor, in studies conducted in humans, and when the statistical software (especially SPSS) was specified.
Introduction
When looking into the medical literature one gets the impression that parametric statistical methods such as Student’s t-test are the common standard, although the underlying normality assumption is often not tenable, especially for small or moderate sample sizes. On the one hand, empirical work has shown that deviations from a normal distribution are frequent even for continuous data [1]. According to Nanna and Sawilowsky [2], normality is the exception rather than the norm in applied research. However, for large sample sizes one may rely on the central limit theorem and apply a test designed for normally distributed data. On the other hand, ordinal data are widespread in biomedical research [3]. For such data non-parametric tests based on ranks are appropriate, but the statistical analysis is often not performed properly, as shown, for example, by Jakobsson [4] for the analysis of ordinal data in nursing research.
Sometimes a transformation is applied in order to normalize continuous but non-normal data. However, for non-normal data it is preferable to perform a non-parametric test. Transformations often cannot be applied because a transformation “must be motivated from previous experimental or scientific evidence. Unless determined a priori, transforms can be misused to inflate or mitigate observed significance in a spurious fashion” ([5], p. 130). Furthermore, the hypotheses before and after the transformation may differ [6]. Hence, the use of transformations for the sole purpose of complying with the assumptions of parametric tests is dangerous [7].
We investigated how frequently the t-test and its non-parametric competitor, the Wilcoxon-Mann-Whitney (WMW) test, are used in medical studies. We examined which factors and variables are important for the choice between the non-parametric WMW test and the parametric t-test in studies that compare two independent groups, published in medical journals with different scopes and impact. We also discuss whether the decision for one of the methods is appropriate.
Methods
All original work related to medical studies published in 2004 in five biomedical journals was surveyed. The three journals American Journal of Physiology (Heart Circ. Physiol.), Annals of Surgery, and Circulation Research were considered because they were also included in a previous study [8]. In addition, The Lancet and The New England Journal of Medicine were included in our study. These journals were categorized into two groups with different topics and impact factors (Table 1). The first author thoroughly checked each paper for whether it reported original, previously unpublished data, irrespective of medical subject, study design, or size/format of the paper.
Table 1: Included journals and number of studies
For the analyses presented here, all studies that contained at least the unpaired t-test or the WMW test were included. In addition to the statistical test used, the following factors and variables were recorded: type of journal, sample size, kind of test objects, scale of measurement, information about randomization, sponsoring by pharmaceutical companies, and the statistical software used.
Analyses were performed with logistic regressions. When the software used for analysis could not perform both the t-test and the WMW test, the respective study was excluded from the logistic regression analysis. The total sample size was grouped into three categories with an approximately equal number of studies (<15, 15-<50, ≥50). Odds ratios (OR) and their 95% confidence intervals (95%-CI) were estimated by logistic regressions. A p-value ≤0.05 was considered significant. Because of the exploratory nature of our study no multiplicity adjustment was applied [9]. Both authors analyzed the data.
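For a single binary factor, the odds ratio from a univariate logistic regression coincides with the cross-product ratio of the corresponding 2x2 table, and a Wald 95% confidence interval can be computed on the log scale. The following is a minimal sketch of that calculation. The function name and the counts in the usage lines are hypothetical and are not taken from the study.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.959964):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: factor present, WMW test used      b: factor present, WMW test not used
    c: factor absent,  WMW test used      d: factor absent,  WMW test not used
    z: standard normal quantile (1.959964 gives a 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR); requires all four cells to be non-zero.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, for illustration only (not data from the study).
or_, lo, hi = odds_ratio_ci(20, 10, 10, 20)
print(f"OR={or_:.2f}, 95%-CI: {lo:.2f}-{hi:.2f}")
```

Because the interval is symmetric on the log scale, the odds ratio is the geometric mean of the two limits. A multivariable model with several factors cannot be reduced to a single table in this way and needs a fitted regression.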
Results
In total, 1879 publications were surveyed, and 630 studies could be included in the analyses (Table 1). Overall, the unpaired t-test predominated in studies where two groups were compared. In 112 studies (18%) only the WMW test was used, and in 444 studies (70%) only the unpaired t-test; in 74 studies (12%) both tests were applied within one study. Note that the two tests may have been used to analyse different variables; however, we also found identical variables analysed with both tests. In the logistic regressions presented below, the studies without the WMW test are compared with the studies with the WMW test.
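The percentages above, and the 30% share of studies using the WMW test, follow directly from the three reported counts. A short check, with variable names chosen here for illustration:

```python
# Counts reported in the text for the 630 included studies.
only_wmw = 112   # only the WMW test
only_t = 444     # only the unpaired t-test
both = 74        # both tests within one study

total = only_wmw + only_t + both
any_wmw = only_wmw + both  # studies "with the WMW test" in the regressions

def percent(count):
    """Share of all included studies, rounded to a whole percent."""
    return round(100 * count / total)

print(total, percent(only_wmw), percent(only_t), percent(both), percent(any_wmw))
```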
Two of the 630 studies were excluded from the logistic regression analyses because the specified software cannot perform the WMW test. The univariate analyses show significant relationships between the use of the WMW test and the journal type. The WMW test is more common in the diverse and high-impact journals The New England Journal of Medicine and The Lancet (p≤0.001, OR=5.21, 95%-CI: 3.53-7.69). Moreover, the WMW test is more common in studies in humans (p≤0.001, OR=6.44, 95%-CI: 4.42-9.38), and, not surprisingly, in studies with non-continuous variables (p≤0.001, OR=8.49, 95%-CI: 4.73-15.27). In addition, the statistical software used is significantly related to the choice between the two statistical tests (p≤0.001). In particular, the WMW test is more common when one of the two common software packages SPSS (p=0.004, OR=4.64, 95%-CI: 2.48-8.69) and SAS (p=0.030, OR=4.34, 95%-CI: 1.96-9.61) is used. Another significant relationship was found regarding information about randomization (p≤0.001, OR=2.44, 95%-CI: 1.70-3.50).
The WMW test seems to be more common when the study is sponsored by a pharmaceutical company (p=0.028, OR=2.32, 95%-CI: 1.10-4.90). The sample size was also significant in the univariate logistic regression (p≤0.001). In particular, the WMW test was applied more often in case of large samples (i.e. n≥50) than in case of small samples (i.e. n<15) (p=0.001, OR=5.88, 95%-CI: 3.68-9.39).
Obviously, the different factors are not independent. Therefore, a multivariable logistic regression was applied in order to confirm the univariate results. The type of journal, the test object (studies in humans or in other subjects), the scale of measurement (continuous or not) and the statistical software used remained significant (Table 2). The factors randomization, sponsoring and the categorized sample size are no longer significant. With regard to the software, SAS is no longer significant either. In the multivariable regression, only SPSS is associated with a significantly higher probability of using the WMW test.
Table 2: Results of the univariate and multivariable logistic regressions
In 57 studies a reason is specified for using the WMW test. The most common reasons are “non-normal data” and “categorical data”. Further correct reasons are “requirements for t-test not fulfilled” and “small sample sizes”. However, the latter reason is correct only when applying the exact (permutation) version of the WMW test. There are also reasons that are problematic from a statistical point of view: in four studies the WMW test was applied before or after the t-test, at least partly because the t-test was not significant. In one further study the WMW test was used because of an observed heterogeneity in variances. However, the WMW test cannot guarantee the significance level in case of unequal variances [10]. Moreover, the specified reason “in order to compare medians” is correct only if a pure location shift between the two distributions can be assumed.
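The exact (permutation) version mentioned above can be sketched by enumerating every possible assignment of the pooled ranks to the first group. This is a minimal illustration in plain Python, assuming two small independent samples. The function names are hypothetical, and complete enumeration is feasible only for small sample sizes.

```python
from itertools import combinations

def midranks(values):
    """Assign 1-based ranks, averaging the ranks within ties (midranks)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def exact_wmw_pvalue(x, y):
    """Two-sided exact permutation p-value based on the rank sum of x."""
    ranks = midranks(list(x) + list(y))
    n, total = len(x), len(x) + len(y)
    expected = n * (total + 1) / 2  # mean rank sum of x under the null hypothesis
    observed = abs(sum(ranks[:n]) - expected)
    extreme = 0
    n_perm = 0
    for idx in combinations(range(total), n):
        n_perm += 1
        deviation = abs(sum(ranks[i] for i in idx) - expected)
        if deviation >= observed - 1e-9:  # tolerance for floating-point midranks
            extreme += 1
    return extreme / n_perm

# Completely separated samples with three observations per group.
p = exact_wmw_pvalue([1, 2, 3], [4, 5, 6])
print(p)
```

With three observations per group there are only 20 possible assignments, so the smallest attainable two-sided p-value is 2/20 = 0.1. Very small samples therefore cannot reach the 0.05 level with this test, whatever the data.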
As mentioned above, one may rely on the central limit theorem when sample sizes are large and, consequently, one may apply a parametric test such as the t-test. However, in 395 out of the considered 630 studies the (total) sample size is less than 50. In 89% (353) of these studies with low sample size the t-test was applied, sometimes in addition to the WMW test (34 studies). In the remaining 319 studies with low sample size the t-test, but not the WMW test, was used. However, in 317 out of these 319 studies (99%) there are continuous variables. Hence, given the relatively high robustness of the t-test to skewed continuous distributions [11], the basic assumptions seem to be fulfilled in the vast majority of studies when applying the t-test.
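The breakdown of the 395 studies with a total sample size below 50 can be retraced from the reported counts. In the sketch below the variable names are illustrative, and the count of studies using only the WMW test is derived here from the inclusion criterion (each included study used at least one of the two tests) rather than stated in the text.

```python
small_total = 395        # studies with total sample size < 50
with_t = 353             # t-test applied, alone or together with the WMW test
t_and_wmw = 34           # both tests applied

t_only = with_t - t_and_wmw        # t-test without the WMW test
wmw_only = small_total - with_t    # derived, not stated in the text
continuous_t_only = 317            # t-only studies with continuous variables

share_with_t = round(100 * with_t / small_total)
share_continuous = round(100 * continuous_t_only / t_only)
print(t_only, wmw_only, share_with_t, share_continuous)
```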
In case of more than two groups the Kruskal-Wallis test can be applied as a non-parametric test instead of the WMW test. Among the 1879 surveyed publications the Kruskal-Wallis test was applied in 53 studies. Many of these studies have a total sample size below 50 (23 studies) and/or non-continuous data (18 studies). The parametric analogue, an analysis of variance (ANOVA), was found in 658 studies. However, these 658 studies cannot be compared with the 53 studies with a Kruskal-Wallis test because an ANOVA is much more flexible than the Kruskal-Wallis test and can also be applied in studies with more complex designs.
Discussion
The reasons some authors gave for choosing the WMW test, together with the characteristics of the published data, indicate that the scale of measurement is the primary factor for a decision in favour of a non-parametric test. However, there are three further factors that remained significant in the multivariable logistic regression.
The study subject is one of these significant factors. The WMW test is more often used in studies in humans. However, in these studies non-continuous variables are more common as well (Table 3). Furthermore, the software has a significant influence.
Table 3: Frequencies of study subject by scale of measurement
A further significant factor is the type of journal. A possible explanation is that the high-impact journals have a more detailed statistical review and that they may reject a paper because of an inappropriate statistical analysis. In line with this, studies published in journals with high impact factors often contain a more detailed methodological description than studies published in other journals. Note in this context that The New England Journal of Medicine states in its instructions for authors that “non-parametric methods should be used to compare groups when the distribution of the dependent variable is not normal” (http://authors.nejm.org/help/newms.asp).
In addition to The Lancet and The New England Journal of Medicine we included the three journals American Journal of Physiology (Heart Circ. Physiol.), Annals of Surgery, and Circulation Research in our study. These three journals were also included in a previous study [8]. This sample of five journals is not necessarily representative of the multitude of biomedical journals. However, we are able to compare our results with the work of Ludbrook and Dudley [8]. This comparison indicates that the way medical scientists use parametric and non-parametric tests has not changed considerably. Ludbrook and Dudley’s [8] findings about the handling of statistical methods can be confirmed even ten years later.
Given the higher efficiency of non-parametric tests for non-normal data [12], non-parametric tests such as the WMW test should be applied more often, especially when the sample size is not very large. In other areas of life sciences the WMW test seems to be more common. Ruxton [13] surveyed one volume of the journal Behavioral Ecology. The WMW test was applied in 21/33=64% of the papers that used the two-sample t-test and/or the WMW test.
Notes
Conflicts of interest
None declared.
References
[1] Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychol Bull. 1989;105:156-66.
[2] Nanna MJ, Sawilowsky SS. Analysis of Likert scale data in disability and medical rehabilitation research. Psychol Methods. 1998;3:55-67.
[3] Rabbee N, Coull BA, Mehta C, Patel N, Senchaudhuri P. Power and sample size for ordered categorical data. Stat Methods Med Res. 2003;12(1):73-84.
[4] Jakobsson U. Statistical presentation and analysis of ordinal data in nursing research. Scand J Caring Sci. 2004;18(4):437-40.
[5] Piegorsch WW, Bailer AJ. Statistics for environmental biology and toxicology. London, England: Chapman & Hall; 1997.
[6] Games PA. Data transformation, power, and skew: a rebuttal to Levine and Dunlap. Psychol Bull. 1984;95:345-7.
[7] Wilson JB. Priorities in statistics, the sensitive feet of elephants and don't transform data. Folia Geobotanica. 2007;42:161-7.
[8] Ludbrook J, Dudley H. Why permutation tests are superior to t and F tests in biomedical research. Am Stat. 1998;52(2):127-32.
[9] Neuhäuser M. How to deal with multiple endpoints in clinical trials. Fundam Clin Pharmacol. 2006;20(6):515-23.
[10] Kasuya E. Mann-Whitney U test when variances are unequal. Anim Behav. 2001;61(6):1247-9.
[11] Posten HO. The robustness of the two-sample t-test over the Pearson system. J Stat Comput Simul. 1978;6:295-311.
[12] Lehmann EL. Non-parametrics: Statistical methods based on ranks. San Francisco, CA: Holden-Day; 1975.
[13] Ruxton GD. The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test. Behav Ecol. 2006;17(4):688-90.