Comparison of Power for Multiple Comparison Procedures
The number of methods for evaluating, and possibly making statistical decisions about, null contrasts (or their small subset, multiple comparisons) has grown extensively since the early 1950s. This growth demonstrates how important the subject is, but most of it consists of modest variations of the early methods. This paper examines nine fairly basic procedures, six of which are designed to evaluate contrasts chosen post hoc, i.e., after an examination of the test data. Three of these use experimentwise or familywise type 1 error rates (Scheffé 1953, Tukey 1953, Newman-Keuls 1939 and 1952), two use decision-based type 1 error rates (Duncan 1951 and Rodger 1975a) and one (Fisher's LSD 1935) uses a mixture of the two type 1 error rate definitions. The other three methods examined are for evaluating, and possibly deciding about, a limited number of null contrasts that have been chosen independently of the sample data, preferably before the data are collected. One of these (planned t-tests) uses decision-based type 1 error rates and the other two (one based on Bonferroni's Inequality 1936, and the other Dunnett's 1964 Many-One procedure) use a familywise type 1 error rate. The use of these different type 1 error rate definitions creates quite large discrepancies in the capacities of the methods to detect true non-zero effects in the contrasts being evaluated. This article describes those discrepancies in power and, especially, how they are exacerbated by increases in the size of an investigation (i.e., an increase in J, the number of samples being examined). The capacity of a multiple contrast procedure to 'unpick' 'true' differences from the sample data is also influenced by the type of contrast the procedure permits.
For example, multiple range procedures (such as those of Newman-Keuls and Duncan) permit only comparisons (i.e., two-group differences), and that greatly limits their discriminating capacity (which is not, technically speaking, their power). Many methods (those of Scheffé, Tukey's HSD, Newman-Keuls, Fisher's LSD, Bonferroni and Dunnett) place their emphasis on one particular question: "Are there any differences at all among the groups?" Other procedures (those of Duncan and Rodger, and planned contrasts) concentrate on individual contrasts, and so are more concerned with how many false null contrasts the method can detect. This results in two basically different definitions of detection capacity. Finally, there is a categorical difference between what post hoc methods can find and what methods evaluating pre-planned contrasts can find. The success of the latter depends on how wisely (or honestly well informed) the user has been in planning the limited number of statistically revealing contrasts to test. That planning can greatly affect the method's discriminating success, but it is often not included in power evaluations. These matters are elaborated upon as they arise in the exposition below.
Keywords: Multiple comparisons, post hoc contrasts, decision-based error rate, power loss
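As a rough illustration of why the choice of type 1 error rate definition matters more as J grows, the following minimal sketch counts the pairwise comparisons among J samples and shows two quantities: the per-comparison criterion that a Bonferroni-style familywise adjustment would impose, and the familywise rate that would result if every comparison were tested at an unadjusted level. The function names, the nominal level of 0.05 and the chosen values of J are assumptions made for illustration and are not taken from the paper. The familywise formula also assumes independent tests, which pairwise comparisons sharing samples are not, so it should be read only as an approximation. The sketch does not reproduce the power calculations reported in this article.

```python
from math import comb


def pairwise_count(J):
    """Number of distinct two-group comparisons among J samples."""
    return comb(J, 2)


def bonferroni_alpha(alpha, m):
    """Per-comparison level that keeps the familywise rate at or below alpha
    across m tests (Bonferroni's inequality)."""
    return alpha / m


def familywise_rate_if_independent(alpha, m):
    """Chance of at least one type 1 error in m tests, each run at level alpha.

    ASSUMPTION: the m tests are independent. Pairwise comparisons that share
    samples are correlated, so this is only an illustrative approximation.
    """
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    alpha = 0.05  # assumed nominal level, for illustration only
    for J in (3, 5, 10):  # illustrative numbers of samples
        m = pairwise_count(J)
        print(
            f"J={J:2d}  comparisons={m:2d}  "
            f"Bonferroni per-comparison alpha={bonferroni_alpha(alpha, m):.5f}  "
            f"unadjusted familywise rate (approx.)="
            f"{familywise_rate_if_independent(alpha, m):.3f}"
        )
```

Under these assumptions, a familywise method must apply an increasingly strict per-comparison criterion as J increases, whereas a decision-based method holds the per-comparison level fixed and lets the familywise rate rise. That trade-off is the source of the power discrepancies the article goes on to describe.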