Using Correspondence Tests to Assess Replicability of Open Science Collaboration Results: Inferences from a SMART Design-based, Meta-analytic Approach
The logic of the SMART (Sequential Multiple Assignment Randomized Trial) design was applied to assess the replicability of original-replicate study pairs for Open Science Collaboration (OSC) intervention studies. Within SMART, we utilized both subtests of the correspondence test (CT) to assess study pair comparability. First, we implemented a CT difference test to determine whether the effect size difference between an original study and its replicate was close to zero; second, we implemented a CT equivalence test to determine whether that study pair's effect size difference fell within a designated threshold. In Stage 1 of SMART, each study pair was randomly assigned to one of two alpha levels (.01 or .05), thereby creating two probabilistically similar subsets of study pairs. Within each alpha subset, successful difference tests (the effect size difference was not significantly different from zero) and unsuccessful difference tests were then identified. In Stage 2 of SMART, study pairs in each combination of alpha level and difference test outcome were randomly assigned to one of two thresholds (±.25 SD or ±.50 SD). Equivalence tests were then conducted for all study pairs in each of these four subsets. Equivalence was successful when the effect size difference between an original study and its replicate was statistically significantly smaller than the designated threshold. Thus, an initial randomization followed by a second randomization was used to gauge the comparability of each OSC original study and its replicate, for two alpha levels and two thresholds. In the first set of results, mirroring the common replicability assessment case in which only difference tests are conducted, 16 of 96 difference tests (16.7%) conducted in Stage 1 were successful. In the second set of results, for initially successful difference tests, two thresholds were used to determine the percent of study pairs that also passed the equivalence test.
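The Stage 1 procedure described above can be sketched as follows. This is a minimal illustration only, assuming standardized effect sizes with known standard errors and a normal approximation for the difference test; the study pairs, effect sizes, and standard errors shown are hypothetical, not values from the OSC data.

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def difference_test(es_orig, se_orig, es_rep, se_rep, alpha):
    """CT difference test: 'success' means the original-replicate effect
    size difference is NOT significantly different from zero at alpha."""
    diff = es_orig - es_rep
    se_diff = math.sqrt(se_orig**2 + se_rep**2)
    z = diff / se_diff
    p = 2.0 * (1.0 - norm_cdf(abs(z)))  # two-sided p-value
    return p > alpha

# Stage 1: randomly assign each (hypothetical) study pair to an alpha level,
# then record difference test success within each alpha subset.
random.seed(1)
pairs = [(0.50, 0.10, 0.45, 0.10),   # replicate close to original
         (0.80, 0.10, 0.10, 0.10)]   # replicate far from original
for es_o, se_o, es_r, se_r in pairs:
    alpha = random.choice([0.01, 0.05])
    print(alpha, difference_test(es_o, se_o, es_r, se_r, alpha))
```

In this sketch the first pair passes the difference test (the difference is indistinguishable from zero) while the second fails, regardless of which alpha it was randomized to.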
Depending on alpha and threshold, 8.0%-13.8% of study pairs successfully passed both the difference and equivalence CT subtests. In the third set of results, using SMART, after randomization to the two alpha levels and contingent on the success or failure of the difference test, study pairs were randomized to the two thresholds and a statistical test of equivalence was conducted. Using meta-analytic methods within SMART-based subsets of study pairs, average original-replicate effect size differences were compared to the differences in the second set of results. We found that a comparable 10.3% of study pairs passed both CT subtests (nine of 87 study pairs passed the difference test at either alpha and passed the equivalence test at either threshold). Underscoring the importance of incorporating both CT subtests, of the 16 study pairs that initially passed the difference test, nearly half (43.7%) failed the equivalence test. Thus, for CT success, the choice of alpha had little impact, while the choice of threshold was an important determinant. In all three sets of results, the percent of successful replications was substantially smaller than the 36% of OSC replicates that were statistically significant. As a check on the replicability of the current study itself, we found very similar patterns of CT success and failure across two SMART-based tables, one for alpha = .01 and one for alpha = .05. The current research extends the utility of CT established by Steiner and Wong (2018), whose results were based on simulated data.
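The CT equivalence subtest described above can be sketched as a two one-sided tests (TOST) procedure. This is a minimal illustration under the same assumptions as before (normal approximation, known standard errors); the effect sizes are hypothetical, and the example merely echoes the abstract's point that threshold choice can flip the equivalence verdict for a pair that passed the difference test.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def equivalence_test(es_orig, se_orig, es_rep, se_rep, threshold, alpha):
    """CT equivalence test via two one-sided tests (TOST): 'success' means
    the original-replicate difference is significantly within +/-threshold."""
    diff = es_orig - es_rep
    se_diff = math.sqrt(se_orig**2 + se_rep**2)
    z_lower = (diff + threshold) / se_diff  # H0: diff <= -threshold
    z_upper = (diff - threshold) / se_diff  # H0: diff >= +threshold
    p_lower = 1.0 - norm_cdf(z_lower)
    p_upper = norm_cdf(z_upper)
    return max(p_lower, p_upper) < alpha

# Same hypothetical pair, two thresholds: equivalent at +/-.50 SD,
# but not demonstrably equivalent at the stricter +/-.25 SD.
print(equivalence_test(0.50, 0.10, 0.45, 0.10, 0.50, 0.05))
print(equivalence_test(0.50, 0.10, 0.45, 0.10, 0.25, 0.05))
```

Note that failing the equivalence test does not show the pair is different; it only means the data are insufficient to bound the difference within the chosen threshold, which is why both CT subtests are needed.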
Keywords: Correspondence test, Open Science Collaboration, Replication, SMART