Methods for Outcome Research

4 July 2024

Falk Leichsenring

Outcome research[*]

Falk Leichsenring and Christiane Steinert

In the following, basic issues of outcome research will be presented.

Superiority, Equivalence and Non-inferiority

In a traditional (two-sided) superiority trial, the null and alternative hypotheses are H0: Δ = 0 and HA: Δ ≠ 0, respectively (Lesaffre, 2008). If the 95% CI does not include Δ = 0, the difference is statistically significant at the 0.05 level. Equivalence and non-inferiority testing, however, differ from superiority testing (Leichsenring et al., 2015; Lesaffre, 2008; Walker & Nowacki, 2011). For showing equivalence of a test treatment to a treatment already established in efficacy, Chambless and Hollon (1998, p. 18) stated only that the power must be “sufficient to detect moderate differences”. However, a non-significant difference does not allow the conclusion that there is no difference (Barker et al., 2002; Walker & Nowacki, 2011). Chambless and Hollon (1998) seem to have used the logic of superiority testing here.

Non-inferiority and equivalence testing, by contrast, require the a priori specification of a non-inferiority margin (ΔNI) or an equivalence margin (ΔE) by which the test treatment may be less efficacious than the standard treatment or, in the case of equivalence testing, also more efficacious (Leichsenring et al., 2015; Lesaffre, 2008; Walker & Nowacki, 2011). Compared to superiority testing, the statistical null and alternative hypotheses are reversed (Leichsenring et al., 2015; Lesaffre, 2008; Walker & Nowacki, 2011). Acceptance of the alternative hypothesis (HA: −ΔE ≤ Δ ≤ ΔE) is consistent with equivalence. Equivalence is tested by two one-sided tests (TOST), each at α = 0.05, which corresponds to a 90% confidence interval (Walker & Nowacki, 2011). Equivalence is demonstrated if the effect size and its confidence interval lie completely within the margin (−ΔE, ΔE). In non-inferiority testing, only one bound of the 95% confidence interval is considered, which corresponds to a one-sided 97.5% CI that is unbounded on one side and to a one-sided significance level of 0.025, the level at which non-inferiority tests are performed nowadays (Lesaffre, 2008).
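The three decision rules described above can be illustrated with a minimal sketch. It assumes a normal approximation, a difference defined as test minus standard treatment (positive values favouring the test treatment) and a known standard error; the function names and all numeric values are hypothetical and are not taken from the cited sources.

```python
from statistics import NormalDist


def confidence_interval(diff, se, level):
    """Two-sided normal-approximation CI for a treatment difference."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


def superiority(diff, se):
    """Two-sided test at alpha = 0.05: the 95% CI must exclude 0."""
    low, high = confidence_interval(diff, se, 0.95)
    return low > 0 or high < 0


def equivalence(diff, se, margin):
    """TOST, each test at alpha = 0.05: the 90% CI must lie inside (-margin, margin)."""
    low, high = confidence_interval(diff, se, 0.90)
    return -margin < low and high < margin


def non_inferiority(diff, se, margin):
    """One-sided test at alpha = 0.025: the lower bound of the 95% CI must exceed -margin."""
    low, _ = confidence_interval(diff, se, 0.95)
    return low > -margin


# Hypothetical example: small difference, precise estimate, margin of 0.25.
print(superiority(-0.05, 0.08))
print(equivalence(-0.05, 0.08, margin=0.25))
print(non_inferiority(-0.05, 0.08, margin=0.25))
```

With a less precise estimate (for instance diff = -0.05, se = 0.20, margin = 0.25), both superiority() and equivalence() return False in this sketch: the difference is non-significant, yet equivalence is not demonstrated, which is the point made by Barker et al. (2002) and Walker and Nowacki (2011).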
The traditional two-sided superiority test and equivalence testing (TOST) were shown to yield contradictory results in many cases (Barker et al., 2002). Furthermore, the sample size must be large enough to show non-inferiority or equivalence with sufficient statistical power (Leichsenring et al., 2015). As the margin usually represents a relatively small effect size, large samples are required for showing non-inferiority or equivalence (Leichsenring et al., 2015). Meta-analyses are useful for achieving sufficient statistical power (Borenstein et al., 2011; Steinert et al., 2017).
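
To make the relation between margin and sample size concrete, the following sketch uses the standard normal-approximation formulas for two equal groups. It rests on assumptions not stated in the text: the margin is expressed in standard deviation units, the true difference between treatments is zero, and the power is 0.80. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

_z = NormalDist().inv_cdf  # standard normal quantile function


def n_per_group_non_inferiority(margin, alpha=0.025, power=0.80):
    """Patients per group for a one-sided non-inferiority test (true difference assumed 0)."""
    return ceil(2 * (_z(1 - alpha) + _z(power)) ** 2 / margin ** 2)


def n_per_group_equivalence(margin, alpha=0.05, power=0.80):
    """Patients per group for TOST (true difference assumed 0)."""
    beta = 1 - power
    return ceil(2 * (_z(1 - alpha) + _z(1 - beta / 2)) ** 2 / margin ** 2)


# Halving the standardized margin roughly quadruples the required sample size.
for margin in (0.5, 0.25):
    print(margin, n_per_group_non_inferiority(margin), n_per_group_equivalence(margin))
```

Under these assumptions, a standardized margin of 0.25 requires roughly 250 to 275 patients per group, which is more than many individual psychotherapy trials include.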

The Chambless and Hollon criteria

In 1998, Chambless and Hollon (1998, p. 18) proposed the following criteria for empirically supported psychological therapies (ESTs). Of note, they do not use the term psychotherapies.

  1. A comparison with a no-treatment control group, alternative treatment group, or placebo

(a) in a randomized controlled trial, controlled single case experiment or equivalent time samples design, and

(b) in which the EST is statistically significantly superior to no treatment, placebo, or alternative treatments, or in which the EST is equivalent to a treatment already established in efficacy, and power is sufficient to detect moderate differences.

  2. These studies are required to have been conducted with

(a) a treatment manual or logical equivalent;

(b) a population treated for specific problems, for whom inclusion criteria have been delineated in a reliable, valid manner;

(c) reliable and valid outcome assessment measures, at minimum tapping the problems targeted for change; and

(d) appropriate data analysis.

3. A designation of efficacious requires that the EST has shown superiority in at least two independent research settings. In case of conflicting evidence, the preponderance of the data must support the EST.

4. For a designation as possibly efficacious, one study is sufficient in the absence of conflicting evidence.

5. For a designation as efficacious and specific, the EST must have been shown to be statistically significantly superior to pill or psychological placebo or to an alternative bona fide treatment in at least two independent research settings. In case of conflicting evidence, the preponderance of the data must support the EST’s efficacy and specificity.

The Chambless and Hollon criteria may be used as a check-list for ESTs, requiring:

(1) an RCT (or single case experiment or equivalent time samples design)

(2) a treatment manual or its logical equivalent

(3) a specifically, reliably and validly defined clinical population treated for a specific problem

(4)  reliable and valid outcome assessment measures

(5)  appropriate data analysis

For a designation of efficacious, the EST must be

(6) either superior to no treatment, placebo or alternative treatment

(7) or equivalent to an already established treatment with a power sufficient to detect moderate differences

(8) in at least two independent research settings

(9) in case of conflicting evidence, the preponderance of the data must support the EST’s efficacy

For a designation of efficacious and specific, the EST must be

(1) superior to pill or psychological placebo or to an alternative bona fide treatment

(2) in at least two independent research settings

(3) In case of conflicting evidence, the preponderance of the data must support the EST’s efficacy and specificity

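The checklist above can be read as a simple decision rule. The sketch below is one possible encoding, not the procedure of Chambless and Hollon (1998): the `Study` fields and result labels are hypothetical, and because the source does not define "preponderance of the data", it is operationalized here, purely as an assumption, as more supportive than unsupportive studies.

```python
from dataclasses import dataclass

# Hypothetical result labels for a single study
SUPERIOR = "superior"                    # superior to no treatment, placebo or alternative treatment
SUPERIOR_SPECIFIC = "superior_specific"  # superior to pill/psychological placebo or bona fide treatment
EQUIVALENT = "equivalent"                # equivalent to an established treatment, adequate power
UNSUPPORTIVE = "unsupportive"            # conflicting evidence


@dataclass
class Study:
    setting: str         # independent research setting
    meets_methods: bool  # checklist items (1) to (5) fulfilled
    result: str          # one of the labels above


def designate(studies):
    """Return a designation following the checklist (simplified sketch)."""
    valid = [s for s in studies if s.meets_methods]
    supportive = [s for s in valid if s.result != UNSUPPORTIVE]
    conflicting = [s for s in valid if s.result == UNSUPPORTIVE]
    # Assumption: "preponderance" means more supportive than conflicting studies.
    preponderance = len(supportive) > len(conflicting)
    settings = {s.setting for s in supportive}
    specific_settings = {s.setting for s in supportive if s.result == SUPERIOR_SPECIFIC}
    if preponderance and len(specific_settings) >= 2:
        return "efficacious and specific"
    if preponderance and len(settings) >= 2:
        return "efficacious"
    if supportive and not conflicting:
        return "possibly efficacious"
    return "no designation"


print(designate([Study("Site A", True, SUPERIOR), Study("Site B", True, EQUIVALENT)]))
```

Studies that do not meet the methodological items are ignored in this sketch; a real evaluation would weigh study quality rather than counting studies.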

Problems of the Chambless and Hollon criteria

The Chambless and Hollon criteria were criticized for several reasons; a summary was given by Tolin et al. (2015), who also updated the criteria for ESTs. The updated criteria are described in more detail below.

The Tolin et al. (2015) criteria for ESTs

Taking the critique of the Chambless and Hollon criteria into account, Tolin et al. (2015) updated the criteria of the American Psychological Association for ESTs. They focused on the following issues, presented here as a checklist:

(1) recent meta-analyses of RCTs instead of individual RCTs

(2) sufficient conceptual homogeneity between treatments included in a meta-analysis

(3) quality of the meta-analysis

(4) quality of the individual studies

(5) risk of bias of the individual studies

(6) treatment fidelity

(7) acceptable (statistical) heterogeneity

(8) clinically meaningful effects in symptom improvement

(9) clinically meaningful effects in functioning in addition to symptom improvement

(10) long-term (≥ 3 months) in addition to short-term effects

(11) no clinically significant differences in efficacy compared to other active treatments

(12) syndromes, not only categorical diagnoses

(13) generalizability (effectiveness, complex patients, usual therapists)

(14) data on harms justifying the application of the treatment

(15) data on cost-effectiveness justifying the application of the treatment

(16) a positive balance between benefits and costs and harms

(17) evidence for mechanisms of change

The new EST model (Tolin et al., 2015) suggests that an expert committee use the ‘Grading of Recommendations Assessment, Development, and Evaluation’ (GRADE) system to assess the quality of evidence and the degree to which benefits exceed potential harms (Atkins et al., 2004; Guyatt et al., 2006; G. H. Guyatt, A. D. Oxman, H. J. Schunemann, et al., 2011; Guyatt et al., 2008). The original GRADE system allows the evidence to be rated as “high quality”, “moderate quality”, “low quality” or “very low quality” (Atkins et al., 2004; Guyatt et al., 2006; Guyatt et al., 2008). Some observational studies, for example, may be rated as “very low quality” (G. Guyatt et al., 2011; G. H. Guyatt, A. D. Oxman, H. J. Schunemann, et al., 2011). If ratings of evidence differ between primary (critical) outcomes and other outcomes (e.g. side effects or costs), GRADE regards efficacy outcomes as the most important on most occasions and suggests that guideline panels can base their rating of the quality of evidence exclusively on data on efficacy (Guyatt et al., 2006, p. 180).

For high quality evidence, the new model requires a wide range of studies with no major limitations, small heterogeneity and narrow confidence intervals (Tolin et al., 2015). These recommendations deviate considerably from the original approach of the GRADE group (Leichsenring et al., 2023), which considered “one or more well-designed RCTs yielding consistent directly applicable results” as necessary for high quality evidence (Guyatt et al., 2006, p. 178). Thus, by requiring “a wide range of studies”, the new recommendations may result in fewer treatments being designated as supported by high quality evidence than under the original GRADE approach. Moderate quality evidence is defined by the updated model as “a few” studies, of which some have limitations but no major flaws, with some variation between studies or a wide confidence interval for the summary estimate (Tolin et al., 2015). Again, this recommendation deviates from the original approach of the GRADE group, which defined moderate quality evidence for RCTs in terms of “important” limitations (Guyatt et al., 2006, p. 178). Importantly, no definition is given for “a wide range of studies” or “few studies”, or for “narrow” or “wide” confidence intervals, leaving room for interpretation. Low quality evidence was originally restricted by the GRADE group to observational studies and only occasionally to RCTs with multiple serious limitations (Guyatt et al., 2006), whereas the newly proposed criteria define low quality evidence as referring to “studies” (without specifying RCTs or observational studies) with major flaws, or where there are important variations between studies and very wide confidence intervals for the summary estimate (Tolin et al., 2015, p. 14). In a next step, the original GRADE system results in “strong” or “weak” recommendations for a treatment (Atkins et al., 2004; Guyatt et al., 2006; Guyatt et al., 2008). In the new model of ESTs, a third category was introduced, i.e. a “very strong” recommendation (Tolin et al., 2015).

The new EST model was applied to psychodynamic therapy, which proved to be an empirically supported treatment for depressive, anxiety, personality and somatoform disorders (Leichsenring et al., 2023). This review may serve as an example of how to apply the new EST model. In contrast to some other recent reviews using the Tolin et al. (2015) criteria for ESTs (Boness et al., 2020; Boness et al., 2022; Pfund et al., in press; Tolin, Melnyk, & Marx, 2015), the review on psychodynamic therapy (Leichsenring et al., 2023) applied several criteria that are specified by the new EST model but were not used in those other reviews (Boness et al., 2020; Boness et al., 2022; Tolin, Melnyk, & Marx, 2015), including an explicit assessment of the original GRADE rating (risk of bias of individual studies, rating of inconsistency, indirectness and imprecision via the optimal information size, OIS), as well as assessments of the quality and treatment fidelity of individual studies and of the clinical significance of effect sizes.[†] Thus, in the CBT world (Boness et al., 2020; Boness et al., 2022; Tolin, Melnyk, & Marx, 2015), the Tolin et al. (2015) criteria were applied in a somewhat different way, with the exception of a review by Pfund et al. (2023).


[*] Nowadays, studies need to be pre-registered.

[†] The OIS is the sample size required to detect a clinically meaningful effect size with a power of 0.80 at α=0.05 (G. H. Guyatt, A. D. Oxman, R. Kunz, et al., 2011; Schünemann et al., 2022).

References
  • Atkins, D., Eccles, M., Flottorp, S., Guyatt, G. H., Henry, D., Hill, S., Liberati, A., O'Connell, D., Oxman, A. D., Phillips, B., Schunemann, H., Edejer, T. T., Vist, G. E., Williams, J. W., Jr., & Group, G. W. (2004). Systems for grading the quality of evidence and the strength of recommendations I: critical appraisal of existing approaches The GRADE Working Group. BMC Health Services Research, 4(1), 38. https://doi.org/10.1186/1472-6963-4-38

  • Barker, L. E., Luman, E. T., McCauley, M. M., & Chu, S. Y. (2002). Assessing equivalence: an alternative to the use of difference tests for measuring disparities in vaccination coverage [Comparative Study]. American Journal of Epidemiology, 156(11), 1056-1061. http://www.ncbi.nlm.nih.gov/pubmed/12446263

  • Boness, C. L., Hershenberg, R., Kaye, J., Mackintosh, M. A., Grasso, D. J., Noser, A., & Raffa, S. D. (2020). An Evaluation of Cognitive Behavioral Therapy for Insomnia: A Systematic Review and Application of Tolin's Criteria for Empirically Supported Treatments. Clin Psychol (New York), 27(4). https://doi.org/10.1111/cpsp.12348

  • Boness, C. L., Votaw, V., Schwebel, F. J., Moniz-Lewis, D. I., McHugh, R. K., & Witkiewitz, K. (2022). An Evaluation of Cognitive Behavioral Therapy for Substance Use: An Application of Tolin’s Criteria for Empirically Supported Treatments. Retrieved from osf.io/rbx8s, 25/01/2023.

  • Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2011). Introduction to meta-analysis. Wiley.

  • Chambless, D. L., & Hollon, S. D. (1998). Defining empirically supported therapies. Journal of Consulting and Clinical Psychology, 66, 7-18.

  • Guyatt, G., Gutterman, D., Baumann, M. H., Addrizzo-Harris, D., Hylek, E. M., Phillips, B., Raskob, G., Lewis, S. Z., & Schunemann, H. (2006). Grading strength of recommendations and quality of evidence in clinical guidelines: report from an american college of chest physicians task force. Chest, 129(1), 174-181. https://doi.org/10.1378/chest.129.1.174

  • Guyatt, G., Oxman, A. D., Akl, E. A., Kunz, R., Vist, G., Brozek, J., Norris, S., Falck-Ytter, Y., Glasziou, P., DeBeer, H., Jaeschke, R., Rind, D., Meerpohl, J., Dahm, P., & Schunemann, H. J. (2011). GRADE guidelines: 1. Introduction-GRADE evidence profiles and summary of findings tables. Journal of Clinical Epidemiology, 64(4), 383-394. https://doi.org/10.1016/j.jclinepi.2010.04.026

  • Guyatt, G. H., Oxman, A. D., Kunz, R., Brozek, J., Alonso-Coello, P., Rind, D., Devereaux, P. J., Montori, V. M., Freyschuss, B., Vist, G., Jaeschke, R., Williams, J. W., Jr., Murad, M. H., Sinclair, D., Falck-Ytter, Y., Meerpohl, J., Whittington, C., Thorlund, K., Andrews, J., & Schunemann, H. J. (2011). GRADE guidelines 6. Rating the quality of evidence--imprecision. Journal of Clinical Epidemiology, 64(12), 1283-1293. https://doi.org/10.1016/j.jclinepi.2011.01.012

  • Guyatt, G. H., Oxman, A. D., Schunemann, H. J., Tugwell, P., & Knottnerus, A. (2011). GRADE guidelines: a new series of articles in the Journal of Clinical Epidemiology. Journal of Clinical Epidemiology, 64(4), 380-382. https://doi.org/10.1016/j.jclinepi.2010.09.011

  • Guyatt, G. H., Oxman, A. D., Vist, G. E., Kunz, R., Falck-Ytter, Y., Alonso-Coello, P., Schunemann, H. J., & Group, G. W. (2008). GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ, 336(7650), 924-926. https://doi.org/10.1136/bmj.39489.470347.AD

  • Leichsenring, F., Abbass, A., Heim, N., Keefe, J. R., Kisely, S., Luyten, P., Rabung, S., & Steinert, C. (2023). The status of psychodynamic psychotherapy as an empirically supported treatment for common mental disorders - an umbrella review based on updated criteria. World Psychiatry, 22(2), 286-304. https://doi.org/10.1002/wps.21104

  • Leichsenring, F., Luyten, P., Hilsenroth, M. J., Abbass, A., Barber, J. P., Keefe, J. R., Leweke, F., Rabung, S., & Steinert, C. (2015). Psychodynamic therapy meets evidence-based medicine: a systematic review using updated criteria [Review]. Lancet Psychiatry, 2(7), 648-660. https://doi.org/10.1016/S2215-0366(15)00155-8

  • Lesaffre, E. (2008). Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU Hospital for Joint Diseases, 66, 150-154.

  • Pfund, R. A., Ginley, M. K., Boness, C. L., Rash, C. J., Zajac, K., & Witkiewitz, K. (in press). Contingency Management for Drug Use Disorders: Meta-Analysis and Application of Tolin’s Criteria. Clinical Psychology: Science and Practice.

  • Pfund, R. A., Ginley, M. K., Kim, H. S., Boness, C. L., Horn, T. L., & Whelan, J. P. (2023). Cognitive-behavioral treatment for gambling harm: Umbrella review and meta-analysis. Clinical Psychology Review, 105, 102336. https://doi.org/10.1016/j.cpr.2023.102336

  • Schünemann, H. J., Higgins, J. P. T., Vist, G. E., Glasziou, P., Akl, E. A., Skoetz, N., & Guyatt, G. H., on behalf of the Cochrane GRADEing Methods Group (formerly Applicability and Recommendations Methods Group) and the Cochrane Statistical Methods Group. (2022). Chapter 14: Completing ‘Summary of findings’ tables and grading the certainty of the evidence. In J. P. T. Higgins, J. Thomas, J. Chandler, M. Cumpston, T. Li, M. J. Page, & V. A. Welch (Eds.), Cochrane Handbook for Systematic Reviews of Interventions version 6.3 (updated February 2022).

  • Steinert, C., Munder, T., Rabung, S., Hoyer, J., & Leichsenring, F. (2017). Psychodynamic Therapy: As efficacious as other empirically supported treatments? A meta-analysis testing equivalence of outcomes. American Journal of Psychiatry, 174(10), 943-953. https://doi.org/10.1176/appi.ajp.2017.17010057

  • Tolin, D., McKay, D., Forman, E. M., Klonsky, E. D., & Thombs, B. D. (2015). Empirically supported treatments: recommendations for a new model. Clinical Psychology: Science and Practice. https://doi.org/10.1111/cpsp.12122

  • Tolin, D., Melnyk, T., & Marx, B. (2015). Exposure and Response Prevention for Obsessive-Compulsive Disorder. Retrieved from https://div12.org/treatment/exposure-and-response-prevention-for-obsessive-compulsive-disorder/, 25/01/2023.

  • Walker, E., & Nowacki, A. S. (2011). Understanding equivalence and noninferiority testing [Review]. Journal of General Internal Medicine, 26(2), 192-196. https://doi.org/10.1007/s11606-010-1513-8
