TY - JOUR
T1 - Interrater reliability of grading strength of evidence varies with the complexity of the evidence in systematic reviews
AU - Berkman, Nancy D.
AU - Lohr, Kathleen N.
AU - Morgan, Laura C.
AU - Kuo, Tzy Mey
AU - Morton, Sally C.
PY - 2013/10
Y1 - 2013/10
N2 - Objectives: To examine consistency (interrater reliability) of applying guidance for grading strength of evidence in systematic reviews for the Agency for Healthcare Research and Quality Evidence-based Practice Center program. Study Design and Setting: Using data from two systematic reviews, authors tested the main components of the approach: (1) scoring evidence on the four required domains (risk of bias, consistency, directness, and precision) separately for randomized controlled trials (RCTs) and observational studies and (2) developing an overall strength of evidence grade, given the scores for each of these domains. Results: Conclusions about overall strength of evidence reached by experienced systematic reviewers based on the same evidence can differ greatly, especially for complex bodies of evidence. Current instructions may be sufficient for straightforward quantitative evaluations that use meta-analysis for summarizing RCT findings. In contrast, agreement suffered when evaluations did not lend themselves to meta-analysis and reviewers needed to rely on their own qualitative judgment. Three areas raised particular concern: (1) evidence from a combination of RCTs and observational studies, (2) outcomes with differing measurement, and (3) evidence that appeared to show no differences in outcomes. Conclusion: Interrater reliability was highly variable for scoring strength of evidence domains and combining scores to reach overall strength of evidence grades. Future research can help in establishing improved methods for evaluating these complex bodies of evidence.
AB - Objectives: To examine consistency (interrater reliability) of applying guidance for grading strength of evidence in systematic reviews for the Agency for Healthcare Research and Quality Evidence-based Practice Center program. Study Design and Setting: Using data from two systematic reviews, authors tested the main components of the approach: (1) scoring evidence on the four required domains (risk of bias, consistency, directness, and precision) separately for randomized controlled trials (RCTs) and observational studies and (2) developing an overall strength of evidence grade, given the scores for each of these domains. Results: Conclusions about overall strength of evidence reached by experienced systematic reviewers based on the same evidence can differ greatly, especially for complex bodies of evidence. Current instructions may be sufficient for straightforward quantitative evaluations that use meta-analysis for summarizing RCT findings. In contrast, agreement suffered when evaluations did not lend themselves to meta-analysis and reviewers needed to rely on their own qualitative judgment. Three areas raised particular concern: (1) evidence from a combination of RCTs and observational studies, (2) outcomes with differing measurement, and (3) evidence that appeared to show no differences in outcomes. Conclusion: Interrater reliability was highly variable for scoring strength of evidence domains and combining scores to reach overall strength of evidence grades. Future research can help in establishing improved methods for evaluating these complex bodies of evidence.
KW - Agency for Healthcare Research and Quality
KW - Comparative effectiveness
KW - Evidence-based practice
KW - Interrater reliability
KW - Strength of evidence
KW - Systematic review methodology
UR - http://www.scopus.com/inward/record.url?scp=84883355019&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84883355019&partnerID=8YFLogxK
U2 - 10.1016/j.jclinepi.2013.06.002
DO - 10.1016/j.jclinepi.2013.06.002
M3 - Article
C2 - 23993312
AN - SCOPUS:84883355019
SN - 0895-4356
VL - 66
SP - 1105-1117.e1
JO - Journal of Clinical Epidemiology
JF - Journal of Clinical Epidemiology
IS - 10
ER -