What should inter-rater reliability be?

Inter-rater reliability (IRR) assessments are performed on a sample of abstracted cases to measure the degree of agreement among reviewers. By reabstracting a sample of the same charts to determine accuracy, we can project that information to the total cases abstracted and thus gauge the abstractor's knowledge of the specifications. We perform IRR often because measures and their specifications are dynamic and change over time. Data Element Agreement Rate (DEAR) results should be used to identify data element mismatches and pinpoint education opportunities for abstractors.

It is also important to analyze DEAR results for trends among mismatches within a specific data element or for a particular abstractor, to determine whether a more focused review is needed to ensure accuracy across all potentially affected charts. Category Assignment Agreement Rate (CAAR) results, by contrast, should be used to identify the overall impact of data element mismatches on the measure outcomes; in other words, the second abstractor should arrive at the same numerator and denominator assignments reported by the original abstractor. A simple way to compute both rates is sketched below.
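
This is an illustration only: the data element names, values, and category labels are hypothetical, and the arithmetic treats DEAR as percent agreement at the data-element level and CAAR as percent agreement at the measure-category level, which is one common reading of these rates.

```python
# Hypothetical sketch: computing DEAR and CAAR for a reabstraction sample.
# Element names, values, and categories are made up for illustration.

original = [
    {"arrival_time_ok": "Y", "ecg_within_10_min": "Y", "category": "Numerator"},
    {"arrival_time_ok": "Y", "ecg_within_10_min": "N", "category": "Excluded"},
    {"arrival_time_ok": "N", "ecg_within_10_min": "Y", "category": "Denominator only"},
]
reabstracted = [
    {"arrival_time_ok": "Y", "ecg_within_10_min": "Y", "category": "Numerator"},
    {"arrival_time_ok": "Y", "ecg_within_10_min": "Y", "category": "Numerator"},
    {"arrival_time_ok": "N", "ecg_within_10_min": "Y", "category": "Denominator only"},
]

elements = ["arrival_time_ok", "ecg_within_10_min"]

# DEAR: proportion of individual data elements on which both abstractors agree.
element_pairs = [(o[e], r[e]) for o, r in zip(original, reabstracted) for e in elements]
dear = sum(a == b for a, b in element_pairs) / len(element_pairs) * 100

# CAAR: proportion of cases assigned to the same measure category by both abstractors.
caar = sum(o["category"] == r["category"] for o, r in zip(original, reabstracted)) / len(original) * 100

print(f"DEAR: {dear:.1f}%")  # 83.3% in this made-up sample
print(f"CAAR: {caar:.1f}%")  # 66.7% in this made-up sample
```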

Incorporating inter-rater reliability checks into your routine can reduce data abstraction errors by identifying the need for abstractor education or re-education, and it gives you confidence that your data are not only valid but reliable. If a test has low inter-rater reliability, this can be an indication that the items on the test are confusing, unclear, or even unnecessary.

There are two common ways to measure inter-rater reliability: percent agreement and Cohen's kappa. The simpler way is to calculate the percentage of items that the judges agree on. This is known as percent agreement, which always ranges between 0 and 1, with 0 indicating no agreement between raters and 1 indicating perfect agreement between raters.

For example, suppose two judges are asked to rate the difficulty of 10 items on a test on a scale of 1 to 3. Percent agreement is then simply the proportion of the 10 items to which both judges assign the same difficulty rating; an illustration is sketched below. The higher the inter-rater reliability, the more consistently multiple judges rate items or questions on a test with similar scores.
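
A minimal sketch of the calculation, using made-up ratings for the ten items (the original article's table is not reproduced here):

```python
# Percent agreement for two judges rating 10 items on a 1-3 scale.
# The ratings below are invented purely for illustration.
judge_a = [1, 2, 3, 1, 2, 2, 3, 1, 2, 3]
judge_b = [1, 2, 3, 2, 2, 2, 3, 1, 1, 3]

agreements = sum(a == b for a, b in zip(judge_a, judge_b))
percent_agreement = agreements / len(judge_a)

print(f"Agreed on {agreements} of {len(judge_a)} items")
print(f"Percent agreement: {percent_agreement:.2f}")  # 0.80 with these ratings
```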

However, higher inter-rater reliabilities may be needed in specific fields. The second common measure, Cohen's kappa, adjusts the observed agreement for the agreement that would be expected by chance alone. An example of the calculated kappa statistic may be found in Figure 3. Notice that the percent agreement is 0.

The greater the expected chance agreement, the lower the resulting value of the kappa. Unfortunately, marginal sums may or may not estimate the amount of chance rater agreement under uncertainty. Thus, it is questionable whether the reduction in the estimate of agreement provided by the kappa statistic is actually representative of the amount of chance rater agreement. Theoretically, Pr(e) is an estimate of the rate of agreement that would be observed if the raters guessed on every item, guessed at rates similar to their marginal proportions, and were entirely independent of one another. None of these assumptions is warranted, and thus there is much disparity of opinion about the use of kappa among researchers and statisticians.
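
For reference, Cohen's kappa is computed as kappa = (Pr(a) - Pr(e)) / (1 - Pr(e)), where Pr(a) is the observed agreement and Pr(e) is the chance agreement estimated from each rater's marginal proportions. A minimal sketch, using the same kind of two-rater, three-category data as above (ratings again invented):

```python
from collections import Counter

# Invented ratings for two raters on a 1-3 scale (illustration only).
rater_1 = [1, 2, 3, 1, 2, 2, 3, 1, 2, 3]
rater_2 = [1, 2, 3, 2, 2, 2, 3, 1, 1, 3]
n = len(rater_1)

# Observed agreement Pr(a): proportion of items with identical ratings.
pr_a = sum(a == b for a, b in zip(rater_1, rater_2)) / n

# Chance agreement Pr(e): for each category, the product of the two raters'
# marginal proportions, summed over all categories.
marg_1 = Counter(rater_1)
marg_2 = Counter(rater_2)
pr_e = sum((marg_1[c] / n) * (marg_2[c] / n) for c in set(rater_1) | set(rater_2))

kappa = (pr_a - pr_e) / (1 - pr_e)
print(f"Pr(a) = {pr_a:.2f}, Pr(e) = {pr_e:.2f}, kappa = {kappa:.2f}")  # 0.80, 0.34, 0.70
```

Note how the chance correction pulls the estimate down from a percent agreement of 0.80 to a kappa of about 0.70 for these invented data.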

A good example of the reason for concern about the meaning of obtained kappa results is exhibited in a paper that compared human visual detection of abnormalities in biological samples with automated detection. Are the obtained results indicative of the great majority of patients receiving accurate laboratory results, and thus correct medical diagnoses, or not?

While data sufficient to calculate a percent agreement are not provided in the paper, the kappa results were only moderate.

How shall the laboratory director know if the results represent good quality readings with only a small amount of disagreement among the trained laboratory technicians, or if a serious problem exists and further training is needed?

Unfortunately, the kappa statistic does not provide enough information to make such a decision. Furthermore, a kappa may have such a wide confidence interval (CI) that it includes anything from good to poor agreement. Once kappa has been calculated, the researcher will therefore likely want to evaluate its meaning by calculating a confidence interval around it. The percent agreement statistic, by contrast, is a direct measure and not an estimate.

There is therefore little need for confidence intervals around percent agreement. The kappa is, however, an estimate of interrater reliability, and confidence intervals are therefore of more interest. Theoretically, the confidence interval is constructed by adding to and subtracting from kappa the critical value for the desired confidence level (for example, 1.96 for a 95% CI) multiplied by the standard error of kappa, that is, CI = kappa ± z × SE(kappa).
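
As a sketch only: one simple large-sample approximation for the standard error of kappa is SE(kappa) ≈ sqrt(Pr(a)(1 - Pr(a)) / (n(1 - Pr(e))^2)); published studies may use a more detailed variance formula, so treat the code below, which continues the invented example above, as illustrative rather than definitive.

```python
import math

def kappa_confidence_interval(pr_a, pr_e, n, z=1.96):
    """Approximate CI for Cohen's kappa.

    Uses the simple large-sample standard error
    SE(kappa) ~= sqrt(pr_a * (1 - pr_a) / (n * (1 - pr_e) ** 2)),
    which is an approximation; more detailed variance formulas exist.
    """
    kappa = (pr_a - pr_e) / (1 - pr_e)
    se = math.sqrt(pr_a * (1 - pr_a) / (n * (1 - pr_e) ** 2))
    return kappa, kappa - z * se, kappa + z * se

# Continuing the invented example above: Pr(a) = 0.80, Pr(e) = 0.34, n = 10.
k, lower, upper = kappa_confidence_interval(0.80, 0.34, 10)
print(f"kappa = {k:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")  # very wide at n = 10
```

With only ten comparisons the interval is very wide (the crude approximation can even spill past the theoretical maximum of 1), which illustrates the point about sample size made next.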

The larger the number of observations measured, the smaller the expected standard error. While the kappa can be calculated for fairly small sample sizes, the resulting CI is then likely to be too wide to be informative. As a general heuristic, samples should consist of no fewer than 30 comparisons.

Sample sizes of 1,000 or more are mathematically most likely to produce very small CIs, which means the estimate of agreement is likely to be very precise. Both percent agreement and kappa have strengths and limitations. The percent agreement statistic is easily calculated and directly interpretable. Its key limitation is that it does not take account of the possibility that raters guessed on scores, and it may therefore overestimate the true agreement among raters. The kappa was designed to take account of the possibility of guessing, but the assumptions it makes about rater independence and other factors are not well supported, and it may therefore lower the estimate of agreement excessively.

Furthermore, it cannot be directly interpreted, and thus it has become common for researchers to accept low kappa values in their interrater reliability studies. Low levels of interrater reliability are not acceptable in health care or in clinical research, especially when results of studies may change clinical practice in a way that leads to poorer patient outcomes. Perhaps the best advice for researchers is to calculate both percent agreement and kappa. If there is likely to be much guessing among the raters, it may make sense to use the kappa statistic, but if raters are well trained and little guessing is likely to exist, the researcher may safely rely on percent agreement to determine interrater reliability.
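
Assuming scikit-learn is available, both numbers can be reported side by side; cohen_kappa_score implements Cohen's kappa, and the ratings below are the same invented ones used in the earlier sketches.

```python
from sklearn.metrics import cohen_kappa_score

# Invented ratings, as in the earlier sketches.
rater_1 = [1, 2, 3, 1, 2, 2, 3, 1, 2, 3]
rater_2 = [1, 2, 3, 2, 2, 2, 3, 1, 1, 3]

percent_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
kappa = cohen_kappa_score(rater_1, rater_2)

# Reporting both makes it easy to see how much the chance correction lowers the estimate.
print(f"Percent agreement: {percent_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```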



