We aimed to determine the reproducibility of assessments made by independent reviewers of papers submitted for publication to clinical neuroscience journals and abstracts submitted for presentation at clinical neuroscience conferences. We studied two journals in which manuscripts were routinely assessed by two reviewers, and two conferences in which abstracts were routinely scored by multiple reviewers. Agreement between the reviewers as to whether manuscripts should be accepted, revised or rejected was not significantly greater than that expected by chance [κ = 0.08, 95% confidence interval (CI) –0.04 to 0.20] for 179 consecutive papers submitted to Journal A, and was poor (κ = 0.28, 95% CI 0.12 to 0.40) for 116 papers submitted to Journal B. However, editors were much more likely to publish papers when both reviewers recommended acceptance than when they disagreed or recommended rejection (Journal A, odds ratio = 73, 95% CI 27 to 200; Journal B, odds ratio = 51, 95% CI 17 to 155). There was little or no agreement between the reviewers as to the priority (low, medium, or high) for publication (Journal A, κ = –0.12, 95% CI –0.30 to 0.11; Journal B, κ = 0.27, 95% CI 0.01 to 0.53). Abstracts submitted for presentation at the conferences were given a score of 1 (poor) to 6 (excellent) by multiple independent reviewers. For each conference, analysis of variance of the scores given to abstracts revealed that differences between individual abstracts accounted for only 10–15% of the total variance of the scores. Thus, although the recommendations made by reviewers have considerable influence on the fate of both papers submitted to journals and abstracts submitted to conferences, agreement between reviewers in clinical neuroscience was little greater than would be expected by chance alone.
Peer review is central to the process of modern science. It influences which projects get funded and where research is published. Although there is evidence that peer review improves the quality of reporting of the results of research (Locke, 1985; Gardner and Bond, 1990; Goodman et al., 1994), it is susceptible to several biases (Peters and Ceci, 1982; Maddox, 1992; Horrobin, 1996; Locke, 1998; Wenneras and Wold, 1997), and some have argued that it actually inhibits the dissemination of new ideas (Peters and Ceci, 1982; Horrobin, 1990). These shortcomings might be tolerable if the peer review process could be shown to be effective in maximizing the likelihood that research of the highest quality is funded and published. Unfortunately, there is no objective standard of quality of a scientific report or grant application against which the sensitivity or specificity of peer review can be assessed. However, the lack of a quality standard does not prevent measurement of the reproducibility of peer review. How often do independent referees agree about the quality of a paper or abstract? Quality is related to factors such as originality, appropriateness of methods, analysis of results, and whether the conclusions are justified by the data given. Consistency in these assessments should lead to some agreement about overall quality. Poor reproducibility casts doubt on the utility of any measurement, whether made quantitatively by an instrument or assay, or qualitatively by a reviewer assessing a manuscript or abstract.
The authors wrote to the editors of five major clinical neuroscience journals requesting access to the assessments of manuscripts made by external reviewers. The editors of two of the journals were willing to allow this. Journal A sent all manuscripts to two independent reviewers; we studied a consecutive sample of 179 manuscripts submitted over a 6-month period. Journal B allowed us to study a consecutive series of 200 manuscripts. We analysed reports on the 116 (58%) papers that had been assessed by two reviewers. The remainder had been assessed by only one reviewer (n = 54), or only one of the two reviewers had completed the structured assessment (n = 30). Both journals required the reviewers to complete a structured assessment form as part of their review of the manuscript. In both cases, the reviewers were asked to make the following assessments: (i) should the manuscript be accepted, revised or rejected? (ii) was the priority for publication low, medium or high?
Agreement between the reviewers was calculated for each assessment. Agreement was expressed as a κ statistic (Thompson and Walter, 1988) rather than a simple percentage, in order to measure the extent to which agreement was greater than that expected by chance. A κ value of 0 represents chance agreement and a value of 1 indicates perfect agreement. Intermediate κ values are generally classified as follows: 0–0.2 = very poor; 0.2–0.4 = poor; 0.4–0.6 = moderate; 0.6–0.8 = good; 0.8–1.0 = excellent. Negative κ values indicate systematic disagreement, i.e. agreement less than that expected by chance.
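The calculation underlying the κ statistic can be sketched as follows (a minimal illustration in Python; the 3 × 3 accept/revise/reject table below is hypothetical, not the journals' actual data):

```python
# Cohen's kappa: observed agreement corrected for the agreement
# expected by chance from the two reviewers' marginal distributions.

def cohen_kappa(table):
    """Kappa for a square agreement table, where table[i][j] is the
    number of manuscripts rated category i by reviewer 1 and
    category j by reviewer 2."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: proportion of manuscripts on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: sum over categories of the product of the
    # two reviewers' marginal proportions.
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(row_marg[i] * col_marg[i] for i in range(k))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical accept/revise/reject counts for 100 manuscripts.
table = [[10, 5, 5],
         [5, 20, 10],
         [5, 10, 30]]
print(round(cohen_kappa(table), 2))  # → 0.37
```

Here 60% of the hypothetical manuscripts fall on the diagonal, but because chance alone would produce 36.5% agreement given these marginals, κ is only 0.37.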
Scores given to abstracts by independent reviewers were obtained for two clinical neuroscience conferences. For both meetings, the majority of abstracts submitted for poster presentation only were accepted. Our analyses were, therefore, limited to abstracts submitted for oral presentation ('platform preferred'). The scoring of these abstracts determined the manner in which they were presented; abstracts with the highest mean scores were allocated time for an oral presentation whereas those with lower scores were accepted as a poster or were rejected. For both conferences, the reviewers were requested to give each abstract an integer score between 1 (poor quality, unsuitable for inclusion in the meeting) and 6 (excellent). They were asked to consider both the scientific merit of the work and the likely level of interest to the conference participants. The abstracts were scored by 16 reviewers for Meeting A and 14 reviewers for Meeting B. Abstract scores were analysed by ANOVA (analysis of variance) using the statistical package SPSS for Windows, Release 6.1. The contributions of abstract identity and reviewer identity to the total variance amongst the abstract scores were determined.
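The variance decomposition described above can be illustrated with a two-way layout (abstract × reviewer) containing one score per cell, partitioning the total sum of squares into abstract, reviewer and residual components. This is a sketch only; the score matrix below is invented for illustration, not the conferences' data:

```python
def ss_components(scores):
    """Partition the total sum of squares of a complete
    abstract-by-reviewer score matrix (rows = abstracts,
    columns = reviewers) into abstract, reviewer and
    residual components."""
    n_abs = len(scores)
    n_rev = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_abs * n_rev)
    abs_means = [sum(row) / n_rev for row in scores]
    rev_means = [sum(scores[i][j] for i in range(n_abs)) / n_abs
                 for j in range(n_rev)]
    ss_total = sum((scores[i][j] - grand) ** 2
                   for i in range(n_abs) for j in range(n_rev))
    # Between-abstract variation: how much abstract means differ.
    ss_abstract = n_rev * sum((m - grand) ** 2 for m in abs_means)
    # Between-reviewer variation: systematic leniency/severity.
    ss_reviewer = n_abs * sum((m - grand) ** 2 for m in rev_means)
    ss_residual = ss_total - ss_abstract - ss_reviewer
    return ss_abstract, ss_reviewer, ss_residual, ss_total

# Invented scores: 4 abstracts rated 1-6 by 3 reviewers.
scores = [[4, 2, 5], [3, 1, 4], [6, 3, 5], [2, 2, 3]]
ss_a, ss_r, ss_res, ss_t = ss_components(scores)
print(f"abstracts: {ss_a / ss_t:.0%}, reviewers: {ss_r / ss_t:.0%}")
```

The fraction ss_abstract/ss_total corresponds to the "differences between individual abstracts" percentages reported in the Results.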
Journal A accepted for publication 80 (45%) of the 179 papers submitted during the study period; Journal B accepted 47 (41%) of the 116 papers submitted during the study period. The reviewers for Journal A agreed on the recommendation for publication, or otherwise, for 47% of manuscripts and the reviewers for Journal B agreed for 61% of the manuscripts (Table 1). The corresponding κ values were 0.08 (95% CI = –0.04 to 0.20) and 0.28 (95% CI 0.12 to 0.40). The observed proportions of agreement are compared with the proportions that would have been expected by chance in Fig. 1.
Agreement between independent reviewers on the assessment of manuscripts submitted to two journals of clinical neuroscience. Reviewers were asked to assess whether manuscripts should be accepted, revised or rejected (Manuscript acceptance) and, if suitable for publication, whether their priority was low, medium or high (Manuscript priority). The observed agreements are compared with the level of agreement expected by chance. The error bars show the 95% confidence intervals.
For those manuscripts where both reviewers agreed that the paper was suitable for publication (with or without revision), they agreed on the priority for publication in 35% of cases for Journal A and 61% of cases for Journal B (Table 2). Corresponding κ values were –0.12 (95% CI = –0.30 to 0.11) and 0.27 (95% CI = 0.01 to 0.53). The observed proportions of agreement are compared with the proportions that would have been expected by chance in Fig. 1.
Assessments of two independent reviewers of the priority for publication of those papers submitted to clinical neuroscience journals where both reviewers recommended acceptance
Journal A: agreement = 35%, κ = –0.12 (95% CI –0.30 to 0.11)
Journal B: agreement = 61%, κ = 0.27 (95% CI 0.01 to 0.53)
Manuscripts that both reviewers agreed were suitable for publication (with or without revision) were more likely to be published than those for which they disagreed or both recommended rejection (Fig. 2). Journal A published 66 (92%) of the 72 manuscripts that were recommended for publication by both reviewers compared with 14 (13%) of the 107 remaining manuscripts (odds ratio = 73, 95% CI = 27 to 200). Journal B published 40 (85%) of the 47 manuscripts recommended for publication by both reviewers compared with 7 (10%) of the 69 remaining manuscripts (odds ratio = 51, 95% CI = 17 to 155).
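These odds ratios follow directly from the 2 × 2 counts given above. A minimal sketch, using the Journal A counts and the usual log-odds-ratio normal approximation for the 95% CI (the paper does not state which interval method was used, so this is an assumption):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI for a 2x2 table:
    a = published, both recommended;  b = not published, both recommended;
    c = published, otherwise;         d = not published, otherwise.
    CI via the Woolf log-odds-ratio normal approximation."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Journal A: 66 of 72 published when both reviewers recommended
# acceptance; 14 of the remaining 107 published.
or_, lo, hi = odds_ratio_ci(66, 6, 14, 93)
print(f"OR = {or_:.0f}, 95% CI {lo:.0f} to {hi:.0f}")
# → OR = 73, 95% CI 27 to 200
```

Reassuringly, this reproduces the Journal A figures reported in the text (73, 27 to 200).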
The proportions of manuscripts submitted to two clinical neuroscience journals that were accepted for publication according to whether two independent reviewers both recommended acceptance, disagreed, or both recommended rejection. The error bars show the 95% confidence intervals.
There was statistically significant heterogeneity in the mean scores (ANOVA) given to the abstracts submitted to the conferences (Meeting A, 32 abstracts, P < 0.001; Meeting B, 28 abstracts, P < 0.005). There were also significant differences between the mean scores given by different reviewers (Meeting A, 16 reviewers, P < 0.001; Meeting B, 14 reviewers, P < 0.001). Over a quarter of the variance in abstract scores (27% for Meeting A and 32% for Meeting B) could be accounted for by the tendency for some reviewers to give higher or lower scores than others (Table 3). Only a small proportion of the variance in abstract scores could be accounted for by differences between the mean scores given to abstracts (11% for Meeting A and 15% for Meeting B) (Table 3).
Analysis of variance of the scoring of abstracts by reviewers for abstracts submitted to meetings A and B
In neither of the journals that we studied was agreement between independent reviewers on whether manuscripts should be published, or their priority for publication, convincingly greater than that which would have been expected by chance alone. The scoring of conference abstracts by a larger number of independent reviewers did not lead to any greater consistency. In other words, the reproducibility of the peer review process in these instances was very poor. Although the journals and meetings which we studied were not chosen at random, we believe that they are likely to be representative of their type, i.e. specialist journals and meetings in clinical neuroscience.
Poor inter-observer reproducibility of peer review has been reported in several non-medical sciences (Scott, 1974; McCartney, 1973; Cicchetti, 1980; Cole et al., 1981), but the few previous studies of peer review for medical journals have produced conflicting results. Locke reported inter-observer κ values ranging from 0.11 to 0.49 for agreement between a number of reviewers making recommendations on a consecutive series of manuscripts submitted to the British Medical Journal during 1979 (Locke, 1985). Agreement was significantly greater than that expected by chance, but this may have been because the reviewers making the recommendations were professional journal editors. Strayhorn and colleagues reported a κ value of 0.12 (poor agreement) for the accept–reject dichotomy for 268 manuscripts submitted to the Journal of the American Academy of Child and Adolescent Psychiatry (Strayhorn et al., 1993). Similarly low levels of agreement were reported by Scharschmidt et al. for papers submitted to the Journal of Clinical Investigation (Scharschmidt et al., 1994).
Previous studies of agreement between reviewers in the grading of abstracts submitted to biomedical meetings have produced results similar to our own (Cicchetti and Conn, 1976; Rubin et al., 1993). Rubin and colleagues reported κ values ranging from 0.11 to 0.18 for agreement between individual reviewers, and found that differences between abstracts accounted for 36% of the total variance in abstract scores. It has even been shown that the likelihood of an abstract being accepted can be related to the typeface used (Koren, 1986).
We also found that the assessments made by reviewers were strongly predictive of whether or not manuscripts were accepted for publication. The odds of acceptance for manuscripts recommended for publication by both reviewers were roughly 50 to 70 times those for manuscripts about which the reviewers disagreed or which both recommended rejecting.
Given this reliance on peer review, should we be concerned about the lack of reproducibility? Some authors have argued that poor reproducibility is not a problem, and that different reviewers should not necessarily be expected to agree (Locke, 1985; Bailar, 1991; Fletcher and Fletcher, 1993). For example, an editor might deliberately choose two reviewers who he or she knows are likely to have different points of view. This may be so, but if peer review is an attempt to measure the overall quality of research in terms of originality, the appropriateness of the methods used, analysis of the data, and justification of the conclusions, then a complete lack of reproducibility is a problem. These specific assessments should be relatively objective and hence reproducible.
Why then is the reproducibility of peer review so poor? There are several possibilities. First, some reviewers may not be certain about which aspects of the work they should be assessing. Secondly, some reviewers may not have the time, the knowledge or training required to assess research properly. When deliberately flawed papers are sent for review the proportion of major errors picked up by reviewers is certainly low (Godlee et al., 1998). Thirdly, it is possible that reviewers do agree on the more specific assessments of the quality of research, but that this consistency is undermined by personal opinions and biases. For example, assessments of reviewers have been shown to be biased by the fame of the authors or the institution in which the work was performed (Peters and Ceci, 1982), and by conflicts of interest due to friendship or competition and rivalry between the reviewer and the authors (Locke, 1988; Maddox, 1992). It has been shown that reviewers recommended by authors themselves give much more favourable assessments than those chosen by editors (Scharschmidt et al., 1994).
How might the quality and reproducibility of peer review be improved? Neither blinding reviewers to the authors and origin of the paper nor requiring them to sign their reports appears to have any effect on the quality of peer review (Godlee et al., 1998). However, the use of standardized assessment forms has been shown to increase agreement between reviewers (Strayhorn et al., 1993), and appears to be particularly important in the assessment of study methods, the analysis of data and the presentation of the results (Gardner et al., 1986). Editors might also consider publishing a short addendum to papers detailing the major comments of the reviewers, along with their identity. However, it should be borne in mind that many researchers already spend as much time participating in peer review as they spend doing research (Gillett, 1993). Any increase in this considerable workload might be difficult to justify. Whether payment of reviewers for their reports, as is the practice of some journals, increases the quality of reports is unknown. This and other policies are quite amenable to testing in randomized controlled trials. Open peer review on the internet of articles submitted to journals is currently under investigation (Bingham et al., 1998).
Peer review of articles submitted to journals and abstracts submitted to meetings does achieve a number of important ends irrespective of whether it is reproducible. It helps those responsible to decide which papers or abstracts should be published or presented. The comments of reviewers generally improve the quality of papers whether or not they are accepted for publication. Peer review also gives the impression that decisions are arrived at in a fair and meritocratic manner. Therefore, even if the results of peer review were essentially a reflection of chance, the process would still serve a useful purpose. However, given the biases inherent in peer review, its tendency to suppress innovation and the enormous cost of the process in terms of the time reviewers spend on the work, the lack of reproducibility does cast some doubt on the overall utility of the process in its present form. Finally, many of the arguments that apply to peer review for journal articles and conference abstracts also apply to peer review of grant applications (Greenberg, 1998; Wessely, 1998). There is a need for further research into peer review in each of these areas.
We wish to thank Professor R. A. C. Hughes, Professor Jan van Gijn, Mrs E.B.M. Budelman-Verschuren, Suzanne Miller and Chris Holland for their help and collaboration. We are grateful to Kathy Rowan, Catharine Gale and Paul Winter for administrative help.
Scharschmidt BF, DeAmicis A, Bacchetti P, Held MJ. Chance, concurrence and clustering: analysis of reviewers' recommendations on 1000 submissions to the Journal of Clinical Investigation. J Clin Invest 1994; 93: 1877–80.
Strayhorn J Jr, McDermott JF Jr, Tanguay P. An intervention to improve the reliability of manuscript reviews for the Journal of the American Academy of Child and Adolescent Psychiatry. Am J Psychiatry 1993; 150: 947–52.