Brain Advance Access published online on May 21, 2008
Brain, doi:10.1093/brain/awn091
A plea for confidence intervals and consideration of generalizability in diagnostic studies
1Department of Psychiatry, University Clinic Freiburg, Freiburg, Germany, 2Wellcome Trust Centre for Neuroimaging, Institute of Neurology, University College London, London, United Kingdom, 3Department of Psychiatry and Psychology, Mayo Clinic, Scottsdale, AZ, USA (CMS) and Department of Radiology, Mayo Clinic, Rochester, MN, USA (CRJ), 4Dementia Research Centre, Department of Neurodegenerative Disease, Institute of Neurology, University College London, London, United Kingdom, 5Département d'études cognitives, Ecole Normale Supérieure, Paris, France and 6Laboratory of Neuroimaging, IRCCS Santa Lucia, Roma, Italy
Correspondence to:
Stefan Klöppel, MD, Department of Psychiatry, Hauptstr. 5, 79104 Freiburg, Germany E-mail: stefan.kloeppel{at}uniklinik-freiburg.de
Received April 18, 2008. Accepted April 18, 2008.
Sir, Many thanks for letting us respond to the interesting letter concerning our recent paper. We are grateful for the chance to clarify the points raised, which suggest that our conclusions were too optimistic. In our paper (Kloppel et al., 2008), we used MRI scans from pathologically proven cases of Alzheimer's disease and frontotemporal lobar degeneration (FTLD) to validate trained sets for a support vector machine (SVM), a machine learning approach, which we used to separate structural scans of each disease from normal scans and from each other.
This rigorous approach substantially limited the number of available subjects, which we made perfectly clear in our article, but which was unavoidable given our novel approach. Frost and colleagues are right to point out that such low numbers result in wider confidence intervals (CIs) than if we had been able to include more scans. This is the subject of our further empirical studies: how much does classification improve with this technique when the trained set contains more scans? The graph below (Fig. 1) illustrates diagnostic accuracy when the whole brain grey matter segment is used to separate patients with probable Alzheimer's disease at all clinical stages (MMSE range of 3 to 30; defined clinically in the same way as group III in our original paper) from controls. Classification is performed repeatedly, removing one Alzheimer's disease patient and one control each time. Results are fairly stable, but accuracy becomes more variable as subjects are removed, until a steep decline occurs when fewer than around 20 subjects per group are included. Suffice it to say we were surprised how well Alzheimer's disease was distinguished from FTLD, given the even smaller numbers of validated scans we had available for that classification. To clarify these issues, we provide a table that supplements our data with CIs. Further, we found very similar results using two completely independent datasets, and the CIs become relatively small when data from the first two datasets are combined. So, although we agree in theory with the point raised, in practice the results stand as proof of principle.
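The dependence of CI width on group size can be illustrated with a short calculation. The sketch below is ours for illustration only: it assumes the Wilson score interval, one of the methods compared by Newcombe (1998), and the counts (18 of 20 and 90 of 100 scans correctly classified) are hypothetical, not results from our study.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Two-sided Wilson score interval for a single proportion.

    successes: number of correctly classified scans (hypothetical counts here)
    n:         total number of scans classified
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Normal quantile for the requested two-sided confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width


# Same observed accuracy (90%), different sample sizes (illustrative only)
for correct, total in ((18, 20), (90, 100)):
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: 95% CI {low:.3f} to {high:.3f}")
```

With the same observed proportion, the interval for 20 scans is markedly wider than that for 100, which is the point made by Frost and colleagues.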
We also agree that some statements found on the BBC's website (BBC, 2008
It is important to emphasize that such multivariate methods generalize to new data. Figure 1 in our original paper illustrates that, during training, the samples from individual subjects (i.e. normalized grey matter segments from either the whole brain or the hippocampus area) that best separate the two groups define the decision boundary. The figure is an example with two dimensions, but in reality the number of dimensions equals the number of voxels used. If a classifier generalizes well, a new scan will be assigned to the same side of the decision boundary as the rest of its diagnostic group. It is a critical part of our results that the decision boundary defined by data from one imaging centre is sufficiently general to separate accurately data from other imaging centres that use different hardware and sequences. This ability is of great practical relevance, as a library of well-defined cases can be made available to referral centres as a general trained set to diagnose scans collected there.
While our results are promising, as we pointed out in our article, a formal comparison with modern conventional clinical assessment is required. It should be kept in mind that we used very strict inclusion criteria, and the extension to relatively poorly defined data from primary referral centres needs to be addressed in a separate study. It is likely that libraries of scans from very early stages of the disease will need to be produced and then validated longitudinally or pathologically. The issue now is to optimize the variables to maximize sensitivity and accuracy. One lesson we learned is that proper validation of the scans included in the trained set is likely to be critical.
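The idea of training at one centre and classifying scans from another can be sketched in two dimensions, as in Figure 1 of our original paper. The code below is a toy illustration under stated assumptions, not the implementation used in our study: the data are synthetic, the `offset` parameter is a hypothetical stand-in for scanner differences, and the linear SVM is fitted with a simple Pegasos-style stochastic subgradient method rather than a standard SVM library. All function names are ours.

```python
import random


def make_centre(n_per_group, offset, seed):
    """Synthetic 2-D 'scans' for one imaging centre (hypothetical data).

    Patients (+1) cluster around (2, 2) and controls (-1) around (-2, -2);
    `offset` shifts every sample to mimic a scanner-specific bias.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_group):
            X.append([rng.gauss(mean + offset, 1.0), rng.gauss(mean + offset, 1.0)])
            y.append(label)
    return X, y


def decision_value(w, x):
    """Signed score w . [x, 1]; the last weight plays the role of the bias."""
    return sum(wj * xj for wj, xj in zip(w, x + [1.0]))


def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style subgradient descent on the regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            violated = y[i] * decision_value(w, X[i]) < 1.0
            # Shrink the weights (regularization step)
            w = [(1.0 - eta * lam) * wj for wj in w]
            if violated:
                # Sample lies inside the margin: move the boundary towards it
                xi = X[i] + [1.0]
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w


def accuracy(w, X, y):
    """Fraction of samples falling on the correct side of the boundary."""
    correct = sum(1 for xi, yi in zip(X, y) if (decision_value(w, xi) >= 0) == (yi == 1))
    return correct / len(X)


# Train on centre A, then classify unseen scans from centre B (shifted data)
X_a, y_a = make_centre(20, offset=0.0, seed=1)
X_b, y_b = make_centre(100, offset=0.5, seed=2)
w = train_linear_svm(X_a, y_a)
print(f"accuracy at centre A (training data): {accuracy(w, X_a, y_a):.2f}")
print(f"accuracy at centre B (unseen data):   {accuracy(w, X_b, y_b):.2f}")
```

In this toy setting the boundary learned at centre A still separates most of the shifted centre B samples. Whether the same holds for real scans depends on how large the between-centre differences are relative to the disease effect, which is the empirical question our study addressed.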
Footnotes
*Both authors contributed equally to this work.
Acknowledgements
Funding to pay the Open Access publication charges for this article was provided by The Wellcome Trust.
References
Kloppel S, Stonnington CM, Chu C, Draganski B, Scahill RI, Rohrer JD, et al. Automatic classification of MR scans in Alzheimer's disease. Brain (2008) 131:681–9.
BBC. Computers spot Alzheimer's fast. (2008) http://news.bbc.co.uk/2/hi/health/7258379.stm.
Newcombe RG. Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med (1998) 17:857–72.