Brain, Vol. 123, No. 5, 1027-1040,
May 2000
© 2000 Oxford University Press
Kurtzke scales revisited: the application of psychometric methods to clinical intuition
Neurological Outcome Measures Unit, Institute of Neurology, London, UK
Correspondence to:
Dr Jeremy Hobart, Neurological Outcome Measures Unit, Institute of Neurology, Queen Square, London WC1N 3BG, UK E-mail: J.Hobart{at}ion.ucl.ac.uk
| Abstract |
|---|
When developing his disability scales for multiple sclerosis, Kurtzke demonstrated perception and insight. However, 45 years later, the evaluation of his clinically derived scales remains limited, particularly for more disabled patients. Indeed, many of Kurtzke's assumptions underpinning the development of the Expanded Disability Status Scale (EDSS) and Functional Systems (FS) are untested. This study aims to build on previous work and provide a more detailed examination using psychometric methods of the EDSS and FS. There are three study objectives: (i) to examine comprehensively the psychometric properties of the EDSS in more disabled people with multiple sclerosis undergoing in-patient rehabilitation; (ii) to examine the reliability of the FS and test Kurtzke's assumptions that they measure different aspects of the neurological examination and measure different constructs from that measured by the EDSS; and (iii) to examine whether the FS can be summed to generate a summary score. The EDSS was examined for its acceptability (score distributions), reliability (inter- and intra-rater reproducibility, standard error of measurement), validity (convergent and discriminant validity, measurement precision, discrimination between individuals) and responsiveness (effect size). The FS were examined for their reliability (inter- and intra-rater reproducibility), intercorrelations, correlations with the EDSS and the extent to which they satisfy Likert's criteria as a summed rating scale. In this more disabled sample of people with multiple sclerosis, the EDSS is an acceptable measure but demonstrates limited variability. Inter-rater reproducibility (intraclass correlation coefficient; ICC = 0.78) is adequate for group comparison studies, but intra-rater reproducibility is variable (ICC = 0.62–0.94). Convergent and discriminant validity for the EDSS is supported, but its measurement precision relative to the Functional Independence Measure is limited (56%). Also, the EDSS has a limited ability to distinguish between individuals in terms of their disability and its responsiveness is poor (effect size = 0.10). Results indicate that the FS measure constructs distinct from each other (intercorrelations = −0.23 to +0.52) and from the EDSS (correlations = −0.10 to +0.59). Intra-rater, but not inter-rater, reproducibility is adequate for group comparison studies. The FS do not satisfy criteria as an eight-, seven- or six-item summed rating scale. Despite being based on sound clinical intuition, the lack of psychometric input into the development of the EDSS and FS has limited their usefulness as evaluative outcome measures in multiple sclerosis.
Keywords: Kurtzke scales; psychometric evaluation; health measurement; multiple sclerosis; health status
Abbreviations: BI = Barthel Index; CI = confidence interval; DSS = Disability Status Scale; EDSS = Kurtzke Expanded Disability Status Scale; FIM = Functional Independence Measure; FS = Kurtzke Functional Systems; GHQ = General Health Questionnaire; ICC = intraclass correlation coefficient; LHS = London Handicap Scale; MSFC = Multiple Sclerosis Functional Composite; SEM = standard error of measurement; SF-36 = Medical Outcomes Study Short-Form 36-Item Health Survey; SF-36 PCS = SF-36 Physical Component Summary Score; SF-36 MCS = SF-36 Mental Component Summary Score; SHO = senior house officer.
| Introduction |
|---|
When developing his disability scales, Kurtzke demonstrated perception and insight. He recognized that the effectiveness of therapeutic interventions could be determined only if disease severity could be quantified accurately (Kurtzke, 1961).
There are three Kurtzke scales: (i) the Disability Status Scale (DSS) (Kurtzke, 1955), an 11-point scale (0 = normal neurological examination; 10 = death due to multiple sclerosis) measuring overall disability, developed to evaluate the effectiveness of isoniazid as a treatment for multiple sclerosis (Kurtzke and Berlin, 1954); (ii) the 20-point Expanded Disability Status Scale (EDSS) (Kurtzke, 1983), a modification of the DSS that has replaced it; and (iii) the Functional Groups [more commonly known as Functional Systems (FS)], eight scales representing different functions of the CNS (Kurtzke, 1961). Each system is rated on a five-point (three systems) or six-point (four systems) response scale except 'Other Functions', which is rated dichotomously (0 = none, 1 = any other neurological findings attributed to multiple sclerosis).
Although the DSS (1954) was published 7 years before the FS (1961), Kurtzke's writings indicate that these two instruments are intimately related. The DSS was designed to represent the sum of a person's neurological dysfunction (Kurtzke and Berlin, 1954), and measure their maximal function given the neurological deficits (Kurtzke, 1955). It was devised to depict, in the fewest number of steps (Kurtzke, 1961), the progression of multiple sclerosis as it usually occurs, with an emphasis on ambulation (Kurtzke and Berlin, 1954), but with the provision of grades for less involved individuals (Kurtzke, 1961). In contrast, the FS were intended to delineate the type and severity of eight neurological impairments (Kurtzke, 1961). Together, the DSS and FS were intended to form a complementary measurement method for quantifying the objective and verifiable deficits due to multiple sclerosis, as elicited from the neurological examination, in a reasonable and reproducible manner (Kurtzke, 1961). Moreover, the FS were designed to provide two useful checks on the DSS. For less disabled patients, the DSS score should be greater than or equal to the highest score in any individual FS. No change in DSS number should occur without a change in one or more FS (Kurtzke, 1961).
The DSS was developed to enable comparisons of disability within and between patients. Kurtzke argued that an overall index of disability was required for studies in multiple sclerosis as the eight FS could not be summed to indicate the total disorder (Kurtzke, 1961). He cited five reasons in support of this argument. First, the eight FS are independent. Secondly, they are not equivalent to each other in representing the amount of disease present. Thirdly, clinical signs may follow different courses allowing fluctuations to be obscured by an unchanging total. Fourthly, a total score can improve while the patient worsens (e.g. when increasing weakness obscures cerebellar signs). Finally, Kurtzke believed it was not appropriate to sum scores from ordinal scales (Kurtzke, 1983).
Users of the DSS argued that it was too insensitive in its middle range (especially step 7) and too limited for studies of chronic multiple sclerosis. These criticisms prompted Kurtzke to develop the EDSS by dividing each DSS step (except the first) in half. He reasoned that the Gaussian distribution of DSS scores in his study samples provided empirical evidence that no step was discrepant (Kurtzke, 1983). Although the EDSS has also attracted criticism (Willoughby and Paty, 1988; Noseworthy et al., 1990; Goodkin et al., 1992; Sharrack and Hughes, 1996), it is currently the most widely used disability measure in clinical trials of multiple sclerosis (The IFNB Multiple Sclerosis Study Group, 1993; Jacobs et al., 1996).
As well as advocating disability measurement, Kurtzke recognized that rating scales must fulfil clinical and scientific criteria. He noted they should be user-friendly (Kurtzke, 1955), applicable to all patients (Kurtzke and Berlin, 1954; Kurtzke, 1961), and reproducible (Kurtzke, 1955). He stated that the sum total of any patient's disabilities should fit them into a suitable category on a scale (Kurtzke, 1955) and that any change in disability should be reflected in a change of status on that scale (Kurtzke, 1955). In fact, he stated that there is only one good reason for quantitative schemes in multiple sclerosis: to document change (Kurtzke, 1965). In short, Kurtzke appreciated that scales should be clinically useful (capable of being incorporated into routine clinical practice) and scientifically sound (acceptable, reliable, valid and responsive).
Despite his insight, Kurtzke did not examine the measurement properties of his scales as the expertise necessary was not readily accessible. While methods for developing and evaluating measures of psychological constructs (psychometric methods) were already established in the social sciences (Guilford, 1936), these methods were only applied to medicine in the 1970s (Brook et al., 1979). Textbooks on health measurement did not appear until the late 1980s (McDowell and Newell, 1987; Streiner and Norman, 1989), and formal guidelines for the development and evaluation of health measures have been proposed only recently (Scientific Advisory Committee of the Medical Outcomes Trust, 1995; McDowell and Jenkinson, 1996; Fitzpatrick et al., 1998).
In the last decade, increasing awareness of psychometric methods has resulted in a number of studies examining the measurement properties of the EDSS and the FS (e.g. Amato et al., 1987, 1988; Noseworthy et al., 1990; Francis et al., 1991; Verdier-Taillefer et al., 1991; Goodkin et al., 1992; Marolf et al., 1996; Sharrack et al., 1996b, 1999). Most of these studies have concentrated on reliability and not examined validity or responsiveness. This approach is limited because psychometric properties are sample-dependent rather than instrument-dependent (Cronbach, 1949), reliability is a necessary but not sufficient condition for validity (Nunnally, 1978), and adequate reliability and validity do not guarantee responsiveness (Guyatt et al., 1987). Consequently, all relevant measurement properties must be examined comprehensively in patient samples. In a recent article, Sharrack and colleagues (Sharrack et al., 1999) published the first study to examine the reliability, validity and responsiveness of the EDSS, and the reliability of the FS, in the same sample; it provides a good base for further, more detailed psychometric evaluations of these instruments.
In this article, the psychometric properties of the EDSS are examined comprehensively. In contrast to other studies: the sample comprises more disabled patients undergoing rehabilitation; acceptability is examined; reliability is determined in the clinical context in which the instrument is used routinely; the validation strategy specifically addresses the purposes for which the EDSS was designed; and responsiveness is determined prospectively. The reliability of the FS is examined, and Kurtzke's assumptions that the FS measure eight different aspects of the neurological examination and address different constructs from that measured by the EDSS are tested. Finally, the issue of whether the FS satisfy criteria as a summed rating scale is addressed.
| Method |
|---|
Samples
Patients were recruited with clinically definite multiple sclerosis (Poser et al., 1983).
Outcome measures
The following outcome measures were administered on admission and discharge in strict accordance with the developer's guidelines.
Barthel Index (BI) (Mahoney and Barthel, 1965)
The BI measures disability as independence in 10 activities of daily living. Items are rated from behavioural observation on a two-point (two items), three-point (six items) or four-point (two items) response scale and summed to generate a total score (low scores indicate greater disability). Wade's version of the BI was used (Collin et al., 1988); it has been shown to be reliable, valid and responsive (Wade, 1992; van Bennekom et al., 1996).
Functional Independence Measure (FIM) (Granger et al., 1986)
The FIM measures disability as the assistance required to perform 18 tasks. Items are rated on a seven-point scale and summed to generate a total score (low scores indicate greater disability). The FIM has been shown to be reliable, valid and responsive in people with multiple sclerosis (Granger et al., 1990; Brosseau, 1994).
London Handicap Scale (LHS) (Harwood and Ebrahim, 1995)
The LHS is a six-item self-report measure of handicap. Items are rated on a six-point weighted scale and summed to generate a total score (high scores indicate greater handicap). The LHS is reliable, valid and responsive.
Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36) (Ware, 1993)
The SF-36 measures self-reported health status in eight dimensions or two summary scores [physical component summary score (PCS) and mental component summary score (MCS)] (Ware et al., 1994). Low scores indicate worse health. The reliability and validity of the SF-36 are the subject of numerous studies. In this study, summary scores are reported.
General Health Questionnaire (GHQ) (Goldberg and Hillier, 1979)
The GHQ is a 28-item self-report measure of psychological distress. Items are rated on a two-point scale and summed to generate a total score (high scores indicate greater distress). Evidence supports its reliability and validity (McDowell and Newell, 1987).
Staff-rated transition question (Fitzpatrick et al., 1993)
The transition question used in this study is a four-point staff rating of change in disability at discharge (1 = no change; 4 = marked improvement).
EDSS
Four psychometric properties were studied: acceptability, reliability, validity and responsiveness.
Acceptability
This concerns the extent to which the spectrum of health status measured by an instrument is representative of the distribution of health status in the sample (Ware et al., 1978). It is determined by examining the score distributions. An instrument is considered acceptable when: scores demonstrate good variability and span the full scale range (Ware et al., 1980; Stewart and Ware, 1992); mean scores are situated near the scale mid-point (Eisen et al., 1979); floor and ceiling effects, calculated as the percentage of responses for the minimum and maximum scores, respectively, do not exceed 15% (McHorney and Tarlov, 1995); and score distributions are not skewed excessively (McHorney et al., 1994), with skewness statistics in the range −1 to +1 (Holmes et al., 1996). In this study, the relative acceptability of the EDSS, BI and FIM is compared.
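The acceptability criteria above can be checked mechanically. The following is a minimal sketch, not the authors' procedure: the function name `acceptability`, the moment-based skewness estimator and the demonstration scores are all assumptions introduced for illustration.

```python
from statistics import mean, pstdev

def acceptability(scores, scale_min, scale_max):
    """Summarise the score-distribution criteria for one instrument.

    Assumption: skewness is the simple moment-based (population) estimator;
    statistical packages may apply a small-sample correction.
    """
    n = len(scores)
    m = mean(scores)
    sd = pstdev(scores)
    skew = sum((x - m) ** 3 for x in scores) / (n * sd ** 3) if sd > 0 else 0.0
    # Percentage of patients scoring at the scale minimum and maximum.
    min_pct = 100.0 * sum(x == scale_min for x in scores) / n
    max_pct = 100.0 * sum(x == scale_max for x in scores) / n
    return {
        "mean": m,
        "scale_midpoint": (scale_min + scale_max) / 2,
        "observed_range": (min(scores), max(scores)),
        "pct_at_minimum": min_pct,
        "pct_at_maximum": max_pct,
        "skewness": skew,
        "floor_ceiling_ok": min_pct <= 15 and max_pct <= 15,
        "skewness_ok": -1 <= skew <= 1,
    }

# Hypothetical scores on a 0-20 scale (not study data).
demo = acceptability([0, 5, 10, 10, 15, 20, 20, 20], 0, 20)
```

Whether the scale minimum corresponds to the floor or the ceiling depends on the scoring direction of the instrument, so the sketch reports both ends neutrally.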
Reliability
The scores generated by a measurement instrument represent the sum of two components: true score and random error (Guilford, 1936). Reliability refers to the extent to which an instrument is free from random error (Nunnally, 1978). As reliability increases (or decreases), scores are more (or less) consistent and, therefore, measured variance reflects true variance (or random error) in the construct (Stewart and Ware, 1992). In keeping with this definition, reliability coefficients are estimates of the proportion of total score variance that is due to true score variance.
Reliability studies aim to quantify the main sources of random error associated with an instrument (Stanley, 1971). Variability within (intra-rater) and between (inter-rater) observers are the major sources of random error associated with observer-rated instruments such as the EDSS and FS (Shumaker et al., 1997). The most appropriate method of estimating reliability of single item measures such as the EDSS and individual FS is the test–retest reproducibility method (Stewart and Ware, 1992), which examines the agreement between paired ratings of patients and attributes measured variance to random error (Nunnally, 1978).
Inter-rater reproducibility was estimated by calculating the agreement between independent ratings of patients generated within 48 h of admission by J.H. and the neurology senior house officer (SHO). Intra-rater reproducibility was estimated for both raters on the same subsample of patients by calculating the agreement between paired ratings generated 48–72 h apart. All reproducibility results are reported as intraclass correlation coefficients (ICC; Bartko, 1966), for which the estimates of variance were obtained from a repeated measures analysis of variance table using a fixed effects model (Shrout and Fleiss, 1979). Minimum recommended standards for reproducibility are 0.70 for group comparisons and 0.90–0.95 for individual comparisons (Nunnally, 1978).
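As an illustration of how an ICC can be derived from a repeated measures analysis of variance table, the sketch below computes the single-rating, fixed-rater form that Shrout and Fleiss label ICC(3,1). The text does not state which ICC form was used, so this choice, the function name `icc_fixed_raters` and the example ratings are assumptions.

```python
def icc_fixed_raters(ratings):
    """ICC for a patients x raters table, two-way model with fixed raters.

    `ratings` is a list of rows, one per patient, each holding k ratings.
    Computes (BMS - EMS) / (BMS + (k - 1) * EMS), where BMS is the
    between-patients mean square and EMS the residual mean square.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares as in a two-way ANOVA table.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_patients = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_patients - ss_raters

    bms = ss_patients / (n - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)

# Hypothetical paired ratings: the second rater scores every patient
# exactly one point higher, so residual error is zero.
offset_example = icc_fixed_raters([[1, 2], [2, 3], [3, 4]])
```

Because this form treats raters as fixed, a constant offset between raters does not lower the coefficient. An agreement-type ICC would penalise it.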
As the outcomes of some studies of multiple sclerosis are based on individual EDSS score changes, reproducibility estimates are also reported as standard errors of measurement (SEM). These estimate the standard deviation (SD) of the scores that would be obtained if an instrument were administered to the same individual multiple times. Consequently, they are used to determine confidence intervals (CI) around individual patient scores (Guilford, 1954), reflect an instrument's accuracy for individual patient assessment and clinical decision-making (Williams and Naylor, 1992), and gauge the likelihood that an individual patient's change in score is attributable to true change (McHorney and Tarlov, 1995). The following formulae are used (Anastasi and Urbina, 1997):
[Formulae not reproduced: the equation image is missing from this copy.]
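For orientation, the standard textbook forms of the SEM and its confidence interval are given here. They are an assumption about what the missing formulae contain, not a transcription of them. The symbols $X$ (observed score) and $r$ (reliability coefficient, here the ICC) are introduced for illustration.

$$
\mathrm{SEM} = \mathrm{SD}\,\sqrt{1 - r}, \qquad 95\%\ \mathrm{CI} = X \pm 1.96 \times \mathrm{SEM}
$$

As a worked example with a purely hypothetical SD of 1.5 and the inter-rater ICC of 0.78 reported in the abstract, $\mathrm{SEM} = 1.5 \times \sqrt{0.22} \approx 0.70$, giving a 95% CI of roughly $\pm 1.4$ points around an individual score.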
Validity
This is defined as the extent to which an instrument measures the construct it purports to measure (American Educational Research Association, National Council on Measurement used in Education, 1955). There are many methods of gathering evidence for the validity of a measure, but construct validation is used when no criterion (gold standard) or content domain is accepted as entirely adequate to define the attribute being measured. Construct validity is defined as the extent to which empirical data support hypotheses concerning the construct the instrument is purported to measure (Cronbach and Meehl, 1955). As there are many types of evidence under the rubric of construct validity, and validity is not a fixed measurement property, a strategy is required that examines the validity of an instrument for the specific purposes and specific setting in which it is being used (McDowell and Jenkinson, 1996). In accordance with Kurtzke's recommendation (Kurtzke, 1961), this study examines the validity of the EDSS as an overall measure of disability, and evaluates its ability to detect differences between groups and individuals with multiple sclerosis on account of their disability.
The extent to which the EDSS is a measure of disability, and not a measure of related health constructs, is determined by examining convergent and discriminant validity (Cronbach and Meehl, 1955). In this analysis, correlations are examined between the EDSS and other measures and variables. The extent to which their direction, magnitude and pattern conform with a priori hypotheses indicates the strength of this validity evidence. If the EDSS is a measure of overall disability, four findings are predicted.
- (i) Correlations with disability measures (BI and FIM) should be high (r > 0.80).
- (ii) Correlations with measures of handicap (LHS), mental health status (SF-36 MCS) and psychological distress (GHQ) should be low (r < 0.30). Correlations between the EDSS and the SF-36 PCS are also predicted to be low as the SF-36 PCS scale measures physical role limitations, pain and general health perceptions as well as physical functioning.
- (iii) Correlations between the EDSS and the LHS and SF-36 PCS should exceed correlations between the EDSS and the SF-36 MCS and GHQ. This is because handicap and physical health status are more closely related conceptually to disability than mental health status and psychological distress.
- (iv) The EDSS should be uncorrelated with age (r < 0.10) as disability in this sample of multiple sclerosis patients is not expected to be biased by this variable.
The extent to which the EDSS is capable of discriminating between groups of multiple sclerosis patients on the basis of their disability is determined by comparing its measurement precision, relative to the FIM and BI, for detecting change in disability due to rehabilitation. On discharge from the rehabilitation unit, the treating therapists rated each patient's change in disability as either none, minimal, moderate or marked. Two groups were formed: none or minimal change, and moderate or marked change. The measurement precision of the EDSS is quantified as the degree to which it separates these two groups (the difference between their mean scores) relative to the variance within the groups. F-statistics, derived from a one-way analysis of variance, take both of these attributes into account as they indicate the ratio of between-groups (systematic) variance to within-group (error) variance (McHorney et al., 1993). The higher the F-statistic, the greater the measurement precision. By comparing the EDSS, BI and FIM in the same sample, relative measurement precision is estimated as the ratio of pairwise F-statistics (F for one measure divided by F for another) and indicates, as a percentage, how much more (or less) precise one measure is compared with another at detecting group differences (McHorney et al., 1992). In this study, the instrument with the largest F-statistic is chosen as the arbitrary standard and assigned a relative measurement precision of 1.
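The relative measurement precision calculation described above can be sketched as follows. The function names `f_statistic` and `relative_precision`, and all numbers in the example, are hypothetical. The sketch assumes each measure's scores (or change scores) have already been split into the two transition-rating groups.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F: between-groups mean square / within-group mean square."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def relative_precision(f_by_measure):
    """Express each F as a fraction of the largest F (the arbitrary standard)."""
    best = max(f_by_measure.values())
    return {name: f / best for name, f in f_by_measure.items()}

# Hypothetical scores for two measures, each split into the same two groups
# (none/minimal change vs moderate/marked change). Not study data.
f_values = {
    "measure_a": f_statistic([[1, 2, 3], [4, 5, 6]]),
    "measure_b": f_statistic([[1, 3, 5], [3, 5, 7]]),
}
rmp = relative_precision(f_values)
```

The measure with the largest F receives a value of 1, and every other measure is expressed as a proportion of it, which can be read as a percentage.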
The extent to which the EDSS is able to discriminate between individuals on account of their disability is determined by examining the variability of BI and FIM scores for patients scoring at each EDSS level. The ability of the EDSS to discriminate between individuals is inversely related to the variability of BI and FIM scores.
Responsiveness
This is defined as the ability of an instrument to detect clinically significant change in the construct measured, even if that change is small (Guyatt et al., 1987). It is often determined by comparing scores before and after an intervention expected to alter the quantity being measured, and calculating an effect size (standardized change score) (Scientific Advisory Committee of the Medical Outcomes Trust, 1995). There are many effect size calculations. In this study, it is defined as the mean change score (admission minus discharge EDSS scores) divided by the standard deviation of the admission scores (Kazis et al., 1989): larger effect sizes indicate greater responsiveness. However, effect sizes provide limited information about the responsiveness of measures as they reflect the magnitude of the change induced by the intervention, as well as the ability of the instrument to detect change (Norman et al., 1997). Consequently, effect sizes for the EDSS, BI and FIM are compared in order to determine the relative ability of these instruments to detect change in patients undergoing rehabilitation.
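The effect size defined above is straightforward to compute. This sketch follows the definition in the text. The use of the sample (n − 1) standard deviation is an assumption, and the function name `effect_size` and the scores are hypothetical.

```python
from statistics import mean, stdev

def effect_size(admission, discharge):
    """Mean change (admission minus discharge) divided by the admission SD.

    Assumption: `stdev` is the sample (n - 1) standard deviation; the text
    does not say which estimator was used.
    """
    changes = [a - d for a, d in zip(admission, discharge)]
    return mean(changes) / stdev(admission)

# Hypothetical paired EDSS scores for three patients (not study data).
es = effect_size([6.0, 7.0, 8.0], [5.5, 7.0, 7.5])
```

For measures scored in the opposite direction (low scores indicating greater disability), the sign of the change score would need to be reversed before comparing effect sizes across instruments.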
FS
Reliability
The inter- and intra-rater reproducibility of the FS was examined using the same method described above for the EDSS.
Relationships among the FS and between the FS and EDSS
Relationships among the FS are determined by examining their intercorrelations. The magnitude of the correlations indicates the strength of the relationships. Kurtzke's assumption that the FS measure independent aspects of the neurological examination is supported when their intercorrelations are positive and low to moderate (<0.60). Relationships between the FS and EDSS are determined by examining their intercorrelations. Kurtzke's assumption that the EDSS measures a different construct from each of the individual FS is supported when correlations are positive and low to moderate (<0.60). It is predicted that correlations between the EDSS and FS should exceed intercorrelations among the FS as the EDSS is purported to represent the sum of the disabilities defined by patients' neurological deficits.
Do the FS constitute a summed rating scale?
Kurtzke hypothesized that the FS could not be summed to indicate the total disorder (Kurtzke, 1961). However, it is appropriate to examine this untested assumption because the FS were developed to measure different aspects of the same underlying construct (the neurological examination), and summed rating scales have superior measurement properties to single item measures (Nunnally, 1978). Likert demonstrated that items with ordinal response categories can be summed to generate total scores when they: measure the same underlying construct; measure at similar points on a scale; have similar variances; and contribute equal proportions of information to the total score (Likert, 1932). These four criteria are satisfied when items are internally consistent (inter-related) and have equivalent mean scores, variances and corrected item–total correlations. A group of items is internally consistent when: the mean item intercorrelation exceeds 0.30 (Eisen et al., 1979); correlations between each item and the total score computed from the sum of the remaining items (corrected item–total correlation; Howard and Forehand, 1962) exceed 0.40 (McHorney et al., 1994); and when the Cronbach's alpha coefficient (Cronbach, 1951) exceeds 0.70 (Nunnally, 1978; Scientific Advisory Committee of the Medical Outcomes Trust, 1995). When item–total correlations exceed 0.30, the criteria of equivalent item means, variances and item–total correlations can be considered satisfied, even if they vary (Ware et al., 1997).
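The three internal consistency criteria above can be illustrated with a short sketch. The helper names, the use of Pearson correlations and sample variances, and the example item scores are assumptions made for illustration. The text does not specify the correlation coefficient used.

```python
from math import sqrt
from statistics import mean, variance

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_inter_item_correlation(items):
    """Average correlation over all distinct pairs of items."""
    k = len(items)
    return mean(pearson(items[i], items[j])
                for i in range(k) for j in range(i + 1, k))

def corrected_item_total(items):
    """Correlate each item with the sum of the remaining items."""
    result = []
    for i, item in enumerate(items):
        others = [col for j, col in enumerate(items) if j != i]
        rest_total = [sum(vals) for vals in zip(*others)]
        result.append(pearson(item, rest_total))
    return result

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(c) for c in items) / variance(totals))

def scaling_summary(items):
    """Apply the thresholds quoted in the text (0.30, 0.40, 0.70)."""
    r_bar = mean_inter_item_correlation(items)
    item_total = corrected_item_total(items)
    alpha = cronbach_alpha(items)
    return {
        "mean_inter_item_r": r_bar,
        "corrected_item_total": item_total,
        "alpha": alpha,
        "inter_item_ok": r_bar > 0.30,
        "item_total_ok": all(r > 0.40 for r in item_total),
        "alpha_ok": alpha > 0.70,
    }

# Hypothetical scores: three items (one list per item), four patients.
items = [[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4]]
summary = scaling_summary(items)
```

Each inner list holds one item's scores across patients. Dropping an item from `items` and recomputing `cronbach_alpha` gives the "alpha with item deleted" statistic referred to in the Results.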
The extent to which the FS satisfy scaling criteria as eight-item, seven-item and six-item summed rating scales is examined. First, all eight FS are examined. Next, because 'Other Functions' is scored dichotomously rather than on a multi-point response scale and does not measure a specific aspect of the neurological examination, scaling criteria are examined for the remaining seven FS. Finally, 'Mental Functions' is also removed, as it is rarely reported in studies of the FS, and scaling criteria are examined for the remaining six items.
Data collection
Consecutive admissions were enrolled between April 1, 1994 and April 1, 1996. Inter-rater reproducibility of the EDSS and FS was determined for all patients, together with intra-rater reproducibility on consecutive admissions between May 1995 and February 1996. BI, FIM, LHS, SF-36, GHQ and staff-rated transition question scores were collected on patients randomly selected to participate in a larger study evaluating the measurement properties of disability measures. Admission and discharge EDSS and FS scores were collected independently by J.H. and the neurology SHO at the Neurorehabilitation Unit. BI and FIM ratings were undertaken by the treating multi-disciplinary team for each patient. All raters received training in scoring Kurtzke scales and the FIM.
| Results |
|---|
Samples
Of the 137 patients admitted with clinically definite multiple sclerosis during the study period, none declined to participate. Inter-rater reproducibility was examined in 125 patients (91%) and intra-rater reproducibility in 40. Sixty-four patients participated in the evaluation of the measurement properties of disability measures. Table 1
EDSS
Acceptability
Table 2
Reliability
Eleven raters were used, i.e. J.H. and 10 consecutive SHOs. Table 3
Validity
Convergent and discriminant validity (Table 5)
The EDSS correlates highly with the BI and the FIM, and poorly with the LHS, SF-36 PCS, SF-36 MCS, GHQ and age. The direction, pattern and magnitude of correlations are consistent with a priori hypotheses. These findings provide evidence for convergent and discriminant validity and indicate that the EDSS measures overall disability, and discriminates disability from related health constructs.
Measurement precision (Table 6)
Transition question ratings are available for 59 of the 64 patients (92%). As the FIM has the largest F-statistic, it is nominated as the arbitrary standard and assigned a measurement precision of 1. The relative measurement precision of the BI (74%) and EDSS (56%) are notably less, indicating that the EDSS is least able to discriminate between patients in terms of disability.
Discrimination between individuals.
Table 2
Responsiveness
Table 8
FS
Reliability
Table 3
Intercorrelations among the FS
Table 9 demonstrates that intercorrelations among the eight FS range from −0.23 to +0.52, indicating that all relationships are weak or moderate. The majority of correlations (89%) are ≤0.37, indicating weak relationships. These findings support Kurtzke's hypothesis that these eight scales measure different constructs. The correlation between Pyramidal and Cerebellar Functions is −0.23, indicating a weak negative relationship.
Correlations between the FS and EDSS
Table 9
The FS as a summed rating scale
Table 10 demonstrates that the FS fail to satisfy criteria as eight-, seven- or six-item summed rating scales because mean item intercorrelations are <0.30 and alpha coefficients are <0.70. Furthermore, the alpha coefficients do not increase substantially when each FS is omitted (alpha with item deleted), indicating that none of them is compromising the internal consistency of the scales. These findings support Kurtzke's assumption that the FS cannot be summed to generate a total score.
| Discussion |
|---|
It is increasingly recognized that the evaluation of therapeutic interventions for multiple sclerosis should include measures of patient-oriented outcomes such as disability. If these outcomes are to influence health policy, neurologists must be confident that the data are rigorous. Psychometric methods enable clinicians to develop scientifically sound measures and evaluate the measurement properties of instruments, such as the Kurtzke scales, that were developed before these techniques were available. To date, the psychometric evaluation of the EDSS and FS has been limited.
The findings of this study serve to highlight both the strengths and weaknesses of the EDSS and FS in disability measurement. The EDSS addresses a broader spectrum of disability than other measures, has inter-rater reproducibility adequate for group comparison studies, measures overall disability and discriminates disability from other health constructs. However, its intra-rater reproducibility is variable and, compared with other disability measures, the EDSS has a limited ability to distinguish between individuals or groups on the basis of disability and has poor responsiveness. The FS measure eight distinct constructs that are different from the construct measured by the EDSS. Although the intra-rater reproducibility of most of the FS satisfies criteria for group comparison studies, their inter-rater reproducibility does not. They do not constitute an eight-, seven- or six-item summed rating scale.
Whilst it is common for studies to document a measure of central tendency for scores, it is uncommon for acceptability data to be reported comprehensively. However, these descriptive statistics indicate the probable relevance of an instrument to a clinical setting before a full psychometric evaluation is undertaken. For example, the ceiling effect of admission scores represents a subsample of patients with no potential to improve their score regardless of any change induced by the intervention. Consequently, ceiling effects adversely influence responsiveness. Furthermore, acceptability data aid the interpretation of other psychometric analyses. For example, correlations between measures are affected by the distribution of their data (Nunnally, 1975). When the score distributions for two measures are different (e.g. one skewed, the other distributed more normally), their correlation will be attenuated. As the selection criteria for studies of multiple sclerosis inevitably limit the spectrum of disease severity included, acceptability data are particularly relevant.
Previous reliability studies of the EDSS and FS have generated variable results. Comparison of these studies is limited as they have used different statistics, methods and samples. Results have been reported as ICCs (Goodkin et al., 1992; Sharrack et al., 1999), Kappa (K) statistics (Amato et al., 1988; Noseworthy et al., 1990; Francis et al., 1991; Sharrack et al., 1996b, 1999) and weighted Kappa (Kw) statistics (Amato et al., 1987). Sharrack and colleagues demonstrate that ICCs and K generate different results (Sharrack et al., 1999). Even studies using recommended methods (ICC or Kw; Scientific Advisory Committee of the Medical Outcomes Trust, 1995; Streiner and Norman, 1995; McDowell and Jenkinson, 1996) report variable results. This finding may indicate that EDSS reliability is not stable across samples with different disability levels (e.g. the EDSS range 1.0–3.5 was studied by Goodkin et al., 1992; the EDSS ranges 0–9.5 and 0–7.5 were studied by Sharrack et al., 1999). Alternatively, inconsistent reliability results may reflect the different numbers of raters used, their level of training or expertise, or the sample size. Whilst there are no absolute guidelines for these variables, it is essential that the method used to determine reliability, and the sample in which it is studied, are representative of those in which the instrument is to be used.
SEMs have not been reported before for Kurtzke scales. Instead, most studies report percentage agreement within given EDSS and FS ranges (e.g. Amato et al., 1987; Francis et al., 1991; Goodkin et al., 1992). Although this method seems intuitively sensible, questions have been raised about its usefulness as an index of reliability because it does not correct for chance agreement (Cohen, 1960). SEMs have the additional advantage of generating confidence intervals (CI), which are more familiar to clinicians than reliability coefficients (Streiner and Norman, 1995).
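The difference between raw percentage agreement and a chance-corrected index can be illustrated with a short sketch. The function names and the ratings below are hypothetical and are not data from this study; the calculation is the standard unweighted kappa described by Cohen (1960), not the weighted form recommended for ordinal scales.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of patients given the same grade by both raters."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohen_kappa(rater_a, rater_b):
    """Unweighted kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: probability that both raters assign the same grade
    # if each rated independently according to their own grade frequencies.
    p_chance = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    # Undefined (division by zero) when chance agreement is already perfect.
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical grades from two raters for four patients (illustration only).
rater_a = [1, 1, 2, 2]
rater_b = [1, 2, 2, 2]
print(percent_agreement(rater_a, rater_b))  # 0.75
print(cohen_kappa(rater_a, rater_b))        # 0.5
```

In this made-up example the raters agree on 75% of patients, yet half of that agreement would be expected by chance alone, so kappa is only 0.5.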
Relatively little attention has focused on the validity of the EDSS. Previous studies report high correlations with the FIM (Brosseau, 1994; Marolf et al., 1996; Sharrack et al., 1999) and BI (Sharrack et al., 1999), supporting our evidence for the convergent validity of the EDSS. Only Sharrack and colleagues examined convergent and discriminant validity together (Sharrack et al., 1999). The importance of this approach to validity examination is that a measurement instrument is far more valuable if it measures only the intended concept, i.e. if it is both specific and sensitive (McDowell and Jenkinson, 1996). In contrast to our findings, Sharrack and colleagues report that the EDSS correlates similarly with the BI (0.74) and LHS (0.69) (Sharrack et al., 1999). They also report a similar correlation between the EDSS and EuroQoL quality of life thermometer (0.69). The implication of these findings is that, in their less disabled sample, the EDSS has a limited ability to discriminate disability from handicap or quality of life and that measurement of these conceptually distinct health constructs is confounded.
There is consensus that instruments used to evaluate therapeutic interventions should be able to detect change (Fitzpatrick et al., 1998). The importance of responsiveness lies in the trade-off between sample size and statistical power: instruments with greater responsiveness have higher power for a fixed sample size, or require fewer patients to achieve a fixed level of statistical power (Liang et al., 1985). Despite this fact, the responsiveness of the EDSS, evaluated using recommended methods, is not well documented. Nevertheless, results of previous studies allude to its limited responsiveness. Marolf and colleagues studied multiple sclerosis patients before and after in-patient rehabilitation, and demonstrated that 95% of patients had unchanged EDSS scores, whereas 68% had unchanged FIM scores (Marolf et al., 1996). Kurtzke reports that 50% of patients 'whose exacerbation leading to admission had persisted for 2 years or less' (Kurtzke, 1955), and 61% of patients admitted for 'an early bout of' multiple sclerosis (Kurtzke, 1983) had unchanged DSS scores at discharge (length of stay and details of treatment not reported). Finally, in the study of isoniazid (Veterans Administration Multiple Sclerosis Study Group, 1957), 53% of patients categorized as having active disease had the same DSS scores 9 months later.
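The trade-off between responsiveness and sample size can be made concrete with a rough sketch. It assumes a simple two-group comparison of means analysed with a normal approximation, which is an illustrative choice and not the design of any study cited here; the function name and the effect sizes are hypothetical.

```python
import math
from statistics import NormalDist

def patients_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect a standardized
    difference (effect size) with a two-sided test, normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# A more responsive instrument registers the same true change as a larger
# standardized effect, and so needs fewer patients (hypothetical values).
for d in (0.2, 0.5, 0.8):
    print(d, patients_per_group(d))
```

Because the required number of patients scales with the inverse square of the effect size, halving the standardized effect an instrument can register roughly quadruples the sample needed for the same power.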
Only Sharrack and colleagues report the responsiveness of the EDSS as an effect size and compare this with other instruments (Sharrack et al., 1999). Although their results lead to a similar conclusion to this study, Sharrack and colleagues determine responsiveness retrospectively by calculating an effect size only for those patients deemed to have changed. In the present study, responsiveness is determined prospectively by comparing the EDSS scores for the whole sample pre- and post-rehabilitation. Recently, Norman and colleagues have compared prospective and retrospective methods of determining responsiveness. They provide empirical evidence that there is no consistent relationship between the two methods and that responsiveness calculated retrospectively does not indicate the ability of an instrument to detect change (Norman et al., 1997).
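A prospective effect size of the kind described above can be sketched as follows. The sketch assumes the common definition of effect size as mean change divided by the standard deviation of baseline scores; the exact formula used in this study is not restated in this section, and the scores below are invented for illustration.

```python
from statistics import mean, stdev

def effect_size(pre, post):
    """Mean change (post minus pre) divided by the SD of baseline scores,
    computed over the whole sample rather than a selected subgroup."""
    changes = [after - before for before, after in zip(pre, post)]
    return mean(changes) / stdev(pre)

# Hypothetical admission and discharge EDSS scores for four patients.
admission = [6.0, 6.5, 7.0, 8.0]
discharge = [6.0, 6.0, 7.0, 7.5]
print(effect_size(admission, discharge))
```

Including every patient, changed or not, is what distinguishes the prospective calculation from a retrospective one restricted to patients already judged to have changed.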
The measurement properties of the FS have been the subject of a number of studies. As with the EDSS, variable reliability results are reported. Two studies (Kurtzke, 1970; Weinshenker et al., 1996) document correlations for the FS. Kurtzke examined the relationships among six FS (Visual and Other Functions not reported), and between these FS and the DSS. Results are similar to ours, with one notable exception: a correlation between Pyramidal and Cerebellar Functions of +0.32 (−0.23 in the present study). Sample differences probably explain this discrepancy. Kurtzke studied less disabled patients (80% DSS < 6.0) in whom a weak positive correlation is predicted; we have studied more disabled patients in whom increasing weakness obscures cerebellar signs (Kurtzke, 1961), and a weak negative correlation is predicted. Weinshenker and colleagues (EDSS < 8.0) report three FS intercorrelations, all similar to ours: Brainstem with Cerebellar FS (0.33), Brainstem with Cerebral Functions (0.25) and Pyramidal Functions with the EDSS (0.54) (Weinshenker et al., 1996). The results of these two studies, along with our findings, support Kurtzke's hypotheses that the eight FS and DSS/EDSS measure different constructs, although these results alone do not tell us the nature of those constructs.
The extent of the similarities between these three studies is perhaps surprising for two reasons. First, it is widely believed that the EDSS is an impairment scale in its lower range and a disability scale in its upper range (Sharrack and Hughes, 1996). Secondly, low EDSS scores are fully defined by the FS grades, whereas high scores are more independent of them (Kurtzke, 1983). Whilst there is evidence that the validity of an instrument is not constant throughout the range of scores (Lee and Foley, 1986), the extent to which these assumptions influence the measurement properties of the EDSS and its relationships with the FS is an empirical question that has yet to be answered.
This study has a number of limitations in relation to the generalizability of the results and the reliability of the methodology. The measurement properties of the EDSS and FS have been examined in a sample of moderately and severely disabled multiple sclerosis patients, largely in the progressive phase of the disease, individually selected for in-patient rehabilitation. As this sample is not representative of all multiple sclerosis patients (natural history studies suggest that only 55% of multiple sclerosis patients have EDSS scores >5.0), our results should be generalized with caution to dissimilar samples. This is particularly relevant as results from different studies suggest that some of the measurement properties of the EDSS are not stable across samples.
Suboptimal reproducibility methods are used in this study as we have studied only the agreement between one rater and a group of 10 others. Are the results biased by the fact that one rater was constant throughout the study? Subgroup analyses suggest that this is not the case, although it must be noted that the sample sizes involved are small. Whilst J.H. generated EDSS ratings higher than the SHOs in seven of 10 comparisons, only two were statistically significant (P < 0.05). For the FS, J.H. generated higher ratings than the SHOs in 34 of 70 comparisons.
Reproducibility studies using a small number of raters also have limited generalizability. The optimum method would have examined the agreement of EDSS and FS scores within and between raters randomly selected from a large pool (Shrout and Fleiss, 1979). However, other EDSS and FS reproducibility studies have used fewer raters (e.g. Amato et al., 1987; Francis et al., 1991; Goodkin et al., 1992; Sharrack et al., 1999), which may explain why some inter-rater reproducibility results are so variable. Most importantly, we have examined the reliability of EDSS and FS in a manner representative of the clinical setting in which they are used.
Results from this and other studies have important implications for the use of the EDSS and FS. The finding that reliability is variable and has often failed to satisfy the criterion for group comparisons supports recommendations that raters of Kurtzke scales should be trained (Lechner-Scott et al., 1997), on the grounds that training improves reliability and is likely to improve responsiveness. It is, however, notable in the present study that the more experienced rater has higher reproducibility for the EDSS but lower reproducibility for most of the FS. Although the FS result is not expected, it may be explained by the limited reliability of single-item measures, an issue that is discussed later. This study has too few raters to determine the relationship between experience and reproducibility.
Perhaps more importantly, the finding that EDSS reliability rarely satisfies the criterion for individual comparisons calls into question the use of treatment failure as the primary outcome of trials for therapeutic interventions in multiple sclerosis. In this method, EDSS scores for individuals are compared in order to determine unequivocal deterioration in disability (Weinshenker et al., 1996). The demonstration in this study that a change of 0.5 EDSS points in the range 5.0–9.0 can be attributed to measurement error questions the recommendation that this degree of change is significant. Our results suggest that reliability must exceed 0.93 for this criterion to be adopted, which is consistent with the recommendations of others (Nunnally, 1978; Scientific Advisory Committee of the Medical Outcomes Trust, 1995).
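The link between reliability and the smallest change distinguishable from measurement error can be sketched with the textbook standard error of measurement, SEM = SD × √(1 − reliability). The sketch assumes a 95% interval of ±1.96 SEM around a single score; whether this matches the exact calculation behind the figures above is an assumption, and the standard deviation used is a hypothetical value, not one taken from this sample.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Textbook SEM: sample SD scaled by the unreliable share of variance."""
    return sd * math.sqrt(1 - reliability)

def ci_half_width(sd, reliability, z=1.96):
    """Half-width of the confidence interval around an individual score."""
    return z * standard_error_of_measurement(sd, reliability)

def minimum_reliability(sd, change, z=1.96):
    """Reliability at which the CI half-width just equals a given change."""
    return 1 - (change / (z * sd)) ** 2

# Hypothetical SD of 1.0 EDSS points (illustration only, not study data).
sd = 1.0
for r in (0.62, 0.78, 0.94):
    print(r, round(ci_half_width(sd, r), 2))
print(minimum_reliability(sd, change=0.5))
```

Under these assumptions, a change smaller than the printed half-width cannot be separated from measurement error for an individual patient, which is why individual comparisons demand much higher reliability than group comparisons.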
The implications of the validity and responsiveness results are particularly important. Although we have demonstrated that the EDSS in the range 5.0–9.0 measures disability and is able to discriminate disability from related health constructs, we have also demonstrated that it has a limited ability to discriminate between individuals and groups and has poor responsiveness. As Kurtzke pointed out, when measures are used to evaluate therapeutic effectiveness, the ability to differentiate between people and groups, and to detect change over time is essential. When these properties are not demonstrated, the finding that a therapeutic intervention appears ineffective may represent limitations of the measure rather than of the treatment.
Criticisms of the EDSS have resulted in research directed towards the development of new measures (Whitaker et al., 1995), such as the Multiple Sclerosis Functional Composite (MSFC; Cutter et al., 1999). This is a three-item measure developed by analysing longitudinal data sets from the placebo arms of clinical trials and natural history studies that contained both clinical and functional measures. From these data, clinically relevant variables with good measurement properties were selected. The three items are stand-alone measures of different clinical dimensions: the nine-hole peg test (arm dimension); the timed walk (leg dimension); and the 3 min version of the paced auditory serial addition test (PASAT3; cognitive dimension). Raw scores for each item are transformed into Z-scores (standard scores) so that they have a common metric, and then summed to generate a composite score. Selecting items from a pool on the basis of empirical performance is consistent with psychometric methods (Spector, 1992), and research is now required to examine whether the three items of the MSFC satisfy criteria as a summed rating scale and, if so, whether it is reliable, valid and responsive.
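The Z-score transformation and summation described above can be sketched as follows. This is a minimal illustration, not the published MSFC scoring procedure: standardizing against the sample itself and reversing the sign of the timed tests so that higher always means better are both assumptions made here, and all names and data are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values, higher_is_better=True):
    """Standardize raw scores to a common metric (mean 0, SD 1).
    Assumption: the reference mean and SD come from the sample itself."""
    m, s = mean(values), stdev(values)
    sign = 1 if higher_is_better else -1
    return [sign * (v - m) / s for v in values]

def composite_scores(peg_test_seconds, timed_walk_seconds, pasat3_correct):
    """Sum the three standardized items into one composite per patient."""
    arm = z_scores(peg_test_seconds, higher_is_better=False)  # slower = worse
    leg = z_scores(timed_walk_seconds, higher_is_better=False)
    cog = z_scores(pasat3_correct, higher_is_better=True)
    return [a + l + c for a, l, c in zip(arm, leg, cog)]

# Hypothetical raw scores for three patients (illustration only).
print(composite_scores([20.0, 25.0, 30.0], [5.0, 7.0, 9.0], [50, 40, 30]))
```

Standardization is what allows seconds and counts of correct responses to be added together; without it the item with the largest raw variance would dominate the total.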
The MSFC represents a development in measurement methods that is well recognized by psychometricians: the superiority of multi-item over single-item measures. The EDSS and each FS, like the Rankin and Ashworth scales (Rankin, 1957; Ashworth, 1964), are single-item measures. An alternative method of scaling the attributes of people is to combine multiple items, each of which measures a different aspect of the underlying construct. This is the basis of Likert or summed rating scales (Likert, 1932), examples of which are the BI, FIM, SF-36 and EuroQoL (EuroQoL Group, 1990). Likert demonstrated that scores based on the simple assignment of integral weights to items' responses (i.e. 0, 1, 2, 3, etc.) correlated 0.99 with Thurstone's more complicated scoring system (Thurstone, 1928) in which the step intervals for scaling items were equalized. Empirical evidence has demonstrated repeatedly that Likert's method of summing items rated on ordinal scales is a valid method of measurement. Furthermore, multi-item measures have been shown to be more reliable, valid and precise than single-item measures of psychological (Nunnally, 1978; McIver and Carmines, 1981; Spector, 1992) and health (McHorney et al., 1992) constructs. This study provides further support for the limitations of single-item health measures by demonstrating that the EDSS has measurement properties inferior to the BI and FIM.
The superiority of multi-item measures is easily explained. Conceptually, single items are unlikely fully to represent complex theoretical concepts (Nunnally, 1978) such as disability. This is exemplified by criticisms that the EDSS in the range studied here is too focused on ambulation and does not account for other aspects of disability (Sharrack and Hughes, 1996). Furthermore, single items are unable to make the fine differentiations among people that are desirable for most measurement problems (Nunnally, 1978). Consider the EDSS and the FIM. The EDSS defines 20 levels of disability. In contrast, each of the 18 FIM items measures a different aspect of disability on a seven-point response scale (1 = maximum disability, 7 = minimum disability). Item scores are summed to generate a total score which defines 108 disability levels.
Another limitation of single-item measures is that they are unreliable (Nunnally, 1978). This is the most likely explanation for the variable and largely poor reliability results of the EDSS and FS that the different studies have reported. However, unreliability averages out when scores on numerous items are summed to obtain a total score (measurement error is random). Perhaps the most important limitation of single-item measures is that their measurement properties are difficult to examine (McIver and Carmines, 1981). This is exemplified by the FS: determining their validity is difficult. Indeed, we have not attempted this.
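The way random error averages out when items are summed can be illustrated with the Spearman-Brown prophecy formula from classical test theory. The formula is brought in here as an illustration and is not an analysis performed in this study; it assumes parallel items of equal reliability, and the item reliability used below is hypothetical.

```python
def spearman_brown(item_reliability, n_items):
    """Predicted reliability of a total score formed by summing n parallel
    items that each have the given reliability."""
    return n_items * item_reliability / (1 + (n_items - 1) * item_reliability)

# Hypothetical single-item reliability of 0.5 (illustration only).
for k in (1, 4, 10, 18):
    print(k, round(spearman_brown(0.5, k), 3))
```

Under these assumptions a single item with modest reliability yields a far more reliable total once enough comparable items are added, which is the theoretical basis for preferring summed rating scales.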
In conclusion, Kurtzke's scales represent an important milestone in the history of the measurement of disease severity in multiple sclerosis. Widespread international use over 45 years confirms their clinical usefulness. However, limitations in their scientific rigour, predictable from psychometric theory, severely question their role as evaluative outcome measures in future therapeutic studies of multiple sclerosis. Recently, a number of multi-item measures for multiple sclerosis have been developed using psychometric techniques (Vickrey et al., 1995; Cella et al., 1996; LaRocca et al., 1996; Sharrack et al., 1996a; Ford et al., 1997). Studies are required to examine and compare the measurement properties of these instruments so that choice for future treatment trials can be empirically led.
| Acknowledgments |
|---|
We wish to thank the staff of the Neurorehabilitation Unit and the 10 rotating neurology SHOs for collecting outcomes data routinely, and Professor Ian McDonald for his helpful comments on an earlier draft. Between September 1995 and November 1996, Dr Hobart was funded by a Wellcome Training Fellowship in Health Services Research.
| References |
|---|
Amato MP, Groppi C, Siracusa G, Fratiglioni L. Inter and intra-observer reliability in Kurtzke scoring systems in multiple sclerosis. Ital J Neurol Sci 1987; Suppl 6: 129–31.
Amato MP, Fratiglioni L, Groppi C, Siracusa G, Amaducci L. Interrater reliability in assessing functional systems and disability on the Kurtzke scale in multiple sclerosis. Arch Neurol 1988; 45: 746–8.
American Educational Research Association, National Council on Measurement used in Education. Technical recommendations for achievement tests. Washington (DC): National Education Association; 1955.
Anastasi A, Urbina S. Psychological testing. 7th edn. Upper Saddle River (NJ): Prentice Hall; 1997.
Ashworth B. Preliminary trial of carisoprodol in multiple sclerosis. Practitioner 1964; 192: 540–2.
Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep 1966; 19: 3–11.
Brook RH, Ware JE Jr, Davies-Avery A, Stewart AL, Donald CA, Rogers WH, et al. Overview of adult health measures fielded in Rand's Health Insurance Study. Med Care 1979; 17 (7 Suppl): 1–131.
Brosseau L. The inter-rater reliability and construct validity of the Functional Independence Measure for multiple sclerosis subjects. Clin Rehabil 1994; 8: 107–15.
Cella DF, Dineen K, Arnason B, Reder A, Webster KA, Karabatsos G, et al. Validation of the functional assessment of multiple sclerosis quality of life instrument. Neurology 1996; 47: 129–39.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960; 20: 37–46.
Collin C, Wade DT, Davies S, Horne V. The Barthel ADL Index: a reliability study. Int Disabil Stud 1988; 10: 61–3.
Cronbach LJ. Essentials of psychological testing. New York: Harper and Row; 1949.
Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297–334.
Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bull 1955; 52: 281–302.
Cutter GR, Baier ML, Rudick RA, Cookfair DL, Fischer JS, Petkau J, et al. Development of a multiple sclerosis functional composite as a clinical trial outcome measure. Brain 1999; 122: 871–82.
Eisen M, Ware JE Jr, Donald CA, Brook RH. Measuring components of children's health status. Med Care 1979; 17: 902–21.
EuroQoL Group. EuroQoL: a new facility for the measurement of health-related quality of life. Health Policy 1990; 16: 199–208.
Fitzpatrick R, Ziebland S, Jenkinson C, Mowat A, Mowat A. Transition questions to assess outcomes in rheumatoid arthritis. Br J Rheumatol 1993; 32: 807–11.
Fitzpatrick R, Davey C, Buxton MJ, Jones DR. Evaluating patient-based outcome measures for use in clinical trials. [Review]. Health Technol Assess 1998; 2: i–iv, 1–74.
Ford HL, Tennant A, Johnson MH. The Leeds MSQoL scale: a disease specific measure of quality of life in multiple sclerosis [abstract]. J Neurol Neurosurg Psychiatry 1997; 62: 210.
Francis DA, Bain P, Swan AV, Hughes RA. An assessment of disability rating scales used in multiple sclerosis. Arch Neurol 1991; 48: 299–301.
Freeman JA, Langdon DW, Hobart JC, Thompson AJ. The impact of inpatient rehabilitation on progressive multiple sclerosis. Ann Neurol 1997; 42: 236–44.
Goldberg DP, Hillier VF. A scaled version of the General Health Questionnaire. Psychol Med 1979; 9: 139–45.
Goodkin DE, Cookfair D, Wende K, Bourdette D, Pullicino P, Scherokman B, et al. Inter- and intrarater scoring agreement using grades 1.0 to 3.5 of the Kurtzke Expanded Disability Status Scale (EDSS). Neurology 1992; 42: 859–63.
Granger CV, Hamilton BB, Keith RA, Zielezny M, Sherwin FS. Advances in functional assessment for medical rehabilitation. Top Geriatric Rehabil 1986; 1: 59–74.
Granger CV, Cotter ACR, Hamilton BB, Fiedler RC, Hens MM. Functional assessment scales: a study of persons with multiple sclerosis. Arch Phys Med Rehabil 1990; 71: 870–5.
Guilford JP. Psychometric methods. New York: McGraw-Hill; 1936.
Guilford JP. Psychometric methods. 2nd edn. New York: McGraw-Hill; 1954.
Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis 1987; 40: 171–8.
Harwood RH, Ebrahim S. Manual of the London Handicap Scale. Nottingham: Department of Health Care of the Elderly, University of Nottingham; 1995.
Holmes W, Bix B, Shea J. SF-20 score and item distributions in a human immunodeficiency virus-seropositive sample. Med Care 1996; 34: 562–9.
Howard KI, Forehand GC. A method for correcting item–total correlations for the effect of relevant item inclusion. Educ Psychol Meas 1962; 22: 731–5.
IFNB Multiple Sclerosis Study Group. Interferon beta-1b is effective in relapsing–remitting multiple sclerosis. I. Clinical results of a multicenter, randomized, double-blind, placebo-controlled trial. Neurology 1993; 43: 655–61.
Jacobs LD, Cookfair DL, Rudick RA, Herndon RM, Richert JR, Salazar AM, et al. Intramuscular interferon beta-1a for disease progression in relapsing multiple sclerosis. Ann Neurol 1996; 39: 285–94.
Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989; 27 (3 Suppl): S178–89.
Kurtzke JF. A new scale for evaluating disability in multiple sclerosis. Neurology 1955; 5: 580–3.
Kurtzke JF. On the evaluation of disability in multiple sclerosis. Neurology 1961; 11: 686–94.
Kurtzke JF. Further notes on disability evaluation in multiple sclerosis, with scale modifications. Neurology 1965; 15: 654–61.
Kurtzke JF. Neurologic impairment in multiple sclerosis and the disability status scale. Acta Neurol Scand 1970; 46: 493–512.
Kurtzke JF. Rating neurologic impairment in multiple sclerosis: an expanded disability status scale (EDSS). Neurology 1983; 33: 1444–52.
Kurtzke JF, Berlin L. The effects of isoniazid on patients with multiple sclerosis. Am Rev Tuberc 1954; 70: 577–92.
LaRocca NG, Ritvo PG, Miller DM, Fischer JS, Andrews H, Paty DW. 'Quality of life' assessment in multiple sclerosis clinical trials: current status and strategies for improving multiple sclerosis clinical trial design. In: Goodkin DE, Rudick RA, editors. Multiple sclerosis: advances in clinical trial design, treatment and future perspectives. London: Springer; 1996. p. 145–60.
Lechner-Scott J, Huber S, Kappos L. Expanded Disability Status Scale (EDSS) training for MS-multicenter-trials [abstract]. J Neurol 1997; 244 Suppl 3: S25.
Lee R, Foley PP. Is the validity of a test constant throughout the test score range? J Appl Psychol 1986; 71: 641–4.
Liang MH, Larson MG, Cullen KE, Schwartz JA. Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum 1985; 28: 542–7.
Likert RA. A technique for the development of attitudes. Arch Psychol 1932; 140: 5–55.
Mahoney FI, Barthel DW. Functional evaluation: the Barthel Index. Maryland Med J 1965; 14: 61–5.
Marolf MV, Vaney C, Konig N, Schenj T, Prosiegel M. Evaluation of disability in multiple sclerosis patients: a comparative study of the Functional Independence Measure, the Extended Barthel Index and the Expanded Disability Status Scale. Clin Rehabil 1996; 10: 309–13.
McDowell I, Jenkinson C. Development standards for health measures. J Health Serv Res Policy 1996; 1: 238–46.
McDowell I, Newell C. Measuring health: a guide to rating scales. New York: Oxford University Press; 1987.
McHorney CA, Tarlov AR. Individual-patient monitoring in clinical practice: are available health status surveys adequate? Qual Life Res 1995; 4: 293–307.
McHorney CA, Ware JE Jr, Rogers W, Raczek AE, Lu JF. The validity and relative precision of MOS short- and long-form health status scales and Dartmouth COOP charts. Med Care 1992; 30 (5 Suppl): MS253–65.
McHorney CA, Ware JE Jr, Raczek AE. The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993; 31: 247–63.
McHorney CA, Ware JE Jr, Lu JF, Sherbourne CD. The MOS 36-Item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care 1994; 32: 40–66.
McIver JP, Carmines EG. Unidimensional scaling. Beverly Hills (CA): Sage; 1981.
Norman GR, Stratford P, Regehr G. Methodological problems in the retrospective computation of responsiveness to change: the lesson of Cronbach. J Clin Epidemiol 1997; 50: 869–79.
Noseworthy JH, Vandervoort MK, Wong CJ, Ebers GC. Interrater variability with the Expanded Disability Status Scale (EDSS) and Functional Systems (FS) in a multiple sclerosis clinical trial. Neurology 1990; 40: 971–5.
Nunnally JC. Introduction to statistics for psychology and education. New York: McGraw-Hill; 1975.
Nunnally JC. Psychometric theory. 2nd edn. New York: McGraw-Hill; 1978.
Poser CM, Paty DW, McDonald WI, Davis FA, Ebers GC, et al. New diagnostic criteria for multiple sclerosis: guidelines for research protocols. Ann Neurol 1983; 13: 227–31.
Rankin J. Cerebral vascular accidents in patients over the age of 60: II. Prognosis. Scot Med J 1957; 2: 200–15.
Scientific Advisory Committee of the Medical Outcomes Trust. Instrument review criteria. Med Outcomes Trust Bull 1995; 3: I–IV.
Sharrack B, Hughes RA. Clinical scales for multiple sclerosis. [Review]. J Neurol Sci 1996; 135: 1–9.
Sharrack B, Hughes RA, Soudain S. Guy's Neurological Disability Scale [abstract]. J Neurol 1996a; 243 Suppl 2: S32.
Sharrack B, Hughes RA, Soudain S. Interrater reliability of clinical rating scales used in multiple sclerosis [abstract]. J Neurol Neurosurg Psychiatry 1996b; 61: 217–8.
Sharrack B, Hughes RA, Soudain S, Dunn G. The psychometric properties of clinical rating scales used in multiple sclerosis. Brain 1999; 122: 141–59.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420–8.
Shumaker SA, Ellis S, Naughton M. Assessing health-related quality of life in HIV disease: key measurement issues. Qual Life Res 1997; 6: 475–80.
Spector PE. Summated rating scale construction: an introduction. Newbury Park (CA): Sage; 1992.
Stanley JC. Reliability. In: Thorndike RL. Educational measurement. 2nd edn. Washington (DC): American Council on Education; 1971.
Stewart AL, Ware JE Jr, editors. Measuring functioning and well-being: the Medical Outcomes Study approach. Durham (NC): Duke University Press; 1992.
Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. Oxford: Oxford University Press; 1989.
Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 2nd edn. Oxford: Oxford University Press; 1995.
Thurstone LL. Attitudes can be measured. Am J Sociol 1928; 33: 529–54.
van Bennekom CA, Jelles F, Lankhorst GJ, Bouter LM. Responsiveness of the Rehabilitation Activities Profile and the Barthel Index. J Clin Epidemiol 1996; 49: 39–44.
Verdier-Taillefer MH, Zuber M, Lyon-Caen O, Clanet M, Gout O, Louis C, et al. Observer disagreement in rating neurologic impairment in multiple sclerosis: facts and consequences. Eur Neurol 1991; 31: 117–9.
Veterans Administration Multiple Sclerosis Study Group. Isoniazid in treatment of multiple sclerosis. J Am Med Assoc 1957; 163: 168–72.
Vickrey BG, Hays RD, Harooni R, Myers LW, Ellison GW. A health-related quality of life measure for multiple sclerosis. Qual Life Res 1995; 4: 187–206.
Wade DT. Measurement in neurological rehabilitation. Oxford: Oxford University Press; 1992.
Ware JE Jr. SF-36 Health Survey manual and interpretation guide. Boston (MA): Nimrod Press; 1993.
Ware JE Jr, Davies-Avery A, Donald CA. Conceptualisation and measurement of health for adults in the health insurance study: Vol. V: general health perceptions. Santa Monica (CA): The Rand Corporation; 1978.
Ware JE Jr, Brook RH, Davies-Avery A, Williams KN, Stewart AL, Rogers WH, et al. Conceptualisation and measurement of health for adults in the health insurance study: Vol. I: model of health and methodology. Santa Monica (CA): The Rand Corporation; 1980.
Ware JE Jr, Kosinski MA, Keller SD. SF-36 physical and mental health summary scales: a user's manual. Boston (MA): Health Institute, New England Medical Center; 1994.
Ware JE Jr, Harris WJ, Gandek B, Rogers BW, Reeses PR. MAP-R for windows: multitrait/multi-item analysis program – revised user's guide. Boston (MA): Health Assessment Laboratory; 1997.
Weinshenker BG, Issa M, Baskerville J. Meta-analysis of the placebo-treated groups in clinical trials of progressive MS. Neurology 1996; 46: 1613–9.
Whitaker JN, McFarland HF, Rudge P, Reingold SC. Outcomes assessment in multiple sclerosis clinical trials: a critical analysis. Mult Scler 1995; 1: 37–47.
Williams JI, Naylor CD. How should health status measures be assessed? Cautionary notes on procrustean frameworks. J Clin Epidemiol 1992; 45: 1347–51.
Willoughby EW, Paty DW. Scales for rating impairment in multiple sclerosis: a critique. Neurology 1988; 38: 1793–8.
Received July 7, 1999. Revised October 15, 1999. Accepted November 15, 1999.
||||
![]() |
O.R. Pearson, M.E. Busse, R.W.M. van Deursen, and C.M. Wiles Quantification of walking mobility in neurological disorders QJM, August 1, 2004; 97(8): 463 - 475. [Full Text] [PDF] |
||||
![]() |
C. McGuigan and M. Hutchinson Confirming the validity and responsiveness of the Multiple Sclerosis Walking Scale-12 (MSWS-12) Neurology, June 8, 2004; 62(11): 2103 - 2105. [Abstract] [Full Text] [PDF] |
||||
![]() |
C McGuigan and M Hutchinson The multiple sclerosis impact scale (MSIS-29) is a reliable and sensitive measure J. Neurol. Neurosurg. Psychiatry, February 1, 2004; 75(2): 266 - 269. [Abstract] [Full Text] [PDF] |
||||
![]() |
J Hobart, N Kalkers, F Barkhof, B Uitdehaag, C Polman, and A Thompson Outcome measures for multiple sclerosis clinical trials: relative measurement precision of the Expanded Disability Status Scale and Multiple Sclerosis Functional C omposite Multiple Sclerosis, February 1, 2004; 10(1): 41 - 46. [Abstract] [PDF] |
||||
![]() |
G. T. Ingle, V. L. Stevenson, D. H. Miller, and A. J. Thompson Primary progressive multiple sclerosis: a 5-year clinical and MR study Brain, November 1, 2003; 126(11): 2528 - 2536. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. M Gold, H. Schulz, A. Monch, K.-H. Schulz, and C. Heesen Cognitive impairment in multiple sclerosis does not affect reliability and validity of self-report health measures Multiple Sclerosis, August 1, 2003; 9(4): 404 - 410. [Abstract] [PDF] |
||||
![]() |
A Riazi, J C Hobart, D L Lamping, R Fitzpatrick, and A J Thompson Evidence-based measurement in multiple sclerosis: the psychometric properties of the physical and psychological dimensions of three quality of life rating scales Multiple Sclerosis, August 1, 2003; 9(4): 411 - 419. [Abstract] [PDF] |
||||
![]() |
E L J Hoogervorst, M J Eikelenboom, B M J Uitdehaag, and C H Polman One year changes in disability in multiple sclerosis: neurological examination compared with patient self report J. Neurol. Neurosurg. Psychiatry, April 1, 2003; 74(4): 439 - 442. [Full Text] [PDF] |
||||
![]() |
J Lechner-Scott, L Kappos, M Hofman, C H Polman, H Ronner, X Montalban, M Tintore, M Frontoni, C Buttinelli, M P Amato, et al. Can the Expanded Disability Status Scale be assessed by telephone? Multiple Sclerosis, April 1, 2003; 9(2): 154 - 159. [Abstract] [PDF] |
||||
![]() |
M W Nortvedt and T Riise The use of quality of life measures in multiple sclerosis research Multiple Sclerosis, February 1, 2003; 9(1): 63 - 72. [Abstract] [PDF] |
||||
![]() |
J. C. Hobart, A. Riazi, D. L. Lamping, R. Fitzpatrick, and A. J. Thompson Measuring the impact of MS on walking ability: The 12-Item MS Walking Scale (MSWS-12) Neurology, January 14, 2003; 60(1): 31 - 36. [Abstract] [Full Text] [PDF] |
||||
![]() |
A Riazi, J C Hobart, D L Lamping, R Fitzpatrick, and A J Thompson Multiple Sclerosis Impact Scale (MSIS-29): reliability and validity in hospital based samples J. Neurol. Neurosurg. Psychiatry, December 1, 2002; 73(6): 701 - 704. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. J Thompson Developing clinical outcome measures in multiple sclerosis: an evolving process Multiple Sclerosis, October 1, 2002; 8(5): 357 - 358. [PDF] |
||||
![]() |
R A Rudick, G Cutter, and S Reingold The Multiple Sclerosis Functional Composite: a new clinical outcome measure for multiple sclerosis trials Multiple Sclerosis, October 1, 2002; 8(5): 359 - 365. [Abstract] [PDF] |
||||
![]() |
F. Bonneville, D. M. Moriarty, B. S.Y. Li, J. S. Babb, R. I. Grossman, and O. Gonen Whole-Brain N-Acetylaspartate Concentration: Correlation with T2-Weighted Lesion Volume and Expanded Disability Status Scale Score in Cases of Relapsing-Remitting Multiple Sclerosis AJNR Am. J. Neuroradiol., March 1, 2002; 23(3): 371 - 375. [Abstract] [Full Text] [PDF] |
||||
![]() |
E M Cheng, R D Hays, L W Myers, G W Ellison, M Beckstrand, and B G Vickrey Factors related to agreement between self-reported and conventional Expanded Disability Status Scale (EDSS) scores Multiple Sclerosis, December 1, 2001; 7(6): 405 - 410. [Abstract] [PDF] |
||||
![]() |
E L. Hoogervorst, N F Kalkers, L M. van Winsen, B M. Uitdehaag, and C H Polman Differential treatment effect on measures of neurologic exam, functional impairment and patient self-report in multiple sclerosis Multiple Sclerosis, October 1, 2001; 7(5): 335 - 339. [Abstract] [PDF] |
||||
![]() |
J Hobart, J Freeman, D Lamping, R Fitzpatrick, and A Thompson The SF-36 in multiple sclerosis: why basic assumptions must be tested J. Neurol. Neurosurg. Psychiatry, September 1, 2001; 71(3): 363 - 370. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. A. Freeman, J. C. Hobart, and A. J. Thompson Does adding MS-specific items to a generic measure (the SF-36) improve measurement? Neurology, July 10, 2001; 57(1): 68 - 74. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Bowen, L. Gibbons, A. Gianas, and G. H Kraft Self-administered Expanded Disability Status Scale with functional system scores correlates well with a physician-administered test Multiple Sclerosis, June 1, 2001; 7(3): 201 - 206. [Abstract] [PDF] |
||||
![]() |
J. A. Cohen, G. R. Cutter, J. S. Fischer, A. D. Goodman, F. R. Heidenreich, A. J. Jak, J. E. Kniker, M. F. Kooijmans, J. M. Lull, A. W. Sandrock, et al. Use of the Multiple Sclerosis Functional Composite as an Outcome Measure in a Phase 3 Clinical Trial Arch Neurol, June 1, 2001; 58(6): 961 - 967. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Hobart, D. Lamping, R. Fitzpatrick, A. Riazi, and A. Thompson The Multiple Sclerosis Impact Scale (MSIS-29): A new patient-based outcome measure Brain, May 1, 2001; 124(5): 962 - 973. [Abstract] [Full Text] [PDF] |
||||
![]() |
E.L.J. Hoogervorst, L.M.L. van Winsen, M.J. Eikelenboom, N.F. Kalkers, B.M.J. Uitdehaag, and C.H. Polman Comparisons of patient self-report, neurologic examination, and functional impairment in MS Neurology, April 10, 2001; 56(7): 934 - 937. [Abstract] [Full Text] [PDF] |
||||
![]() |
S M Gold, C Heesen, H Schulz, U Guder, A Monch, J Gbadamosi, C Buhmann, and K H Schulz Disease specific quality of life instruments in multiple sclerosis: Validation of the Hamburg Quality of Life Questionnaire in Multiple Sclerosis (HAQUAMS) Multiple Sclerosis, April 1, 2001; 7(2): 119 - 130. [Abstract] [PDF] |
||||
![]() |
A. J. Manson, H. Hanagasi, K. Turner, P. N. Patsalos, P. Carey, N. Ratnaraj, and A. J. Lees Intravenous apomorphine therapy in Parkinson's disease: Clinical and pharmacokinetic observations Brain, February 1, 2001; 124(2): 331 - 340. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. J Thompson Neurological rehabilitation: from mechanisms to management J. Neurol. Neurosurg. Psychiatry, December 1, 2000; 69(6): 718 - 722. [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








