
Getting the measure of spasticity in multiple sclerosis: the Multiple Sclerosis Spasticity Scale (MSSS-88)

J. C. Hobart, A. Riazi, A. J. Thompson, I. M. Styles, W. Ingram, P. J. Vickery, M. Warner, P. J. Fox, J. P. Zajicek
DOI: http://dx.doi.org/10.1093/brain/awh675; pp. 224-234. First published online: 9 November 2005


Spasticity is most commonly defined as an inappropriate, velocity-dependent increase in muscle tonic stretch reflexes, due to the amplified reactivity of motor segments to sensory input. It forms one component of the upper motor neuron syndrome and often leads to muscle stiffness and disability. Spasticity can, therefore, be measured through electrophysiological, biomechanical and clinical evaluation, the last most commonly using the Ashworth scale. None of these techniques incorporate the patient experience of spasticity, nor how it affects people's daily lives. Consequently, we set out to construct a rating scale to quantify the perspectives of the impact of spasticity on people with multiple sclerosis. Qualitative methods (in-depth patient interviews and focus groups, expert opinion and literature review) were used to develop a conceptual framework of spasticity impact, and to generate a pool of items with the potential to convert this framework into a rating scale with multiple dimensions. This item pool was administered, in the form of a questionnaire, to a sample of people with multiple sclerosis and spasticity. Guided by Rasch analysis, we constructed and validated a rating scale for each component of the conceptual framework. Decisions regarding item selection were based on the integration and assimilation of seven specific analyses including clinical meaning, ordering of thresholds, fit statistics and differential item functioning. The qualitative phase (17 patient interviews, 3 focus groups) generated 144 potential scale items and a conceptual model with eight components addressing symptoms (muscle stiffness, pain and discomfort and muscle spasms), physical impact (activities of daily living, walking and body movements) and psychosocial impact (emotional health, social functioning). The first postal survey was sent to 272 people with multiple sclerosis and had a response rate of 88%.
Findings supported the development of scales for each component but demonstrated that five item response options were too many. The 144-item questionnaire, reformatted with four-item response options, was administered with four validating instruments to an independent sample of 259 people with multiple sclerosis (response rate 78%). From the responses, an 88-item instrument with eight subscales was developed that satisfied criteria for reliable and valid measurement. Correlations with other measures were consistent with predictions. The 88-item Multiple Sclerosis Spasticity Scale (MSSS-88) is a reliable and valid, patient-based, interval-level measure of the impact of spasticity in multiple sclerosis. It has the potential to advance outcomes measurement in clinical trials and clinical practice, and provides a new perspective in the clinical evaluation of spasticity.

  • spasticity measurement
  • multiple sclerosis
  • Multiple Sclerosis Spasticity Scale (MSSS-88)
  • quality of life measurement
  • Rasch analysis
  • ADL = activities of daily living
  • FAMS = Functional Assessment in Multiple Sclerosis
  • MSSS-88 = Multiple Sclerosis Spasticity Scale
  • MSIS-29 = Multiple Sclerosis Impact Scale


Spasticity is common, clinically and pathophysiologically complex, and disabling. It affects at least 35% of people post-stroke (Watkins et al., 2002) and up to 90% of people with multiple sclerosis at some point (Paty and Ebers, 1998). A range of treatments is available including spasticity reduction strategies, specialist rehabilitation therapy, oral medications, intramuscular and intrathecal injections, intrathecal infusions and surgery. Problematic spasticity typically requires a combination of treatments (Crayton et al., 2004), and should involve a patient-focused, co-ordinated, multidisciplinary team approach (Thompson et al., 2005). These facts emphasize that scientifically sound and clinically meaningful spasticity measurement is indispensable to clinical practice and research in this area (Voerman et al., 2005).

Spasticity measurement, like spasticity management, is complicated. In broad terms, measurement instruments can be categorized into neurophysiological methods (Voerman et al., 2005), biomechanical techniques (Wood et al., 2005) and clinical scales (Platz et al., 2005). The clinical meaningfulness of neurophysiological and biomechanical approaches has been questioned, as they focus on highly specific examinations (e.g. H-reflex or single joint analysis), correlate poorly with clinical indicators of spasticity and have problems with reliability and sensitivity (Voerman et al., 2005; Wood et al., 2005). Clinical scales used in the measurement of spasticity have also been found wanting (Platz et al., 2005). Of the 24 scales recently reviewed (Platz et al., 2005), 18 were single-item measures and, as a consequence, have poor reliability (McHorney et al., 1992), validity (Manning et al., 1982; Hobart, 2003) and responsiveness (Sloan et al., 2002). Only three scales had more than three items: two of these assessed resistance to passive movement, and the third measured the extensor toe sign. No scale had been developed to address the broader consequences of spasticity for the patient.

If spasticity management is to be patient-focused, clinical trials and clinical practice need rigorous measurement methods that capture patients' experiences and perceptions of spasticity, and complement the existing range of measures. That challenge, which has not been met by existing scales (Platz et al., 2005), was the aim of this study.



There were three stages. First, we used a range of qualitative studies to develop a conceptual framework of spasticity impact, and a pool of potential items hypothesized to convert this framework into a scale. Second, we administered the items, as a questionnaire, to a sample of people with multiple sclerosis and spasticity and, using Rasch analysis, undertook the preliminary steps of constructing a subscale for each component of the conceptual framework. Third, we undertook a second survey to finalize and validate the instrument. The research ethics committees of Derriford Hospital and the National Hospital for Neurology and Neurosurgery (NHNN) approved the study.

Stage 1: conceptual model formation and item generation

Four pieces of qualitative work were undertaken to develop a conceptual framework of spasticity impact and to generate a pool of items with the potential to convert (operationalize) this framework into a scale with multiple subscales. First, in-depth, semi-structured interviews were conducted with individual multiple sclerosis patients from NHNN, until no new themes emerged. Second, three in-depth semi-structured focus groups were conducted with multiple sclerosis patients from Derriford Hospital. Patients were chosen to ensure a wide variance of spasticity severity, age, sex, disease duration and disease type. Interviews and focus groups were tape-recorded, transcribed and content analysed (WINMAX; Kuckartz, 1996). Third, a comprehensive literature review was undertaken to identify relevant health areas and potential items. Lastly, expert opinion on the impact of spasticity was sought from neurologists, spasticity nurses, multiple sclerosis nurses and rehabilitation staff.

A preliminary questionnaire was formatted and pre-tested in a small group of patients with multiple sclerosis and variable degrees of spasticity.

Stage 2: first postal survey

The questionnaire was posted to a random half-sample of the 544 patients from the Cannabinoids in Multiple Sclerosis study (CAMS; Zajicek et al., 2003) who had commenced trial medication and were still under follow-up. To encourage high response rates we used personalized letters, standardized instructions and reminders for non-responders at 3 and 5 weeks.
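As an illustration of the sampling step, the sketch below splits 544 hypothetical patient identifiers into two random halves of 272. The identifiers, the seed and the function name `split_random_halves` are assumptions made for this example; the randomization procedure actually used in the study is not described in the text.

```python
import random


def split_random_halves(patient_ids, seed=0):
    """Shuffle the identifiers and return two equally sized random halves."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers 1..544 standing in for the trial cohort.
first_half, second_half = split_random_halves(range(1, 545), seed=2005)
print(len(first_half), len(second_half))
```

Holding back the second half leaves an independent sample for a later survey, which is how the second half of the cohort is used in stage 3.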

Analysis plans

Scale development was guided by Rasch measurement principles (Rasch, 1960) and analyses (Andrich et al., 1997–2004). The key principle is that the mathematical (Rasch) model articulates a set of requirements that must be met for rating scale data to generate internally valid, equal-interval measurements that are stable (invariant) across items and people. In contrast, scales whose development is guided by traditional psychometric methods generate ordinal scores whose invariance is unknown (Wright and Linacre, 1989).

We constructed a scale for each area defined as important to patients by the qualitative studies. The aim was that each scale consisted of a set of clinically meaningful items that satisfied requirements for measurement. This goal was achieved by choosing a set of items hypothesized to constitute a scale for each area, analysing the observed data against measurement criteria and making decisions on item selection and deletion. Appraisals according to these criteria were not conducted singularly and sequentially, but simultaneously and interactively within the specific context of the item set being examined. The seven measurement criteria were:

Clinical meaning. We examined all items in each set to judge the extent to which they were clinically cohesive. Items deemed non-specific were considered for deletion.

Thresholds for item response options. For each item, the use of response categories scored with successive integer scores (1 = not at all to 5 = extremely) implies a continuum of increasing impact, from less (not at all) to more (extremely). This assumption was tested by examining the ordering of thresholds (or points of crossover between two adjacent response categories) ascertained by the Rasch analysis (Andrich, 1978). A threshold is the point on the measurement continuum defined by a scale (e.g. degree of muscle stiffness) at which the probability of responding to adjacent categories (e.g. ‘not at all’ and ‘a little’) is equal. Disordered thresholds imply scoring functions that are not working as intended (Andrich, 1978). Such items were considered for deletion.
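The definition of a threshold can be made concrete with a small numerical sketch. It assumes the usual polytomous Rasch parameterization, in which the log-odds of responding in category k rather than category k − 1 equal the person location minus the k-th threshold. The threshold values are hypothetical, not estimates from this study, and the code is not the software used for the reported analyses.

```python
import math


def category_probabilities(theta, thresholds):
    """Category probabilities for one item under a polytomous Rasch model.

    theta: person location in logits.
    thresholds: threshold locations in logits (item location included),
    one per boundary between adjacent response categories.
    """
    # Category 0 has an empty sum; category k accumulates (theta - threshold_j).
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_ordered(thresholds):
    """True when every threshold is higher than the one before it."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


example_thresholds = [-1.5, 0.0, 1.5]  # hypothetical values for a 4-category item
probs = category_probabilities(-1.5, example_thresholds)
print([round(p, 3) for p in probs])
print(thresholds_ordered(example_thresholds))
```

With the person located exactly at the first threshold, the first two categories are equally probable, which is the crossover point described above. A list such as `[-1.5, 0.8, 0.2]` would be reported as disordered.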

Item fit statistics. Rasch analysis tests the extent to which the observed data (patients' responses to items) accord with (fit) the responses expected by a mathematical (Rasch) model. Misfit implies an item is not working as intended in a scale, and may be regarded as not measuring the construct under consideration. There are many methods of examining the fit of data to the model, and no method alone is sufficient to make a judgement about fit. We examined three indicators. The first was log residuals, which summarize the difference between observed and expected responses to an item across all people (item–person interaction). The second was chi square values, which summarize the difference between observed and expected responses to an item for class intervals of people who have relatively similar levels of disability (item–trait interaction). The third was item characteristic curves (ICC), which display graphically the expected responses across the continuum of person scores and the observed values for each class interval of person scores. There are no absolute criteria for interpreting fit statistics. It is more meaningful to interpret them together, and in light of the clinical usefulness of an item set.

Item locations. The items of a scale define the continuum on which people are measured. Rasch analysis locates items and people on this continuum. Ideally, and logically, items should be evenly spread over a reasonable range and targeted to the people they are measuring. Items with similar locations on the continuum indicate that one of them might be redundant.

Differential item functioning (DIF). Stable measurement rulers are required for people to be measured precisely and validly (Linacre et al., 1994). That is, the items of a scale are required to perform similarly across different groups of people. More specifically, for any given level of disability the expected value of an item is required to be the same irrespective of which group a person belongs to. We examined all items for the extent to which their functioning was differentially affected by gender, age, mobility level (unaided, with aid and wheelchair user) and degree of spasticity (self-reported as minimal, mild, moderate or severe). Items demonstrating DIF, determined by statistical (ANOVA) and visual (ICC) tests, were considered for deletion (Hagquist and Andrich, 2004).

Correlations between standardized residuals. A residual is the difference between the observed and expected response for a person to an item. A standardized residual is computed by squaring and summing all residuals for an item and dividing this value by its standard deviation. Correlations between residuals assess the extent to which the response to one item is biased by the response to another. The significance of these values depends on sample size and item number. Values of >0.30 imply dependency among items and were used to identify items for evaluation (Andrich, 1988).
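The screening rule for residual correlations can be sketched as follows. The example assumes the standardized residuals have already been produced by a Rasch analysis and are supplied as one list per item; the item labels and values are invented, and applying the 0.30 cut-off to the signed correlation is our reading of the rule rather than a detail stated in the text.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_dependent_pairs(residuals, cutoff=0.30):
    """Return item pairs whose residual correlation exceeds the cut-off.

    residuals: dict mapping an item label to its standardized residuals,
    one value per person, in the same person order for every item.
    """
    flagged = []
    for item_a, item_b in combinations(residuals, 2):
        r = pearson(residuals[item_a], residuals[item_b])
        if r > cutoff:
            flagged.append((item_a, item_b, round(r, 2)))
    return flagged


# Hypothetical residuals for three items and five people.
example = {
    "item_a": [1.0, -0.5, 0.3, -1.2, 0.8],
    "item_b": [0.9, -0.4, 0.2, -1.0, 0.7],
    "item_c": [-0.2, 0.6, -0.9, 0.1, 0.4],
}
print(flag_dependent_pairs(example))
```

Flagged pairs are candidates for review only; as with the other criteria, the decision to delete an item was taken in the context of the whole item set.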

Person separation index (PSI). This reliability statistic, analogous to Cronbach's alpha (Andrich, 1982), quantifies the error associated with the measurements of people in this sample. Higher values indicate greater reliability. When items were deleted the impact on reliability was determined.
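A minimal sketch of the person separation index follows, using the definition given in the footnote to Table 2 (true variance, that is total minus error variance, divided by total variance). It assumes that total variance is the sample variance of the person estimates and that error variance is the mean of the squared standard errors; both are common conventions rather than details taken from this study, and the input values are invented.

```python
from statistics import mean, variance


def person_separation_index(estimates, standard_errors):
    """PSI = (total variance - error variance) / total variance."""
    total_variance = variance(estimates)  # sample variance of person locations
    error_variance = mean([se ** 2 for se in standard_errors])
    return (total_variance - error_variance) / total_variance


# Hypothetical person locations (logits) and their standard errors.
locations = [-2.0, -1.0, 0.0, 1.0, 2.0]
errors = [0.5, 0.5, 0.5, 0.5, 0.5]
print(round(person_separation_index(locations, errors), 2))
```

Larger standard errors relative to the spread of person locations pull the index down, which is why item deletions were checked for their effect on reliability.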

Stage 3: second postal survey


The remaining random half sample from the CAMS study cohort was surveyed, excluding 13 local people participating in another study. This sample was divided into random half samples that received booklets containing the new spasticity scale, a self-report spasticity grading (0 = minimal; 1 = mild; 2 = moderate; 3 = severe) and demographic questions, but different validating scales. Booklet 1 contained the Multiple Sclerosis Impact Scale (MSIS-29; Hobart et al., 2001b) and the physical functioning (SF36PF) and mental health (SF36MH) subscales from the Short Form Health Survey (SF-36; Ware et al., 1993). Booklet 2 contained the mobility (FAMSmob) and emotional well-being (FAMSewb) subscales of the Functional Assessment of Multiple Sclerosis (FAMS; Cella et al., 1996), the postal Barthel Index (BI; Gompertz et al., 1994) and the General Health Questionnaire (GHQ-12; Goldberg and Hillier, 1979). Standard survey methods were used.

Analysis plans

All analyses described above were repeated. In addition, internal construct validity was examined by computing intercorrelations among subscales of the new spasticity instrument and by determining the ability of the subscales to detect differences between groups defined by their self-report spasticity grading. Convergent and discriminant construct validity was examined by determining the extent to which correlations between the new spasticity instrument and validating variables were consistent with expectation. These methods are described elsewhere (Hobart et al., 1996; Hobart et al., 2001a; Scientific Advisory Committee of the Medical Outcomes Trust, 2002).


Stage 1: conceptual model formation and item generation

Seventeen interviews (75% female; mean age 47 years) were conducted until no new information was extracted. There were three focus groups (71% female, mean age 54 years) that included a total of 14 people. Expert opinion was canvassed from neurologists, multiple sclerosis nurses, spasticity nurses and rehabilitation therapists. Content analysis of the interview and focus group transcripts generated ∼2000 statements concerning the impact of spasticity. These were extracted, grouped into main themes and examined for redundancy.

This qualitative work generated a preliminary conceptual model of spasticity impact and, on the basis of that model, a preliminary questionnaire with 144 items was developed. Three main domains (symptoms, physical functioning and psychosocial functioning) were identified, with a total of 8 subscales: muscle stiffness (19 items); pain and discomfort (10 items); muscle spasms (23 items); activities of daily living (ADL) (14 items); body movements (21 items); walking (15 items); emotional health (26 items) and social functioning (16 items). All items were given the same five-point response options: 1 = not at all bothered; 2 = a little bothered; 3 = moderately bothered; 4 = quite a bit bothered and 5 = extremely bothered.

Items were pre-tested in an independent sample of 17 out-patients and in-patients (NHNN) with varying levels of spasticity. Appropriate modifications were made and demographic questions were included in the booklet. At this early stage, all 144 items were retained and put into the most clinically appropriate grouping even though a number of items were considered non-specific indicators of that construct. For example, we put the item ‘bothered by heaviness anywhere in your limbs’ in the subscale concerning muscle stiffness, although we were unsure that it would be part of the final operationalization of that construct.

Stage 2: first postal survey


Questionnaire booklets were sent to 272 people, and 240 were returned completed (conservative response rate 88%). Table 1 shows the respondents' characteristics.

Table 1

Characteristics of survey samples

| Characteristic | First postal survey | Second postal survey |
|---|---|---|
| Booklets sent | 272 | 259 |
| Completed booklets returned: n (%) | 240 (88) | 202 (78) |
| Booklet 1 | N/A | 98 (48.5%) |
| Booklet 2 | N/A | 104 (51.1%) |
| Female: n (%) | 164 (68%) | 121 (63%) |
| Mean (SD); range | 53 (7.6); 32–67 | 54 (7.0); 35–68 |
| Duration of multiple sclerosis (since onset, patient report) | | |
| Mean (SD); range | 18 (8.9); 5–50 | 21 (8.6); 5–44 |
| Indoor mobility | | |
| Unaided | N/A | 10 (5.4%) |
| Aid | N/A | 78 (42.2%) |
| Wheelchair | N/A | 97 (52.4%) |
| Self-reported degree of spasticity | | |
| Minimal | 10 (7.6%) | 22 (11%) |
| Mild | 7 (5.3%) | 24 (13%) |
| Moderate | 35 (26.5%) | 60 (32%) |
| Severe | 80 (60.6%) | 83 (44%) |
| Ashworth score | | |
| Mean (SD); range | 22.2 (9.9); 2–52 | 19.4 (11.1); 1–57 |
  • N/A = not applicable as question not asked.

  • Ashworth score is the last score recorded in the CAMS study for that patient.

Rasch analysis

The main finding was that empirical analysis using the Rasch measurement model did not support the five-point item response option. Most items (132 of 144) had disordered thresholds implying the scoring function was not working as anticipated. The category probability curves (CPC), which plot subscale scores on the x-axis against the probability of endorsing each item response category on the y-axis, suggested the main reason for disordering was that patients could not discriminate reliably between the five response options. In particular, people appeared unable to reliably distinguish ‘a little’ from ‘moderately’, and ‘moderately’ from ‘quite a bit’.

Given this finding we undertook a post hoc analysis. First, we examined the effect of reducing the response options from five to four by combining ‘moderately’ with either ‘a little’ or ‘quite a bit’, as suggested by each item's CPC. This left seven items with disordered thresholds. With items re-scored in this manner, preliminary Rasch analyses of the hypothesized item groups were performed and supported the feasibility of constructing valid subscales for the eight components. However, post hoc analyses make assumptions about how people would have responded if a category had not been available. Therefore, in stage 3, we repeated the 144-item survey in an independent sample with a four-point item response option (1 = not at all; 2 = a little; 3 = moderately and 4 = extremely).
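The post hoc rescoring can be illustrated with two recode maps, one for each way of absorbing 'moderately'. The maps and the function name `rescore` are assumptions for this sketch; in the study the choice between them was made item by item from the category probability curves.

```python
# Five-point codes: 1 = not at all, 2 = a little, 3 = moderately,
# 4 = quite a bit, 5 = extremely. Both maps return four-point codes 1..4.
MERGE_WITH_A_LITTLE = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4}
MERGE_WITH_QUITE_A_BIT = {1: 1, 2: 2, 3: 3, 4: 3, 5: 4}


def rescore(responses, recode):
    """Apply a recode map to a list of responses; missing answers stay None."""
    return [None if r is None else recode[r] for r in responses]


# Hypothetical responses to one item, including one missing answer.
raw = [1, 2, 3, 4, 5, None]
print(rescore(raw, MERGE_WITH_A_LITTLE))
print(rescore(raw, MERGE_WITH_QUITE_A_BIT))
```

Recoding of this kind only approximates what respondents would have chosen from four options, which is the reason the survey was repeated with a four-point format.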

Stage 3: second postal survey

Sample characteristics

Questionnaire booklets were sent to 259 people. Random half samples received booklets 1 and 2. A total of 202 people returned completed questionnaire booklets (78% response rate). Table 1 shows their characteristics. In essence, this was an older sample of people with multiple sclerosis with moderate to long disease duration; half were wheelchair users, and most reported their spasticity to be moderate (32%) or severe (44%).

Scale development

The final decision as to which items should remain in each subscale was determined by assimilating the information from all seven criteria defined in the methods. A total of 56 items were deleted (mean per scale = 7; range 1–13). For example, 23 items were considered for inclusion in the muscle spasms subscale. Four of these items were eliminated because they were considered to be non-specific indicators of a continuum, from less to more, of the degree of muscle spasms. These items were: ‘juddering/jolting related to spasms’; ‘feet or legs bouncing up and down’; ‘spasms leading to difficulties greeting people with a handshake’ and ‘spasms leading to difficulty giving hugs’.

The remaining 19 items were entered into a Rasch analysis. Four of these items had reversed thresholds (‘spasms waking up your partner’; ‘spasms that are difficult to stop’; ‘spasms provoked by temperature change’ and ‘feeling that your knees are stuck together’) indicating that the four-point response option was not working as intended for these items. Although these items appear clinically important, they were removed because of the reversed thresholds and because other items in the set had similar locations and, therefore, they could be regarded as redundant in measurement terms. Another item, ‘spasms when transferring’, demonstrated DIF in different mobility groups. That is, this item had a different meaning for people with different levels of mobility (even though they had the same total score on the subscale) and was therefore unstable in measurement terms. This item was removed because of this problem, and also because it had a location similar to other items in the set, and could be regarded as redundant. The remaining 14 items appeared to constitute a clinically meaningful set, relating to muscle spasms, and satisfied the pre-determined criteria as a measurement instrument. Details of the complete instrument development process are available from the authors.

Scale validation

Tables 2–5 show, for all subscales, the item locations, standard errors and fit statistics (fit residuals and chi square values), and the subscale reliabilities. For each subscale, the item locations spread across a reasonable range of their continua, the standard errors were small, almost all log residuals lay within the recommended range of −2.5 to +2.5 and chi square statistics were small. All person separation indices were high (≥0.92). These findings support the reliability and validity of each MSSS-88 subscale. Table 8 shows the distributions of person measurements for each subscale. Scores spanned the full subscale range and floor and ceiling effects were less than the recommended maximum of 20% (McHorney and Tarlov, 1995). However, the three physical functioning subscales had larger floor effects than the other subscales (range 11–19.8%).
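Ceiling and floor effects can be computed as simple percentages. The sketch below follows the convention stated in the footnote to Table 8, where ceiling is the percentage scoring the minimum value (minimum disability) and floor is the percentage scoring the maximum value (maximum disability). The scores and the score range are invented for illustration.

```python
def ceiling_floor_percent(scores, minimum, maximum):
    """Return (ceiling %, floor %) using the Table 8 footnote convention.

    ceiling: percentage of people at the minimum possible score.
    floor: percentage of people at the maximum possible score.
    """
    n = len(scores)
    ceiling = 100 * sum(1 for s in scores if s == minimum) / n
    floor = 100 * sum(1 for s in scores if s == maximum) / n
    return ceiling, floor


# Hypothetical raw scores on a 12-item subscale scored 1-4 (range 12-48).
example_scores = [12, 12, 20, 30, 48, 48, 48, 25, 33, 40]
print(ceiling_floor_percent(example_scores, minimum=12, maximum=48))
```

Either percentage approaching the 20% criterion would indicate that a subscale cannot distinguish among people at that end of its range.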

Table 2

MSSS-88: muscle stiffness and pain and discomfort scales

Muscle stiffness (reliability* = 0.95)

| Item | Abbreviated item label | Location | SE | Fit residual | Chi square |
|---|---|---|---|---|---|
| 1 | When walking | − | | | |
| 2 | Anywhere—lower limbs | −0.92 | 0.12 | −1.73 | 9.15 |
| 3 | Same position long time | −0.64 | 0.11 | 1.53 | 2.48 |
| 4 | First thing in morning | −0.47 | 0.11 | −0.11 | 1.57 |
| 5 | Tightness anywhere in lower limbs | −0.46 | 0.11 | −0.28 | 1.03 |
| 6 | Lower limbs feeling rigid | −0.41 | 0.11 | −2.52 | 5.19 |
| 7 | When standing up | − | | | |
| 8 | Tightness in muscles | 0.14 | 0.12 | −0.25 | 0.57 |
| 9 | That is unpredictable | 0.65 | 0.11 | −1.85 | 3.78 |
| 10 | Feeling muscles pulling | 0.77 | 0.11 | 3.11 | 3.76 |
| 11 | In your whole body | 1.33 | 0.11 | 1.30 | 1.06 |
| 12 | Whole body feeling rigid | 1.38 | 0.11 | 0.61 | 0.60 |

Pain and discomfort (reliability 0.95)

| Item | Abbreviated item label | Location | SE | Fit residual | Chi square |
|---|---|---|---|---|---|
| 1 | Restricted and uncomfortable | −0.92 | 0.12 | 0.57 | 0.17 |
| 2 | Uncomfortable sitting-long time | −0.51 | 0.12 | −0.60 | 0.50 |
| 3 | Painful/uncomfortable spasms | −0.44 | 0.11 | −0.28 | 1.36 |
| 4 | Pain—when in same position for too long | −0.03 | 0.11 | −0.89 | 2.17 |
| 5 | Uncomfortable lying down for long time | 0.04 | 0.11 | −0.04 | 1.06 |
| 6 | Difficulties—comfortable position to sleep | 0.07 | 0.11 | 1.34 | 0.53 |
| 7 | Pain—muscles—getting out of bed-morning | 0.34 | 0.11 | −0.64 | 0.87 |
| 8 | Pain—muscles provoked by movement | 0.36 | 0.11 | −0.68 | 0.68 |
| 9 | Constant pain in muscles | 1.08 | 0.11 | 0.31 | 0.45 |
  • * Person separation index = true variance (total − error variance)/total variance.

Table 3

MSSS-88: muscle spasms and ADL scales

Muscle spasms (reliability 0.93)

| Item | Abbreviated item label | Location | SE | Fit residual | Chi square |
|---|---|---|---|---|---|
| 1 | Spasms—come on unpredictably | −0.65 | 0.11 | 0.48 | 2.20 |
| 2 | Powerful or strong spasms | −0.56 | 0.10 | −1.98 | 3.50 |
| 3 | Spasms—first getting out of bed—morning | −0.46 | 0.10 | −0.53 | 1.50 |
| 4 | Spasms provoked by changing positions | −0.42 | 0.11 | −2.68 | 7.28 |
| 5 | Provoked by movement | −0.25 | 0.11 | −1.77 | 9.10 |
| 6 | Where your legs kick out in front of you | − | | | |
| 7 | Provoked by certain positions | −0.21 | 0.10 | −0.46 | 3.02 |
| 8 | Spasms disturbing sleep | − | | | |
| 9 | When doing certain tasks | 0.09 | 0.10 | −1.90 | 7.50 |
| 10 | When travelling over bumps or cobbles | 0.20 | 0.10 | 2.49 | 7.32 |
| 11 | Where your knees pull up | 0.28 | 0.10 | 1.97 | 5.76 |
| 12 | Causing legs to hit things | 0.65 | 0.10 | 1.19 | 3.89 |
| 13 | Provoked by touch | 0.66 | 0.10 | 1.19 | 0.32 |
| 14 | Pushing you out of chair or wheelchair | 0.94 | 0.11 | −0.86 | 0.81 |

Activities of Daily Living (ADL; reliability 0.95)

| Item | Abbreviated item label | Location | SE | Fit residual | Chi square |
|---|---|---|---|---|---|
| 1 | Putting on your socks or shoes | −0.73 | 0.14 | −0.54 | 2.20 |
| 2 | Housework such as cooking/cleaning | −0.73 | 0.14 | 2.13 | 1.48 |
| 3 | Getting in and out of a car | −0.64 | 0.14 | −1.70 | 2.77 |
| 4 | Getting in and out of shower/bath | −0.41 | 0.14 | 0.87 | 1.01 |
| 5 | Sitting up in bed | −0.40 | 0.14 | 2.31 | 1.34 |
| 6 | Getting into or out of bed | 0.12 | 0.14 | −2.08 | 4.77 |
| 7 | Turning over in bed | 0. | | | |
| 8 | Getting into or out of a chair | 0.32 | 0.14 | −1.96 | 6.95 |
| 9 | Getting dressed or undressed | 0.46 | 0.14 | −1.52 | 1.38 |
| 10 | Getting on or off the toilet seat | 0.73 | 0.14 | −0.85 | 2.44 |
| 11 | Drying yourself with a towel | 1. | | | |

Table 4

MSSS-88: walking and body movements scales

Walking (reliability 0.96)

| Item | Abbreviated item label | Location | SE | Fit residual | Chi square |
|---|---|---|---|---|---|
| 1 | Difficulties walking smoothly | −0.92 | 0.19 | −1.62 | 4.56 |
| 2 | Being slow when walking | −0.86 | 0.19 | −0.43 | 0.81 |
| 3 | Having to concentrate on your walking | −0.74 | 0.19 | −1.06 | 6.20 |
| 4 | Having to increase effort to walk | −0.62 | 0.18 | −2.34 | 3.17 |
| 5 | Being slow going up and down stairs | −0.31 | 0.17 | 2.84 | 3.99 |
| 6 | Being clumsy when walking | −0.17 | 0.18 | −0.94 | 2.38 |
| 7 | Tripping over/stumbling when walking | − | | | |
| 8 | Feeling like walking through treacle | 0.76 | 0.17 | 0.39 | 0.78 |
| 9 | Losing your confidence to walk | 1.02 | 0.16 | 0.41 | 3.05 |
| 10 | Feeling embarrassed to walk | 1.85 | 0.15 | 1.42 | 6.36 |

Body movement (reliability 0.96)

| Item | Abbreviated item label | Location | SE | Fit residual | Chi square |
|---|---|---|---|---|---|
| 1 | Difficulties moving freely | −0.96 | 0.14 | −2.59 | 6.25 |
| 2 | Difficulties moving smoothly | −0.89 | 0.14 | −1.42 | 3.76 |
| 3 | Limited range of movement | −0.60 | 0.13 | −1.69 | 3.54 |
| 4 | Difficulties moving parts of your body | −0.55 | 0.13 | −0.43 | 1.19 |
| 5 | Difficulties bending your limbs | − | | | |
| 6 | Your body being resistant to movement | 0.09 | 0.13 | −1.42 | 2.87 |
| 7 | Your body or limbs feeling locked | 0.36 | 0.12 | 0.68 | 0.55 |
| 8 | Awkward or jerky movement | 0.39 | 0.13 | 2.52 | 3.31 |
| 9 | Difficulties straightening your limbs | 0.47 | 0.13 | 1.48 | 2.95 |
| 10 | Difficulties relaxing parts of your body | 0.50 | 0.13 | 1.08 | 0.09 |
| 11 | No control over your body | 1.19 | 0.13 | 0.76 | 0.66 |

Table 5

MSSS-88: emotional health and social functioning scales

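Emotional health (reliability 0.96)

| Item | Abbreviated item label | Location | SE | Fit residual | Chi square |
|---|---|---|---|---|---|
| 1 | Feeling frustrated | −1.55 | 0.12 | 0.10 | 0.63 |
| 2 | Feeling less confident in yourself | −0.68 | 0.11 | 1.91 | 1.89 |
| 3 | Feeling inadequate | −0.50 | 0.10 | −0.90 | 0.90 |
| 4 | Feeling low | −0.45 | 0.11 | −1.17 | 2.57 |
| 5 | Feeling irritated | −0.32 | 0.11 | −1.71 | 1.83 |
| 6 | Feeling angry | − | | | |
| 7 | Feeling depressed | 0.04 | 0.11 | −0.76 | 2.85 |
| 8 | Loss of self-worth | 0.12 | 0.11 | −1.73 | 4.48 |
| 9 | Feeling like a failure | 0.36 | 0.11 | −0.13 | 1.29 |
| 10 | Feeling frightened | 0.47 | 0.11 | −0.96 | 1.83 |
| 11 | Crying (tearful) | 0.75 | 0.11 | 1.43 | 1.53 |
| 12 | Feeling panicky | 0.86 | 0.11 | −0.14 | 0.07 |
| 13 | Feeling nervous | 0.98 | 0.12 | 0.72 | 4.36 |

Social functioning (reliability 0.95)

| Item | Abbreviated item label | Location | SE | Fit residual | Chi square |
|---|---|---|---|---|---|
| 1 | Difficulties going out | −0.69 | 0.10 | 0.85 | 0.80 |
| 2 | Feeling isolated | −0.55 | 0.10 | 0.93 | 0.39 |
| 3 | Feeling vulnerable | −0.32 | 0.10 | 2.53 | 4.18 |
| 4 | Difficulties finding energy for others | − | | | |
| 5 | Feeling reluctant to go out | − | | | |
| 6 | Feeling less sociable | 0.07 | 0.10 | −1.57 | 3.79 |
| 7 | Difficult family relationships | 0.82 | 0.11 | −1.03 | 7.18 |
| 8 | Difficulties interacting with people | 0.83 | 0.11 | −1.99 | 7.32 |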

Correlations among subscales, and with other measures and variables. Tables 6 and 7 show correlations among MSSS-88 subscales, and between MSSS-88 subscales, validating instruments and descriptive variables. The magnitude and pattern of these correlations were generally consistent with expectations based on the constructs perceived to be measured by the instruments. This provides further evidence, albeit circumstantial, for the validity of the MSSS-88. For example, correlations among MSSS-88 subscales range from 0.35 to 0.83 (12–69% shared variation), implying the eight subscales measured related but discrete constructs. Correlations between MSSS-88 subscales and the seven validating instruments collected at the same point in time were broadly consistent with expectation. For example, the MSSS-88 emotional health and social functioning subscales correlated most highly with the MSIS-29 psychological impact subscale, SF-36 MH subscale and GHQ-12.
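The shared variation quoted for the subscale correlations is the squared correlation coefficient expressed as a percentage. The short sketch below reproduces the two bounds given in the text; the function name is an assumption for illustration.

```python
def shared_variance_percent(r):
    """Percentage of variance two measures share, given their correlation r."""
    return 100 * r ** 2


lowest = shared_variance_percent(0.35)
highest = shared_variance_percent(0.83)
print(round(lowest), round(highest))
```

Even the highest correlation leaves roughly a third of the variance unshared, which is the basis for treating the subscales as related but distinct.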

Table 6

Correlations among MSSS-88 scales, and between MSSS-88 scales and other variables

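| MSSS-88 scale | Muscle stiffness | Pain and discomfort | Muscle spasms | ADL | Walking | Body movements | Emotional health | Social functioning |
|---|---|---|---|---|---|---|---|---|
| Muscle stiffness | 1.0 | | | | | | | |
| Pain and discomfort | 0.76 | 1.0 | | | | | | |
| Muscle spasms | 0.75 | 0.76 | 1.0 | | | | | |
| Body movements | 0.75 | 0.67 | 0.68 | 0.65 | 0.77 | 1.0 | | |
| Emotional health | 0.53 | 0.50 | 0.46 | 0.41 | 0.71 | 0.64 | 1.0 | |
| Social functioning | 0.49 | 0.42 | 0.39 | 0.38 | 0.68 | 0.56 | 0.82 | 1.0 |
| Other variables | | | | | | | | |
| Degree of spasticity* | 0.44 | 0.36 | 0.36 | 0.38 | 0.40 | 0.53 | 0.39 | 0.46 |
| Indoor mobility | 0. | | | | | | | |
| Duration of multiple sclerosis | −0.01−−0.42−0.02−0.06−0.09 | | | | | | | |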
  • * Grading of degree of spasticity: 0 = minimal; 1 = mild; 2 = moderate; 3 = severe.

  • Indoor mobility grading: 1 = walks unaided; 2 = walks with an aid; 3 = uses a wheelchair.

Table 7

MSSS-88: correlations with validating scales

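| Scale | Muscle stiffness | Pain and discomfort | Muscle spasms | ADL | Walking | Body movements | Emotional health | Social functioning |
|---|---|---|---|---|---|---|---|---|
| MSIS-29 physical | 0.70 | 0.60 | 0.51 | 0.61 | 0.69 | 0.77 | 0.63 | 0.65 |
| MSIS-29 psychological | 0.58 | 0.56 | 0.50 | 0.34 | 0.60 | 0.55 | 0.79 | 0.77 |
| SF-36 PF | 0.38 | 0.35 | 0.40 | 0.72 | 0.55 | 0.55 | 0.31 | 0.29 |
| SF-36 MH | 0.45 | 0.44 | 0.32 | 0.33 | 0.53 | 0.45 | 0.75 | 0.77 |
| FAMS mob | 0. | | | | | | | |
| FAMS ewb | 0.35 | 0.33 | 0.36 | 0.44 | 0.62 | 0.52 | 0.81 | 0.73 |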
  • mob = mobility; ewb = emotional well-being.

Table 6 also shows the correlations between MSSS-88 subscales, self-report degree of spasticity, indoor mobility level, duration of multiple sclerosis, age and gender. The majority of correlations were consistent with expectation. For example, correlations with age and gender were low (range −0.15 to +0.09), whereas those with a four-point patient-reported grading of spasticity severity (minimal, mild, moderate and severe) were moderate (0.35–0.51).

Group differences validity. Table 8 reports the mean MSSS-88 locations for people who graded their spasticity as minimal/mild, moderate or severe. All subscales demonstrated a stepwise and statistically significant increase in mean value associated with increasing self-reported spasticity severity.

Table 8

MSSS-88: Group difference and relative validity

| Scale | Person measures mean (SD); range | Percentage ceiling/floor effect | Patient-reported spasticity severity: minimal/mild (n = 43–46) | Patient-reported spasticity severity: moderate (n = 59–60) | Patient-reported spasticity severity: severe (n = 81–83) | F(p) |
|---|---|---|---|---|---|---|
| Muscle stiffness | +0.69 (2.04); −4.63 to +5.12 | 1.5/1.5 | −0.40 | +0.43 | +1.62 | 19.2* |
| Pain/discomfort | +0.11 (2.17); −4.55 to +4.23 | 5.0/8.5 | −0.80 | −0.20 | +0.94 | 12.3* |
| Muscle spasms | −0.46 (1.77); −4.38 to +4.31 | 4.0/1.5 | −1.25 | −0.81 | +0.27 | 14.6* |
| ADL | −1.45 (2.74); −4.92 to +5.21 | 3.1/17.2 | +0.18 | +1.01 | +2.61 | 14.8* |
| Walking | +1.72 (2.34); −5.35 to +4.61 | 1.7/19.8 | +0.37 | +1.84 | +2.66 | 9.4* |
| Body movements | +1.12 (2.45); −5.16 to +5.05 | 3.0/11.2 | −0.57 | +0.91 | +2.56 | 33.5* |
| Emotional health | −0.53 (2.18); −4.93 to +4.28 | 7.5/2.5 | −1.57 | −1.07 | +0.40 | 17.5* |
| Social function | −0.59 (1.72); −3.54 to +3.66 | 8.0/3.0 | −1.71 | −0.84 | +0.19 | 23.2* |
  • F = ratio of between-groups to within-groups variance; ceiling = % scoring minimum value = minimum disability; floor = % scoring maximum value = maximum disability.

  • * P < 0.001.

  • Mean person measure for subgroup defining their spasticity as minimal or mild.

  • Walking scale analyses only involved people who could walk hence n's smaller: 26, 45 and 39 respectively.
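The F statistic in Table 8 is defined in its footnote as the ratio of between-groups to within-groups variance. A minimal sketch of that calculation for a one-way design is given below. The three groups of person measures are invented, and the function is an illustration of the standard one-way ANOVA formula rather than the routine used to produce the table.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F ratio = between-groups mean square / within-groups mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-groups sum of squares: group size times squared distance
    # of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical person measures (logits) for three severity groups.
minimal_mild = [-1.0, 0.0, 1.0]
moderate = [0.0, 1.0, 2.0]
severe = [2.0, 3.0, 4.0]
print(round(one_way_anova_f([minimal_mild, moderate, severe]), 2))
```

A larger F indicates that the severity groups differ more in their mean subscale location relative to the spread of scores within each group.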


The aim of this study was to develop a scale for measuring patients' perceptions of the impact of spasticity in multiple sclerosis. The resulting instrument, the 88-item Multiple Sclerosis Spasticity Scale (MSSS-88; Appendix see supplementary material), attempts to quantify the impact of spasticity in eight clinically relevant areas: three spasticity specific symptoms (muscle stiffness, pain and discomfort and muscle spasms), three areas of physical functioning (ADL, walking, body movements), emotional health and social functioning. This patient-derived model of the impact of spasticity highlights the complexity of an apparently unidimensional clinical concept. It can also be viewed as a framework for evidence-based management in the development of integrated care pathways, guidelines for care (Multiple Sclerosis Council for Clinical Practice Guidelines, 2003) and service development.

Does the MSSS-88 have too many items to be clinically useful and are the specialized skills required for Rasch analysis justified? Three questions underpin these concerns. First, why are there so many items? Second, what evidence is available and what mechanisms are in place to ensure clinical usability? Third, do the clinical advantages of Rasch analysis outweigh the necessity for specialized knowledge and software?

First, the reason there are so many items is the need for breadth, range and precision of measurement. The qualitative phase of the study identified that the clinically appropriate breadth of measurement was eight subscales. For each of these eight subscales, adequate measurement range requires the two most extreme scale items to be well separated. Measurement precision is determined by the number of units into which the range is divided and is defined mainly by the number of items of the scale. In addition, the number of items in a scale determines the confidence intervals around an individual patient's estimate (Wright and Masters, 1982), and a reasonable number of items is required to ‘anchor’ the construct measured by any scale. As clinicians and clinical trials require scales that give precise estimates of people's locations on the continuum and are also able to detect clinically significant change, scales need to have a reasonable number of items located at regular intervals across a substantial range. Thus, at this stage of scale development, we were reluctant to reduce the number of items further.

Second, there is evidence from our work, and that of others, that 88 items is not too many for clinical usefulness. The two postal surveys in this study, our surveys in non-trial samples of multiple sclerosis and other neurological disorders, and the work of others in health measurement have consistently demonstrated high response rates and data completeness despite large numbers of items. Nevertheless, we are aware that the MSSS-88 may be used as one of many outcomes and we have, therefore, constructed it to be flexible and adaptable to meet different measurement needs. For example, a clinical trial may be interested primarily in the impact of a treatment on spasticity symptoms. In contrast, a clinical service may be more interested in functional outcomes. Here, we recommend the use of the most appropriate MSSS-88 subscale(s) to address the measurement question. This is possible because each subscale is a stand-alone measurement instrument.

In other situations, such as the evaluation of a busy multidisciplinary spasticity service, clinicians or researchers may wish to measure all eight areas but feel that 88 items is too many. Here, we recommend the use of a short-form version of the MSSS-88, that is, a selection of the most clinically appropriate items from each subscale. For example, users could choose items 1, 3, 5, 8, 10, 13 and 14 from the Muscle Spasms subscale to give a seven-item short form whose item locations are evenly spread across the continuum. This approach is possible because the item locations of each subscale are calibrated with respect to each other. Consequently, investigators can use any subset of items from any subscale and generate results that are referable to the long-form version of that scale (Choppin, 1968; Wright, 1977). It is, however, important to be aware that scales with few items have limited precision (unless their range is very restricted) and are less able to detect small but clinically meaningful change.

Third, Rasch analysis offers four clinically meaningful scientific advantages that we believe far outweigh concerns about the necessity for specialized knowledge and software: (i) it offers clinicians the ability to construct interval-level measurements from ordinal-level rating scale data, thereby addressing a major concern of using rating scales as outcome measures (Whitaker et al., 1995; Platz et al., 2005); (ii) it enables clinicians to obtain estimates suitable for individual person analyses rather than only for group comparison studies; (iii) it enables clinicians to use subsets of items from each subscale rather than all items from the scale (detailed above), yet still be able to compare scores using different sets of items; (iv) missing item data can be handled scientifically, rather than on the basis of assumption, because Rasch analysis computes an estimate from the available data rather than requiring missing data to be imputed.
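
Advantages (iii) and (iv) can be illustrated with a deliberately simplified sketch. It assumes the dichotomous Rasch model (items scored 0 or 1) and item locations that are already known, whereas the MSSS-88 items have several ordered response categories and were analysed with a polytomous model, so this is not the procedure used in the study. The function name, the Newton-Raphson estimator and the step limit are illustrative choices.

```python
import math
from typing import Optional, Sequence, Tuple


def estimate_person_location(
    responses: Sequence[Optional[int]],
    item_locations: Sequence[float],
    iterations: int = 50,
) -> Tuple[float, float]:
    """Maximum-likelihood person location (logits) and its standard error.

    Assumes dichotomous (0/1) responses and fixed, pre-calibrated item
    locations. None marks an item that was not administered or not answered.
    """
    # Use only the answered items; nothing is imputed.
    answered = [(x, d) for x, d in zip(responses, item_locations) if x is not None]
    raw_score = sum(x for x, _ in answered)
    if not 0 < raw_score < len(answered):
        raise ValueError("extreme or empty response pattern has no finite estimate")

    def expected_and_information(theta: float) -> Tuple[float, float]:
        probs = [1.0 / (1.0 + math.exp(-(theta - d))) for _, d in answered]
        return sum(probs), sum(p * (1.0 - p) for p in probs)

    theta = 0.0
    for _ in range(iterations):
        expected, information = expected_and_information(theta)
        # Newton-Raphson step: observed minus expected score over information,
        # limited to one logit per iteration to avoid overshooting.
        step = (raw_score - expected) / information
        theta += max(-1.0, min(1.0, step))

    _, information = expected_and_information(theta)
    return theta, 1.0 / math.sqrt(information)
```

Because the item locations are fixed, a person who answers only a subset of items is still located on the same logit scale; the cost of fewer items appears as a larger standard error, which is the loss of precision noted above for short forms.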

Nevertheless, Rasch analysis appears complicated, is not widely used, and there are few clinicians and researchers trained in its use and interpretation. For this reason we offer three methods for computing MSSS-88 subscale scores. In the first method, item scores can be summed, without weighting or standardization, to generate ordinal-level total scores just as for any other Likert-type scale. Missing responses to items can be replaced with the mean score of the items completed (person-specific item mean score) provided that 50% or more of the items in a scale have been completed (Ware et al., 1993). In the second method of computing MSSS-88 subscale scores, the ordinal summed scores generated above can be transformed into interval-level measurements using conversion tables that can be made available with the scales. In the third method of computing MSSS-88 scores, investigators can Rasch analyse their own data. Furthermore, if these analyses use (anchor) the item and threshold locations from our dataset, available on request, people in the new sample will be measured on an identical interval-level metric to the one we have constructed.
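
The first scoring method, including the rule for missing responses, can be sketched as below. The function name is hypothetical and the response values in the usage note are illustrative only. The conversion tables needed for the second method are not reproduced here, so that step is not shown.

```python
from typing import Optional, Sequence


def summed_subscale_score(responses: Sequence[Optional[int]]) -> Optional[float]:
    """Ordinal summed score for one subscale; None marks a missing response.

    Missing items are replaced by the person-specific mean of the completed
    items, provided that at least 50% of the items were completed.
    Returns None when the score cannot be computed under that rule.
    """
    completed = [r for r in responses if r is not None]
    # Require 50% or more of the subscale's items to be completed.
    if not responses or 2 * len(completed) < len(responses):
        return None
    item_mean = sum(completed) / len(completed)
    n_missing = len(responses) - len(completed)
    # Unweighted sum of completed items plus the imputed values.
    return sum(completed) + item_mean * n_missing
```

For example, `summed_subscale_score([1, 2, None, 3])` imputes the missing item as 2.0 and returns 8.0, whereas a pattern with three of four items missing returns `None`.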

Is the use of ordinal summed MSSS-88 scores justified and an advance over another ordinal scale? It is justified because the very use of integer and total scores depends on the data conforming to the Rasch model (Andrich, 1978). Whether or not they do is an empirical question that is answered by a Rasch analysis. In many situations scores on items are simply summed, but it is usually done by assumption without checking thoroughly that it is a valid procedure and whether it is justified to use integer scores for the successive item response categories. Such thorough checks cannot be achieved using traditional psychometric methods (Massof, 2002). This is one advance over other ordinal scales; another is the linearization of scores that follows directly from the model.

The evidence that Rasch analysis transforms ordinal summed scores into interval-level measurements lies in the properties of the mathematical model (Rasch, 1961; Wright and Stone, 1979; Andrich, 1988). Effectively, the difference between (comparison of) any two people, any two items, or any one person and one item, is defined by the logarithm of the relative probabilities. In essence, albeit an oversimplification, the observed scores in the data matrix are replaced by the expected probabilities of occurrence, and relative differences computed as ratios of the relative probabilities (as these are consistent indicators of relative differences). This ratio of the relative probabilities is then expressed on a linear scale in an additive form by taking logarithms. In addition, it can be proven mathematically that the summed score is the sufficient statistic for estimating the item and person locations, and that these locations can be estimated independently of each other. As such, the Rasch model is able to transform summed scores into linear measures of persons and items that are on the same scale with a common unit and freed up from the distributional properties of each other. Thus the Rasch model realizes, mathematically, the requirements for scientific measurement (Rasch, 1961; Massof, 2002; Andrich, 2004): invariant comparisons of people, and items, on the same linear scale.
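
For the simplest (dichotomous) form of the model, the relationships described above can be written out as follows. The symbols are notation chosen here for illustration: β_n is the location of person n, δ_i is the location of item i, and X_ni is the response (0 or 1) of person n to item i. The MSSS-88 items have several ordered response categories, so the model with thresholds used in this study is more elaborate than this sketch.

```latex
\begin{align}
P(X_{ni}=1) &= \frac{\exp(\beta_n-\delta_i)}{1+\exp(\beta_n-\delta_i)}, \\
\ln\frac{P(X_{ni}=1)}{P(X_{ni}=0)} &= \beta_n-\delta_i, \\
\ln\frac{P(X_{ni}=1)}{P(X_{ni}=0)} - \ln\frac{P(X_{mi}=1)}{P(X_{mi}=0)} &= \beta_n-\beta_m .
\end{align}
```

The second line expresses the comparison of one person with one item as a log odds on an additive scale. The third line shows that the comparison of two persons n and m does not depend on which item i is used, which is the sense in which the comparisons are invariant.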

Unfortunately, Rasch analysis is applicable only to multiple-item scales, not single-item scales such as the Expanded Disability Status Scale (EDSS), Rankin scale, Hauser Ambulation Index, EDMUS and Ashworth scale. It may be possible to construct interval measurements from the Kurtzke Functional Systems if data satisfy the requirements of the Rasch model, and it is considered clinically meaningful to combine scores across the eight systems. Of note, Kurtzke did not think this was clinically appropriate (Kurtzke, 1961).

It is important to clarify what is implied by scores and score differences. First, in estimating the linear person measurements, it is not implied that, say, pain at one part of the continuum is twice that at another part of the continuum. That would require a natural zero point. Second, it is entirely possible that at different parts of the same continuum, there may be some qualitatively different responses or reactions. By analogy, in heating water more and more quantitatively, there is a qualitative reaction as it starts to boil. The investigation of possible qualitative differences on the continuum is for further clinical study having constructed a quantitative scale.

In this study we took an approach to scale development that is recommended strongly (Andrich, 2002a, b) but differs somewhat from the approach adopted by others. Specifically, we first developed and used a conceptual model to define the areas for subscale development and then used an explicit mathematical model to guide subscale development in each of these areas. It is more typical for developers of health rating scales to use statistical techniques, such as factor analysis of an item pool, to define the areas for subscale development, and then traditional psychometric methods of testing reliability and validity to refine those subscales. Using factor analysis to group items can be misleading as it partitions items according to their intercorrelations and, thus, makes the assumption that correlations between items indicate the extent to which they measure the same thing. This is an oversimplification (Duncan, 1985). Furthermore, factor analysis is strongly influenced by sample sizes (Nunnally and Bernstein, 1994), the number and type of items analysed, and the targeting of items and persons. The advantage of using an explicit mathematical model to guide scale construction is that it enables sophisticated checks on the internal validity and consistency of the scores, as well as the construction of stable linear measurement systems (Wright, 1977; Linacre et al., 1994; Massof, 2002; Andrich, 2004).

The three physical subscales of the MSSS-88 have larger floor effects than the other subscales, implying that it may be beneficial to extend their range of measurement in the future. This can be achieved without affecting the subscales as they stand, because the item locations are calibrated relative to each other. Furthermore, future scale developments can be empirically driven. The distribution of the relative item locations (Tables 2–5) shows the ‘gaps’ in each subscale (notable distances between item locations), which developers may wish to fill, and the distribution of person measurements shows when it may be valuable to extend the continuum in either direction. Tables 2–5 indicate the nature of items either side of the gaps, and those that define upper and lower limits of measurement. Consequently, potential items that are appropriate to those locations can be generated and examined in small samples to get reasonably accurate preliminary calibrations before needing to undertake more definitive surveys.

This study has limitations. The use of a controlled clinical study population, perhaps more motivated than the general multiple sclerosis population, might give a false impression of the utility of an 88-item scale in everyday practice. We discussed earlier circumstantial evidence that this may not be the case. Other limitations are that this work contributes little to our understanding of the relationship between self-assessment and objective clinical findings in spasticity, and the potential implications of study medications on the findings. The CAMS study finished 12–18 months before MSSS-88 measurements were collected, so concurrent Ashworth measurements were not available and all patients were off study medication. Nevertheless, we predict low correlations between these two instruments (e.g. <0.50) because they quantify very different clinical manifestations of spasticity and capture different people's perspectives. The Ashworth scale is a clinician-based evaluation of a clinical sign (muscle tone) at rest, whereas the MSSS-88 is a subjective assessment of day-to-day spasticity symptoms and functional impact in eight clinically separate areas. Consequently, the two instruments should be viewed as complementary, not competing, outcome measures for clinical trials.

Measurement of different manifestations also underpins the finding of low correlations between neurophysiological, biomechanical and clinical indicators of spasticity. Some interpret these findings with surprise or disappointment (Wood et al., 2005), or question the validity of one or other or both measurement methods. However, low correlations between different indicators of spasticity are predictable and appropriate, and have a number of important implications. They highlight that the selection of outcome measures underpins the meaningful interpretation of studies, that a range of carefully selected and complementary outcomes may need to be measured, that measures of one entity are unlikely to be adequate surrogate markers of another (e.g. MRI and disability) and that relying on correlations to validate scales can be limited.

The MSSS-88 represents an attempt to conceptualize and measure the impact of a complex neurological problem from the patients' perspective. Qualitative research, clinical experience and sophisticated measurement methods have been integrated to develop a scale that complements existing methods of evaluating spasticity in multiple sclerosis. It has great potential to advance the measurement and, thereby, management of this disabling clinical problem. Further examination and responsiveness testing are now required to understand the clinical meaning of MSSS-88 scores and score changes, and evaluations of the instrument in people with non-multiple sclerosis spasticity will determine its applicability as a generic instrument.

Supplementary material

For a copy of the MSSS-88 scale, see Supplementary material at Brain online.


The authors wish to thank the patients who participated and Professor David Andrich for his input. Dr Hobart's recent sabbatical in Australia was supported by the Royal Society of Medicine (Ellison-Cliffe Travelling Fellowship), the Peninsula Medical School, the Multiple Sclerosis Society of Great Britain and Northern Ireland, and the NHS Health Technology Assessment Programme. However, the views and opinions expressed are not necessarily those of the NHS Executive. The Neurological Outcome Measures Unit is supported by the De Lazlo Foundation.

