Brain Advance Access published online on May 18, 2008
Brain, doi:10.1093/brain/awn081
Review Article
The complex genetics of multiple sclerosis: pitfalls and prospects
University of Cambridge, Department of Clinical Neurosciences, Addenbrooke's Hospital, Hills Road, Cambridge, CB2 2QQ, UK
Correspondence to:
Stephen Sawcer, University of Cambridge, Department of Clinical Neurosciences, Addenbrooke's Hospital, Hills Road, Cambridge, CB2 2QQ, UK. E-mail: sjs1016{at}mole.bio.cam.ac.uk
| Summary |
|---|
The genetics of complex disease is entering a new and exciting era. The exponentially growing knowledge and technological capabilities emerging from the human genome project have finally reached the point where relevant genes can be readily and affordably identified. As a result, the last 12 months have seen a virtual explosion in new knowledge, with reports of unequivocal association to relevant genes appearing almost weekly. The impact of these new discoveries in neuroscience is incalculable at this stage but potentially revolutionary. In this review, an attempt is made to illuminate some of the mysteries surrounding complex genetics. Although focused almost exclusively on multiple sclerosis, all the points made are essentially generic and apply equally well, with relatively minor addenda, to any other complex trait, neurological or otherwise.
Key Words: multiple sclerosis; genetics; association; linkage
Abbreviations: GWAS, Genome-Wide Association Study; HLA, human leucocyte antigens; IL2R, interleukin-2 receptor; IL7R, interleukin-7 receptor; IMAGEN, International MHC and Autoimmunity Genetics Network; IMSGC, International Multiple Sclerosis Genetics Consortium; LD, linkage disequilibrium; MHC, major histocompatibility complex; MAF, minor allele frequency; NIMH, National Institute of Mental Health; nsSNP, non-synonymous single nucleotide polymorphism; RAF, risk allele frequency; SNP, single nucleotide polymorphism; WTCCC, Wellcome Trust Case Control Consortium
Received March 10, 2008. Revised March 27, 2008. Accepted April 2, 2008.
| Evidence for the influence of genetics |
|---|
Most, if not all, common diseases are characterized by an increased frequency in the relatives of affected individuals, and multiple sclerosis is no exception. Amongst white individuals living in temperate regions, the prevalence of multiple sclerosis is typically 1/1000, yet 15–20% of patients report a family history of the disease, a rate which is significantly more than would be expected by chance (Compston and Coles, 2002). This familial clustering is conventionally summarized by λs; the relative risk of the disease seen in the siblings of affected individuals as compared to that seen in the general population (Risch, 1990), which for multiple sclerosis is approximately 15 (Sawcer, 2006).
The epidemiology of multiple sclerosis continues to be scrutinized and interesting nuances will no doubt continue to emerge. However, it is important to recognize the limitations of this approach. These studies are extremely difficult to perform and frequently subject to confounding and bias. Moreover, although multiple sclerosis is a common disease in neurological practice, the relatively modest frequency of the disease in the population as a whole (1/1000) means that even huge population-based studies can only provide crude estimates for recurrence risks and other basic epidemiological parameters, such as age-specific incidence rates and lifetime risk, all of which come with large confidence intervals. It seems likely that most, if not all, of any apparent inconsistencies in the epidemiology of multiple sclerosis stem from the variability inherent in underpowered studies. In light of these considerations, it is easy to see why predictable effects, such as the Carter effect, have only inconsistent support (Hupperts et al., 2001; Ebers et al., 2004; Kantarci et al., 2006; Herrera et al., 2007) (see Supplementary material section 1). In a complex disease like multiple sclerosis, where epidemiological parameters are impossible to measure reliably and multiple potentially conflicting effects are likely to exist, it seems unlikely that epidemiological analysis will ever provide any major insights. In short, epidemiological analysis has convincingly shown us that genetic factors are relevant but lacks any power to illuminate the nature or extent of these factors beyond indicating that multiple genes are involved (Wang et al., 2005; Lindsey, 2005).
| Early success |
|---|
Initial attempts to identify genes influencing susceptibility to multiple sclerosis were highly successful and quickly identified the now well-established relevance of the Major Histocompatibility Complex (MHC). Unfortunately, this early success has been followed by several decades of frustration in which no other undeniably relevant loci emerged until 2007. Before contemplating why it has been so difficult to map non-MHC loci, it is worth revisiting the MHC association, its discovery and subsequent dissection.
Association between the Human Leukocyte Antigens (HLA) and multiple sclerosis was first identified in 1972. Using cell culture-based methods, researchers from California found association with HLA-A3 (Naito et al., 1972) while others from Denmark found association with HLA-B7 (Jersild et al., 1972). The following year the same Danish group also established association with DR2 (Jersild et al., 1973). The nomenclature used to describe HLA is complex and has evolved considerably over the years. At the time of these original discoveries very different designations were used, such that the phenotypes associated with the HLA-A3, HLA-B7 and DR2 alleles were known respectively as HL-A3, HL-A7 and LD-7a. It was quickly realized that these were not independent associations but rather a reflection of linkage disequilibrium (LD, see Supplementary material section 2) between the corresponding alleles and association of the disease with a haplotype including these alleles (Compston et al., 1976; Terasaki et al., 1976). The molecular genetic dissection of these associations began in 1984 when Cohen et al. (1984) used analysis of restriction fragment length polymorphisms to directly establish association with the HLA-DR2 allele. Over the years, technology has improved and the resolution of the associated alleles has been refined (Vartdal et al., 1989; Olerup and Hillert, 1991).
All of these associated HLA genes lie in the MHC, a gene-dense region of the genome characterized by extensive LD and extreme levels of polymorphism (Horton et al., 2004). In light of these features, it is not surprising to find that many variants from other genes in this region also show association with multiple sclerosis (Lincoln et al., 2005; Yeo et al., 2007). The modest levels of LD between the class I region (containing the HLA-A and HLA-B genes) and the class II region (containing the DRB1 and DQB1 genes) enabled researchers to quickly establish that association primarily derived from the class II region (Compston et al., 1976; Terasaki et al., 1976). However, the more extensive LD between DRB1 and DQB1 made it much more difficult to refine which of these genes was primarily responsible for the association. Studying African American patients, who have less intense LD between DRB1*1501 and DQB1*0602, Oksenberg et al. (2004) provided the first convincing evidence that the primary association was with the DRB1 gene, an observation which has been confirmed in subsequent studies in large cohorts of patients of European descent (Yeo et al., 2007). Because of further evolution in the nomenclature of HLA genes, what was previously called DR2 is these days referred to as DR15; the DRB1*1501 allele is the most common subtype of DR15 seen in white Europeans.
Looking back at some of these original studies in light of current knowledge is highly informative. Even though the extent of linkage disequilibrium between HLA-A3 and HLA-DRB1*1501 is modest (D' = 0.3, r² = 0.14), the original study by Naito et al. (1972), which included 94 cases and 871 controls, had >50% power to identify association with A3 at the 5% level. This early study thus illustrates well the principle that genuine associations can indeed be identified by typing markers in LD with real effects even when the level of LD is modest. Of course the saving grace for Naito et al. (1972) was the strength of the association with the DRB1*1501 allele and their use of a large cohort of controls. In the study by Compston et al. (1976), a class II locus was considered directly and nominally significant association was confirmed using just 83 cases and 32 controls.
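The two LD measures quoted above can be computed from haplotype and allele frequencies. The following is a minimal sketch using the standard definitions of D, D' and r² for two biallelic loci; the function name and the example frequencies are hypothetical and are not taken from any of the studies cited.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    d = p_ab - p_a * p_b
    # D' rescales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical frequencies, chosen only to illustrate the calculation.
d, d_prime, r2 = ld_measures(p_ab=0.10, p_a=0.30, p_b=0.20)
print(f"D={d:.3f}  D'={d_prime:.3f}  r2={r2:.3f}")
```

D' reflects how far a pair of loci is from complete association given their allele frequencies, whereas r² is the quantity that governs how much power is lost when a marker is typed in place of the causative variant.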
It is now well established that the association of multiple sclerosis with the DRB1*1501 allele is almost ubiquitous, the relevance of this allele having been confirmed in virtually every population tested (Compston et al., 2006). The fact that other MHC haplotypes also influence susceptibility is well established (Marrosu et al., 1998) and recent data indicate that the risk associated with *1501 may be modified depending upon which other MHC haplotype is carried in the heterozygous state (Dyment et al., 2005; Barcellos et al., 2006; Ramagopalan et al., 2007). However, it is unclear whether these additional signals stem primarily from the DRB1 gene or from the effects of other MHC loci. Many researchers have found evidence supporting the existence of an independent signal from the class I region (Fogdell-Hahn et al., 2000; Marrosu et al., 2001; de Jong et al., 2002; Rubio et al., 2002; Harbo et al., 2004; Yeo et al., 2007) although not all (Lincoln et al., 2005). Again, this apparent inconsistency is not unexpected. Establishing the presence of additional susceptibility loci located close to a primary locus is complex, especially in the presence of prominent LD and likely allelic heterogeneity (Koeleman et al., 2000). Once correction for the effects of LD with the DRB1*1501 allele is made, the residual power in even the largest of these studies is modest (Dyment et al., 2005; Lincoln et al., 2005; Barcellos et al., 2006; Yeo et al., 2007). A role for the DRB1*03 haplotype seems beyond doubt (Dyment et al., 2005; Barcellos et al., 2006; Yeo et al., 2007) and Sardinian data would suggest that this association most likely stems from the DRB1 gene (Marrosu et al., 2001). Beyond this it is clear that the MHC contains further signals but their nature and origins are as yet unresolved.
By cataloguing variation in the MHC through the re-sequencing of specific haplotypes (Allcock et al., 2002; Horton et al., 2008), and empirically establishing the complex patterns of LD across the region (Miretti et al., 2005), it has been possible to establish a comprehensive panel of haplotype tagging single nucleotide polymorphisms (SNPs) (de Bakker et al., 2006). These SNPs are currently being typed in multiple sclerosis and a number of other autoimmune diseases as part of the International MHC and Autoimmunity Genetics Network project. Hopefully, these systematic fine-mapping efforts will help to unravel this complex association, although it can be anticipated that large sample sizes will be needed to confirm the findings emerging from this project.
| The rest of the genome |
|---|
Outside the MHC, the genetic analysis of multiple sclerosis has been considerably less successful, with no consistent findings emerging until very recently. This lack of any convincing progress has been a source of great frustration, and the inconsistency in early claims has rightly been criticized (Hirschhorn et al., 2002).
| The needles |
|---|
Obviously we cannot know a priori what effects on risk will be conferred by individual susceptibility alleles, nor can we know their frequency or mode of inheritance. However, linkage analysis has provided us with invaluable guidance regarding an upper limit on these effect sizes, which researchers cannot afford to ignore.
The fact that association with the MHC can reliably be detected with modest resources (c. 100 cases and 100 controls) and yet only accounts for a fraction of the heritability seen in multiple sclerosis meant that in the late 1980s and early 1990s there was an expectation that non-MHC loci would be rather easy to find. At this time, there was a feeling that perhaps susceptibility to multiple sclerosis might be determined by just a handful of effects similar to, or perhaps even larger than, that conferred by the MHC. Coincident with this, the human genome project reached the point where systematic whole-genome screening for linkage became possible (see section 3 of Supplementary material). In 1996, the results of whole-genome screens for linkage to multiple sclerosis from the UK, the US and Canada were published back to back (Ebers et al., 1996; Haines et al., 1996; Sawcer et al., 1996). Each of these studies was based on approximately 100 affected sib pairs and employed 300–400 microsatellite markers. Subsequently similar studies were performed in multiplex families from Finland (Kuokkanen et al., 1997), Sardinia (Coraddu et al., 2001), Italy (Broadley et al., 2001), Scandinavia (Akesson et al., 2002), Australia (Ban et al., 2002) and Turkey (Eraksoy et al., 2003), and in addition each of the original three groups extended their analysis using further families and more microsatellite markers (Hensiek et al., 2003a; Dyment et al., 2004; Kenealy et al., 2004). Interestingly, none of these studies identified any statistically significant linkage, not even in the region of the MHC. Attempts at meta-analysis were no more successful, although linkage to the MHC just reached genome-wide significance in some of these studies (Ligers et al., 2001; GAMES and the Transatlantic Multiple Sclerosis Genetics Cooperative, 2003). Contemplating the reasons for this disappointing lack of linkage, the International Multiple Sclerosis Genetics Consortium (IMSGC) identified a number of issues which might have confounded these studies (IMSGC, 2004) and, in an attempt to correct for these, re-screened the genome for linkage using a dense map of SNPs in families from Australia, Scandinavia, the US and the UK (IMSGC, 2005). In the final analysis, this substantially larger study included data from 4506 SNPs typed in 730 multiplex families which between them provided almost 1000 affected relative pairs. The increased power provided by this screen in comparison with its predecessors is evident from the overwhelming evidence for linkage found in the MHC region, where a lod score of 11.7 was observed (IMSGC, 2005). Once again, however, no other region of statistically significant linkage was apparent. The comprehensive marker map used in this study makes it virtually impossible that any signals of a magnitude similar to that attributable to the MHC could have been missed. As with the previous studies, the number of suggestive linkage peaks was significantly greater than would have been expected by chance alone (IMSGC, 2005), indicating that there is excess allele sharing but providing no clear guide as to the location of relevant genes.
Although these linkage data provide no useful information concerning the location of non-MHC susceptibility loci, the observed allele sharing does provide useful guidance concerning the size of effects attributable to such loci (Risch, 1990). Employing the approach suggested by Risch and Merikangas (1996), and remembering that the observed allele sharing is expected to provide a significantly inflated estimate of effect size (Goring et al., 2001), it is straightforward to show that common non-MHC risk alleles are highly unlikely to increase risk by more than a factor of 2.0. Under these circumstances, it is clear that further linkage analysis is almost certain to be unrewarding since the number of sib pair families necessary to demonstrate significant linkage is impractically large (Risch and Merikangas, 1996) (see Supplementary material section 3). Fortunately, association-based studies are significantly more powerful and thus provide a means to identify genes exerting effects which fall below the resolution of linkage (Risch and Merikangas, 1996). However, even the most optimistic estimates of effect size consistent with the available linkage data indicate that association studies will need to involve at the very least 500 cases and 500 controls (Sawcer, 2006). Since most of the published literature regarding the genetics of multiple sclerosis has been based on significantly smaller numbers, one corollary of this is that almost all previous studies have been seriously underpowered. There are virtually no loci, with the possible exception of APOE (Burwick et al., 2006), where published studies have been adequately powered to confidently exclude the possibility of a meaningful effect. It seems highly likely that many of the entirely plausible candidates considered to have been excluded on the basis of the absence of any consistent evidence to date will eventually emerge as genuinely relevant in the disease. Coupling this limited effect size with the fact that for most genes no more than a tiny fraction of the variation has ever been tested, it is clear that few if any genes have received a thorough analysis. It is surely the virtual absence of any power that is responsible for nearly all the apparent inconsistency in the literature concerning the genetics of complex diseases such as multiple sclerosis (Lohmueller et al., 2003).
Of course, it remains possible that a large extended family in which a rare more penetrant allele is segregating might be found, and that the identification of such an allele might be informative regarding the pathogenesis of the disease, much as the identification of mutations in the alpha-synuclein gene has been informative regarding the pathogenesis of Parkinson's disease (Polymeropoulos et al., 1997). However, analysis of the few such families so far reported has failed to identify any significant linkage, let alone any relevant loci (Dyment et al., 2002; Modin et al., 2003). These larger families are characterized by multiple affected siblings rather than multiple affected generations, and rarely show the greater degree of consistency in phenotype that would be expected; instead they show an increased frequency of DRB1*1501 carriage, the reverse of what would be expected if a non-MHC locus were primarily responsible for the disease (Willer et al., 2007). It also remains possible that some otherwise rare alleles of higher penetrance might have drifted through a genetic bottleneck and thereby become frequent in a population isolate. However, given the surprising extent of identity by descent seen in apparently unrelated individuals (Frazer et al., 2007), it is hard to imagine that there will be much power to separate such alleles through homozygosity mapping.
| The haystack |
|---|
Although there are no clear data regarding the genetic architecture underlying susceptibility to multiple sclerosis, considerable progress has been made concerning the nature and extent of genetic variation in the human population in general (Frazer et al., 2007). Most (roughly 90%) of the differences between any two individuals is attributable to common variants, where both the alleles are seen in at least 1% of the population (Wang et al., 2005). The prior odds that a common variant selected at random is genuinely associated with the disease are of the order of 100 000:1 against (assuming that there is no significant LD between the various risk alleles).
By calculating the power of a study to identify any particular level of significance (Purcell et al., 2003) and using the above estimate for the prior odds we can determine the odds that a result with any particular level of significance is a true positive (see Supplementary material section 4). Figure 1 shows the posterior odds for studies of differing size assuming that the risk alleles relevant in multiple sclerosis are common (frequency 10%) and have a GRR of 1.3 (under a multiplicative model).
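The relationship between prior odds, power and significance threshold described above can be sketched with Bayes' rule in odds form: posterior odds = prior odds × power / α. This is an illustrative sketch rather than the calculation used for the figure; the function name and the power values are assumptions chosen only for the example.

```python
def posterior_odds(prior_odds, power, alpha):
    """Odds (true:false) that a result significant at level alpha is genuine.

    prior_odds: odds that the tested variant is truly associated
    power:      probability of reaching significance if it is associated
    alpha:      probability of reaching significance if it is not
    """
    return prior_odds * power / alpha


# Prior odds of 100 000:1 against, as quoted in the text.
prior = 1 / 100_000

# Hypothetical power values, chosen only for illustration.
for alpha, power in [(0.05, 0.99), (0.001, 0.95), (5e-7, 0.50)]:
    odds = posterior_odds(prior, power, alpha)
    print(f"alpha={alpha:g}  power={power:.2f}  posterior odds={odds:.3g}")
```

Even with 99% power, a result at the 5% level leaves the odds heavily against a true association, whereas at a threshold of 5 × 10⁻⁷ a study with 50% power yields posterior odds of about 10:1 in favour.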
From this figure, we can see that P-values in the range of 5–0.1% will virtually always be false positives, even in well-powered studies. This primarily occurs because the prior odds are so extreme that it remains more likely that this level of significance has arisen by chance in an unassociated marker than that we happen to have considered a genuinely associated marker. As the P-value becomes more extreme the probability of seeing such a result by chance alone is reduced (by definition) and therefore it becomes increasingly likely that the result is a true positive. While this is intuitively expected, it is perhaps counter-intuitive to see that the smaller the study (i.e. the less power in a study) the greater the level of significance needs to be before a result becomes more likely to be true than false (WTCCC, 2007).
These figures illustrate the need for adequate power. It is only when studies have sufficient power that we can rely on the prediction that P-values of <5 × 10⁻⁷ will more often be true than false. Figure 3 shows the sample size required in order to ensure that P-values of <5 × 10⁻⁷ will indeed be more often true than false in terms of the GRR conferred by susceptibility loci.
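A rough feel for the sample sizes involved can be obtained from the standard normal-approximation formula for comparing two allele frequencies. The sketch below is not the method used to produce the figure. It assumes a multiplicative model, equal numbers of cases and controls, a control allele frequency equal to the population frequency (reasonable for a rare disease) and prior odds of 100 000:1 against; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def cases_needed(raf, grr, alpha=5e-7, prior_odds=1 / 100_000):
    """Approximate number of cases (with as many controls) at which a result
    with P < alpha becomes more likely to be true than false."""
    nd = NormalDist()
    # Posterior odds = prior odds * power / alpha, so they reach 1 at this power.
    required_power = alpha / prior_odds
    # Risk allele frequency in cases under a multiplicative model.
    p1 = raf * grr / (raf * grr + 1 - raf)
    variance = p1 * (1 - p1) + raf * (1 - raf)
    z_sum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(required_power)
    # Each individual contributes two alleles, hence the factor of 2.
    return math.ceil(z_sum ** 2 * variance / (2 * (p1 - raf) ** 2))


for grr in (1.2, 1.3, 1.5, 2.0):
    print(f"GRR={grr}: about {cases_needed(raf=0.10, grr=grr)} cases and controls")
```

The required sample size rises steeply as the GRR approaches 1, because the allele frequency difference between cases and controls enters the formula squared.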
In the analysis presented above, it has been assumed that the variant chosen for study has been selected at random from amongst the full list of common variation. In practice, researchers do not do this but instead tend to use some prior knowledge or existing information to guide the selection of candidates. Of course, the validity of these prior data cannot be known and assumptions used to guide the selection of candidates may be invalid. In this way, we can imagine that random selection represents the worst-case scenario. Information used to guide the selection of candidates may come from many different sources such as data from animal models, expression studies, biological or pathological analysis. For example, the evidence that multiple sclerosis is an autoimmune disease is overwhelming and makes any gene with immunological function a logical candidate. However, since perhaps a fifth of genes have an immunological function, using this information to guide the selection of candidates would only improve the prior odds by a factor of 5, taking them from 100 000:1 to 20 000:1. These odds are certainly reduced but come at a price, since the likelihood of discovering non-immunological genes of relevance has been greatly reduced. Since non-synonymous coding variants and variants in regulatory regions or splice sites are more likely to have a functional effect than variants in silent non-coding regions, it has also been suggested that concentrating analysis on these more functionally relevant variants could also improve the prior odds (Tabor et al., 2002).
From this figure we can see that even for well-selected candidates studied in cohorts involving as many as 1000 cases and 1000 controls modest P-values (in the range of 5–0.1%) are still much more likely to be false positives than true. On the other hand, in a candidate gene study this number of samples is sufficient to ensure the reliability of more stringent P-values such as 5 × 10⁻⁷.
It is expected that the frequency of risk alleles will vary from locus to locus so it is reasonable to enquire how this variable might influence the interpretation of results. Figure 5 shows the relationship between Risk Allele Frequency (RAF) and the power to identify meaningful association (P-value <5 × 10⁻⁷).
Inspection of Fig. 5 shows that the power drops precipitously as the minor allele frequency (MAF) falls below 20% (corresponding to RAF values of <20% or >80%) even in large study cohorts. Once the MAF falls below 5% there is virtually no power. On the other hand, for intermediate values of RAF there is relatively little variation in power.
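The dependence of power on RAF can be explored with a simple normal approximation to the allelic test. This sketch is not the calculation behind Fig. 5: it assumes a multiplicative model with a GRR of 1.3, equal numbers of cases and controls, and a control allele frequency equal to the population RAF. The cohort size of 2000 cases and 2000 controls is a hypothetical choice.

```python
from statistics import NormalDist


def allelic_power(n, raf, grr, alpha=5e-7):
    """Approximate power of a two-sided allelic test with n cases and n controls."""
    nd = NormalDist()
    p1 = raf * grr / (raf * grr + 1 - raf)      # risk allele frequency in cases
    # Standard error of the frequency difference, with 2n alleles per group.
    se = ((p1 * (1 - p1) + raf * (1 - raf)) / (2 * n)) ** 0.5
    z = (p1 - raf) / se
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(z - z_crit) + nd.cdf(-z - z_crit)


for raf in (0.02, 0.05, 0.10, 0.20, 0.50, 0.80, 0.90, 0.95):
    print(f"RAF={raf:.2f}  power={allelic_power(2000, raf, 1.3):.3f}")
```

Under these assumptions power is highest at intermediate frequencies and falls away at both extremes, in line with the pattern described for the figure.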
The effects of heterogeneity on the ability to identify common susceptibility variants are also worthy of consideration. A degree of heterogeneity is to be expected (Wang et al., 2005) and the extent to which this and other sources of confounding, such as diagnostic inaccuracy, reduce the power to identify association is clearly relevant. Figure 6 indicates the consequences of including phenocopies in the case cohort (Gordon et al., 2002; Edwards et al., 2005).
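To first order, the cost of phenocopies can be estimated by noting that they dilute the allele frequency difference between cases and controls. The sketch below assumes that phenocopies carry the risk allele at the control frequency, so that a phenocopy fraction f shrinks the difference by a factor (1 − f), and it ignores the small accompanying change in variance. It is a rough approximation with a hypothetical function name, not the analysis underlying the figure.

```python
def phenocopy_inflation(fraction):
    """Approximate factor by which a study must grow to keep its power when
    `fraction` of the case cohort are phenocopies.

    The required sample size scales with 1 / (frequency difference)^2, and the
    difference is diluted by (1 - fraction).
    """
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return 1 / (1 - fraction) ** 2


for f in (0.02, 0.10, 0.15, 0.30):
    print(f"phenocopy fraction={f:.2f}  sample size inflation={phenocopy_inflation(f):.2f}")
```

On this approximation, 10% phenocopies call for roughly 23% more samples and 15% for roughly 38% more.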
From Fig. 6 it is clear that a surprisingly high level of phenocopy inclusion can be tolerated. This observation should not be interpreted as arguing for careless phenotyping; clearly power will be reduced every time a phenocopy is mistakenly included in a study and every effort should be made to keep this to a minimum. On the other hand, given that a degree of heterogeneity is expected, it is important to realize that even if this amounted to as much as 10–15% of the disease it would still be possible to identify relevant common variants. Some evidence for heterogeneity in multiple sclerosis has been identified, although this probably amounts to no more than 1–2% of the disease. In the early 1990s, it was realized that some patients with Leber's Hereditary Optic Neuropathy (an optic atrophy caused by specific mutations in mitochondrial DNA) developed a disease that, apart from the prominence of visual failure, was clinically and radiologically indistinguishable from multiple sclerosis (Harding et al., 1992).
Some investigators feel that primary progressive disease is a distinct condition and should be considered separately from relapsing-remitting disease, while others feel that this apparent distinction is just a reflection of the fact that the activity of the relapsing component of the disease is highly variable, being essentially absent in some cases and prominent in others. Detailed analysis of the natural history of the disease has shown that progression is essentially independent of relapse activity and indistinguishable between primary progressive and relapse onset cases (Compston, 2006; Confavreux and Vukusic, 2006a, b; Kremenchutzky et al., 2006). In the same way, pathological and radiological differences between primary progressive and relapsing onset disease are largely a reflection of relapse activity rather than being distinct. It seems likely that genetic factors will influence the course of multiple sclerosis, and there is evidence for a degree of concordance within multiplex families with respect to course (Hensiek et al., 2007). However, it also seems likely that in terms of susceptibility factors there will be rather more in common between primary progressive and relapsing disease than different; certainly there is no convincing evidence for any difference between these two groups in terms of the susceptibility factors thus far established.
| Genome-wide association studies (GWAS) |
|---|
One logical way to improve the odds of identifying susceptibility factors would be to consider all common variation rather than just a single randomly selected or candidate variant. If all common variation were to be typed in a study then this study would be sure to include an analysis of the relevant variants. In this situation, concerns about prior odds might be ignored and tests simply interpreted after some correction for multiple testing. However, rather predictably, nothing is gained by adopting this approach to analysis since the statistical penalty required to correct for multiple testing is equivalent to that incurred by allowing for the prior odds (Freimer and Sabatti, 2004).
Testing for association indirectly introduces another variable which influences the power to identify relevant loci: the extent of LD between the tested variant and the causative allele (Moskvina and O'Donovan, 2007). Figure 7 shows the effects of LD on power to identify meaningful association in studies of differing size.
From Fig. 7 it is clear that power falls dramatically as LD declines unless study samples are large enough to ensure reserve power. In a study involving 10 000 cases and 10 000 controls there would be little difference in power between causative variants and those in LD with r² > 0.8 (when considering variants with a frequency of 10% that increase risk by a factor of 1.3 or more). In the context of the MHC, even lower levels of LD can generate highly significant associations at test markers as the signal from the causative (DRB1*1501) allele is so strong.
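A commonly used approximation for indirect association is that a marker in LD with the causative allele behaves like the causative allele typed in a sample reduced by a factor of r². The sketch below applies that approximation only; it is not the calculation behind Fig. 7, and the function names are hypothetical.

```python
def effective_sample_size(n, r2):
    """Approximate sample size at the causative allele that would give the same
    power as n samples typed at a marker with the given r^2."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return n * r2


def sample_size_at_marker(n_causal, r2):
    """Approximate sample size needed at the marker to match the power of
    n_causal samples typed at the causative allele."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return n_causal / r2


for r2 in (1.0, 0.8, 0.5, 0.14):
    print(f"r2={r2:.2f}  2000 samples at the causative allele are matched by about "
          f"{sample_size_at_marker(2000, r2):.0f} at the marker")
```

At r² = 0.8 the penalty is a 25% increase in sample size, which is why a very large study retains nearly full power for well-tagged variants while a small study does not.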
One of the most notable features of the GWAS completed to date is the consistent observation that association tests are modestly inflated at neutral markers (i.e. those that do not influence susceptibility) in comparison with what would be expected if sampling error were the only source of variance. Exploring this systematic genomic inflation, Clayton et al. (2005) established that this modest but discernible effect stems from two influences, population stratification and differential missingness. Population stratification refers to the generation of a case-control allele frequency difference due to systematic difference in ancestry between the cases and controls, in other words incomplete matching of cases and controls with regard to ancestry (Thomas and Witte, 2002). Although this effect has long been suggested as a source of false-positive association (Lander and Schork, 1994), the results from GWAS thus far published have empirically confirmed the prediction that the effect would rarely account for anything more than modest inflation of association (Cardon and Bell, 2001). In addition, the WTCCC has shown that with very few exceptions allele frequencies do not vary significantly across the UK, thereby confirming that within populations like the UK hidden stratification will rarely if ever produce more than modest inflation in the evidence for association (WTCCC, 2007). Unfortunately the MHC is one of the loci where allele frequencies do vary considerably across the country, thereby raising the possibility that population stratification could confound the analysis of this locus if cases and controls are not adequately matched. However, having completed a GWAS it is straightforward to identify ancestry and compensate for any stratification (Devlin and Roeder, 1999; Bacanu et al., 2000; Devlin et al., 2004). By studying individuals whose ancestry is known to involve a mix of ethnic groups, which vary in their susceptibility to multiple sclerosis, population stratification can actually be used to help map risk loci (Smith et al., 2004). Employing this admixture approach in African American patients, Reich et al. (2005) identified a region on chromosome 1 where European ancestry was in statistically significant excess, but this group has not yet been able to fine map the region and identify the relevant gene. Interestingly, these researchers found no evidence for any distortion in ancestry in the MHC, suggesting that the DRB1*1503 allele, which is common in African individuals, likely confers the same risk as the DRB1*1501 allele, which is more common in Europeans. In short, these data suggest that the difference in risk seen between African and European individuals is unlikely to stem from the MHC and may well be determined by the yet to be defined locus on chromosome 1.
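The modest genome-wide inflation discussed above is often summarized by an inflation factor λ, estimated as the median observed 1-df chi-square statistic divided by its expected median under the null, with each statistic then divided by λ. The following is a minimal sketch of this genomic control idea (Devlin and Roeder, 1999, cited above); the function name and the example statistics are invented for illustration.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df: the square of the 75th
# percentile of the standard normal (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control(chi2_stats):
    """Return (lambda, adjusted statistics) for a list of 1-df chi-square tests."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(1.0, lam)   # by convention the statistics are never inflated upwards
    return lam, [x / lam for x in chi2_stats]


# Made-up test statistics for a handful of markers.
observed = [0.05, 0.31, 0.52, 0.58, 0.90, 1.70, 4.20]
lam, adjusted = genomic_control(observed)
print(f"lambda={lam:.3f}")
print([round(x, 3) for x in adjusted])
```

In a real scan λ would be estimated from many thousands of markers, most of which are assumed to be neutral.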
Differential missingness refers to the allele frequency difference that develops between cases and controls when genotyping failure is non-random with respect to genotype and differs in extent between cases and controls, i.e. when there is a difference in the amount of non-random missing information between cases and controls (Clayton et al., 2005). In fact, genotyping failure is almost always non-random with respect to genotype, with the result that genotyping efficiency, the extent to which genotyping is complete, is one of the most valuable measures of data quality. Only analysing markers with adequate levels of genotyping efficiency and no significant difference in genotyping efficiency between the cases and controls minimises the effects of this phenomenon. Since the perturbing influences of this effect are dependent upon MAF, the genotyping efficiency threshold required must be more stringent for markers with MAF of <10%.
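A per-marker quality filter along these lines might look like the sketch below. It checks that genotyping efficiency (call rate) is adequate in both groups and uses a two-proportion z-test for a difference in call rate between cases and controls. The function name and both thresholds are assumptions made for illustration; the text does not specify particular values.

```python
from statistics import NormalDist


def missingness_check(called_cases, n_cases, called_controls, n_controls,
                      min_call_rate=0.95, alpha=0.001):
    """Return (keep, case_call_rate, control_call_rate, p_value) for one marker.

    min_call_rate and alpha are hypothetical thresholds, not recommendations.
    """
    r1 = called_cases / n_cases
    r0 = called_controls / n_controls
    pooled = (called_cases + called_controls) / (n_cases + n_controls)
    se = (pooled * (1 - pooled) * (1 / n_cases + 1 / n_controls)) ** 0.5
    if se == 0:
        p_value = 1.0          # identical, complete (or empty) genotyping in both groups
    else:
        z = (r1 - r0) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    keep = min(r1, r0) >= min_call_rate and p_value >= alpha
    return keep, r1, r0, p_value


print(missingness_check(995, 1000, 993, 1000))   # similar call rates
print(missingness_check(960, 1000, 999, 1000))   # adequate rates but a clear difference
```

For markers with a MAF below 10%, a caller could simply pass a higher `min_call_rate`, reflecting the more stringent threshold the text calls for.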
| Recent progress |
|---|
|
|
|---|
Using genomic convergence Fernald et al. (2005
Alongside these candidate gene efforts, 2007 saw the publication of two GWAS studies in multiple sclerosis. In the first, the IMSGC screened 931 trio families (half from the US and half from the UK) using 334 923 SNPs (IMSGC, 2007
). As would be predicted, the limited power provided by 931 trio families meant that no unequivocal associations were identified in the screening phase, outside of the expected signals from the MHC. However, by utilizing additional controls from the WTCCC (n = 1475) and the National Institutes of Mental Health (n = 956) along with candidate gene information, a short list of 110 loci was followed up in an additional 2931 cases and 4205 controls. In the final analysis (employing a total of 12 360 individuals), association with rs6897932 (the IL7R associated SNP) was confirmed and significant association was also established with rs12722489 (P = 3.0 × 10⁻⁸) and rs2104286 (P = 2.2 × 10⁻⁷) from the interleukin-2 receptor (IL2R) gene, making this the second non-MHC locus to be established in multiple sclerosis (IMSGC, 2007
). In the second GWAS, performed by the WTCCC (Burton et al., 2007
), 975 cases and 1466 controls were screened with 12 374 non-synonymous SNPs (nsSNPs). Again, the limited power provided by the cohort size meant that the screen failed to identify any unequivocally associated markers. However, it is relevant to note that rs6897932 was the eighth most associated marker identified, confirming that a GWAS-based approach would have identified this association had it not already been established through the candidate gene approach.
Attempts to follow up on the other potential associations identified in these screens are underway alongside additional screens which will help to refine the ranking of tested variants.
| Putting things in context |
|---|
|
|
|---|
It is worth pausing to consider the nature of these new findings. Taking the IL7R association as an example, we can see that the multiple sclerosis associated allele of rs6897932 has a frequency of 72%, which means that approximately 9 out of every 10 white Europeans carry this risk allele (assuming Hardy-Weinberg proportions, 1 − 0.28² ≈ 0.92); it therefore certainly qualifies as a common variant. The allele is estimated to increase the risk of the disease by a factor of just 1.2. Using these parameters we can calculate the significance level (P-value) that would be expected in attempts to replicate this finding, as shown in Fig. 8.
It is clear from this figure that a replication study will need to involve at least 2000 cases and 2000 controls if it is to have >95% power to demonstrate nominal significance (a P-value of <5%). Most, but not all, attempts at replication involving more than 600 cases and 600 controls will be expected to yield a P-value of <5%. Studies with fewer than 600 cases and 600 controls are unlikely to identify even nominally significant association. It will be important to keep these values in mind when interpreting replication studies. If a study involving just 400 cases and 400 controls fails to identify nominally significant association, this should not be interpreted as evidence that rs6897932 is not relevant in the tested population. This is perhaps the least likely explanation.
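The sample sizes quoted above can be checked approximately with a standard normal approximation to the allelic case-control test. The sketch below is not the calculation used to draw the figure, which is not described here. It assumes that the 72% frequency applies to controls, that the factor of 1.2 can be treated as a per-allele odds ratio, and that each individual contributes two independent alleles; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_frequency(control_freq, odds_ratio):
    """Risk allele frequency expected in cases under a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def replication_power(n_cases, n_controls, control_freq=0.72,
                      odds_ratio=1.2, alpha=0.05):
    """Approximate power of a two-sided allelic test at significance level alpha."""
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    # Standard error of the allele frequency difference (two alleles per person).
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    ncp = abs(p1 - p0) / se
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)
```

Under these assumptions the approximation gives roughly 50% power with 600 cases and 600 controls and close to 95% with 2000 of each, which is broadly in line with the thresholds described above; with 400 of each, power falls well below one half.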
Taking these estimates of effect size and allele frequency, we can calculate the lod score that rs6897932 would be expected to generate in a set of 100 sib pairs. This turns out to be <0.01! In short, loci such as rs6897932 will not be expected to generate any linkage signals discernible in previously published linkage screens. Thus, any apparent concordance between identified susceptibility loci and previously reported linkage peaks is entirely coincidental.
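The order of magnitude of this lod score can be reproduced with a simple sketch. The method used for the calculation is not stated here, so what follows is one conventional approach and not necessarily the author's: it assumes a multiplicative model with a per-allele risk ratio of 1.2, fully informative identity-by-descent (IBD) data, and the affected sib pair sharing probabilities that follow from the formulae of Risch and Merikangas (1996). The expected lod is taken as the expected log10 likelihood ratio at the true sharing probabilities. Function names are hypothetical.

```python
from math import log10


def sharing_probabilities(raf, risk_ratio):
    """Probabilities that an affected sib pair shares 0, 1 or 2 alleles IBD."""
    p, q = raf, 1 - raf
    w = p * q * (risk_ratio - 1) ** 2 / (p * risk_ratio + q) ** 2
    lam_sib = (1 + w / 2) ** 2          # locus-specific sibling risk ratio
    z0 = 0.25 / lam_sib
    z1 = 0.5 * (1 + w) / lam_sib
    z2 = 0.25 * (1 + w) ** 2 / lam_sib
    return z0, z1, z2


def expected_lod(raf, risk_ratio, n_sib_pairs):
    """Expected lod: number of pairs times the per-pair log10 likelihood ratio."""
    null = (0.25, 0.5, 0.25)
    alt = sharing_probabilities(raf, risk_ratio)
    per_pair = sum(z * log10(z / z_null) for z, z_null in zip(alt, null))
    return n_sib_pairs * per_pair
```

With `raf=0.72`, `risk_ratio=1.2` and 100 sib pairs, the value returned is far below 0.01, consistent with the statement above; the sharing probabilities differ from the null values of 25%, 50% and 25% by only a fraction of a percentage point.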
| In conclusion |
|---|
|
|
|---|
An analogy might serve to summarize the relative strengths of the many and varied methods which have been used to try to unravel the complex genetics underlying susceptibility to multiple sclerosis. Epidemiological analysis might be likened to a hand-held magnifying glass; it has allowed us to demonstrate that there are genetic factors to be found but, because of its inherent imprecision and vulnerability to confounding, is unable to reveal any greater detail. Linkage analysis, on the other hand, can be likened to a light microscope; large details such as the relevance of the MHC can be seen, but this approach lacks the resolution needed to identify any other detail. It would be wholly inaccurate to infer that the failure of this insensitive instrument to identify any other detail implies that there are no further genes involved or that additional genes will be of no biological importance. GWAS provides us with the equivalent of an electron microscope; using this tool we are finally starting to identify relevant non-MHC loci and unravel the nature of susceptibility to multiple sclerosis. If funding agencies can be persuaded to follow this long road to its logical conclusion and support a 10 000-patient strong GWAS along with the necessary replication and fine-mapping







