Artificial intelligence for dementia genetics and omics
Janice M. Ranson and David J. Llewellyn are joint senior authors.
Abstract
Genetics and omics studies of Alzheimer's disease and other dementia subtypes enhance our understanding of underlying mechanisms and pathways that can be targeted. We identified key remaining challenges: First, can we enhance genetic studies to address missing heritability? Can we identify reproducible omics signatures that differentiate between dementia subtypes? Can high-dimensional omics data identify improved biomarkers? How can genetics inform our understanding of the causal status of dementia risk factors? And which biological processes are altered by dementia-related genetic variation? Artificial intelligence (AI) and machine learning approaches offer powerful new tools to tackle these challenges, and we review possible solutions and examples of best practice. However, their limitations also need to be considered, as well as the need for coordinated multidisciplinary research and diverse deeply phenotyped cohorts. Ultimately AI approaches improve our ability to interrogate genetics and omics data for precision dementia medicine.
Highlights
- We have identified five key challenges in dementia genetics and omics studies.
- AI can enable detection of undiscovered patterns in dementia genetics and omics data.
- Enhanced and more diverse genetics and omics datasets are still needed.
- Multidisciplinary collaborative efforts using AI can boost dementia research.
1 INTRODUCTION
Dementia results from a variety of heterogeneous pathologies, such as Alzheimer's disease (AD), Parkinson's disease dementia (PDD), dementia with Lewy bodies (DLB), frontotemporal dementia (FTD), and cerebrovascular disease.1 The number of people living with dementia worldwide is around 45 million and, as life expectancy increases and populations age, this number is expected to increase.2 Genome-wide association studies (GWAS) have led to the identification of an increasing number of genetic loci associated with the risk of dementias and related neurodegenerative diseases in older adults, primarily of European ancestry.3-10 However, even with established bona fide associations, the task of characterizing variants and genes in the context of complex disease molecular pathophysiology, as well as their interacting genes and pathways, remains a daunting challenge.11
Recent progress in cutting-edge genetic and omics technologies, such as epigenomics, transcriptomics, proteomics, and metabolomics, which refer to the comprehensive assessment of a set of specific types of biological molecules, allied with emerging computational methods, holds promise of faster discoveries. However, because of the large number of associations investigated in most omics scale studies, it is necessary to have large sample sizes collected in a consistent manner. Scaling up multidisciplinary dementia studies, such as those using omics approaches, comes with challenges and implies the need for coordinated efforts from clinicians and from basic and computational scientists. Appropriate funding and infrastructures capable of dealing with large numbers of biological samples and big data are also needed.
As the omics field continues to expand in dementia research, artificial intelligence (AI)-powered technologies, and in particular machine learning (ML) and deep learning (DL), are well suited to detecting undiscovered patterns in high-dimensional data and could advance dementia research in unprecedented ways (Figure 1). Coronavirus disease 2019 (COVID-19) demonstrated that progress can rapidly be made toward tackling a disease when certain scientific practices are altered.12 Coordinated action across interested parties can result in extraordinary progress within short periods of time. Significant progress could be made rapidly in dementia research if interested parties were able to organize such that we could tackle the systemic problems that hold back the field, some of which are discussed below.

Here we identify and discuss five unresolved key questions in dementia research, which could be addressed using omics combined with advanced AI approaches: (1) How can we enhance genetic studies to inform our understanding of dementia risk? (2) Can we find reproducible omics brain signatures that differentiate between dementia subtypes? (3) Can high-dimensional omics data identify improved molecular biomarkers for dementia compared to single marker approaches? (4) How do we use genetics to inform our understanding of causal risk factors? And (5) Which biological processes are altered by genetic risk for dementia-related diseases? Tackling these questions is crucial to improving our understanding of dementia, and involves coordinating a multitude of players whose expertise goes well beyond omics. It also involves improving the availability of bioresources and clinical data as well as developing analytical tools and ML algorithms to deal with high-dimensional and heterogeneous data. We note some of the challenges which must be surmounted to answer these questions within the next decade. In each instance, we highlight possible solutions and exemplar projects and communities that have set good examples that can be used to improve our performance as a dementia research community.
This review is one of a series of eight articles in a Special Issue on ‘Artificial Intelligence for Alzheimer's Disease and Related Dementias’ published in Alzheimer's & Dementia. Together, this series provides a comprehensive overview of current applications of AI to dementia, and future opportunities for innovation to accelerate research. Each review focuses on a different area of dementia research, including experimental models [this issue], drug discovery and trials optimization [this issue], genetics and omics (this article), biomarkers [this issue], neuroimaging [this issue], prevention [this issue], applied models and digital health [this issue], and methods optimization [this issue].
RESEARCH IN CONTEXT
- Systematic review: Our understanding of dementia etiology, pathology, biomarkers, and risk factors remains limited. We identified and discussed five key challenges hampering progress in dementia genetics and omics and proposed solutions to boost dementia-related discoveries.
- Interpretation: The use of artificial intelligence (AI), including aspects of machine learning, deep learning, and advanced AIs, is still in its infancy in dementia genetics and omics research. Although not a panacea, these powerful analytic approaches have the potential to identify and relate relevant dementia features at an unprecedented speed and depth to transform data into improved knowledge.
- Future directions: As a research community, we must develop more effective multidisciplinary collaborative efforts to enhance dementia-related datasets, leveraging AI approaches to prioritize drug targets, synthesize information, and facilitate knowledge transfer to diverse audiences in addition to simply speeding up the coding and analytic process.
2 KEY CHALLENGES
2.1 How can we enhance genetic studies to inform our understanding of dementia risk?
2.1.1 State of the science
The majority of GWAS rely upon logistic or linear regression-based approaches to test for associations between individual genetic variants (single nucleotide polymorphisms; SNPs) and a binary or continuous outcome.13, 14 This process is repeated until an estimate of association has been generated separately for each genetic variant. Then p-values are used to gauge whether any of these individual associations are strong enough to be considered genome-wide significant when correcting for multiple testing (a conventional threshold for ‘hits’ is 5 × 10⁻⁸).15 After a GWAS has been conducted it is often then possible to construct a polygenic risk score (PRS) by summing the value for each genetic variant weighted by the effect size from the initial GWAS.16 PRS have important applications as research tools, in clinical trials and in clinical practice, as they can facilitate causal inference modeling and genetic risk stratification on an individual level. Despite twin study heritability estimates of around 60%–80% for AD,17 recent SNP-based estimates of common variant heritability of AD from GWAS and PRS are much lower (up to 20%),18 suggesting that much of the genetic contribution to dementia risk remains unexplained. Other approaches, such as integrating multi-omics data or non-linear modeling, are needed to uncover this missing heritability.
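The two calculations described above can be sketched in a few lines of Python. This is a minimal illustration rather than any specific GWAS pipeline: the threshold is shown as a Bonferroni correction under the common assumption of roughly one million independent common-variant tests, and the variant identifiers, effect sizes, and genotype dosages are invented for the example.

```python
# Illustrative only: variant IDs, effect sizes and dosages are made up.

# Bonferroni correction: family-wise alpha divided by the number of tests.
# Assumption: ~1 million independent common-variant tests genome-wide.
ALPHA = 0.05
N_INDEPENDENT_TESTS = 1_000_000
genome_wide_threshold = ALPHA / N_INDEPENDENT_TESTS


def polygenic_risk_score(dosages, weights):
    """Sum of risk-allele dosages (0, 1 or 2) weighted by GWAS effect sizes.

    Variants missing from `dosages` are skipped rather than imputed.
    """
    return sum(weights[snp] * dosages[snp] for snp in weights if snp in dosages)


# Hypothetical per-variant log odds ratios from a discovery GWAS.
gwas_weights = {"rs_a": 0.20, "rs_b": -0.10, "rs_c": 0.05}
# Hypothetical risk-allele counts for one individual.
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = polygenic_risk_score(person, gwas_weights)
print(f"threshold = {genome_wide_threshold:.0e}, PRS = {score:.2f}")
```

Because the score is a purely additive weighted sum, it cannot by construction represent interactions between variants, which is one motivation for the non-linear approaches discussed later in this section.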
2.1.2 What problems need addressing?
The diagnosis of dementia and its subtypes is imprecise.19 Current GWAS are based on cases for whom diagnosis of a specific dementia subtype has been largely made based upon clinical signs and symptoms. Thus, although current dementia GWAS are likely to be enriched for pathology related to the dementia subtype of interest, they will inevitably also contain other dementia subtypes and pathologies in their cases. This is problematic since etiology and risk factors are likely to differ for each dementia subtype, so genetic markers with small effect sizes that are specific to a single dementia subtype will be harder to detect than generalized dementia pathways.
There is currently a marked lack of diversity within dementia genetics studies, with GWAS discovery being largely confined to the genetics of AD in non-Hispanic White adults of European ancestry. Although some small GWAS have been conducted in non-European samples,20-23 have measured non-AD dementias,6, 9, 10 and incorporated dementia-related intermediate quantitative phenotypes or endophenotypes (such as amyloid-beta and cerebral small vessel disease),24-26 these studies are largely underpowered. Certain ancestries remain understudied; South Asians, for example, are underrepresented despite making up around a quarter of the total global population. Without enhancing diversity in GWAS, or developing appropriate reference panels and genotyping chips, we are unable to construct PRS for all ancestral groups. This perpetuates ethnic bias in future research and clinical practice. We need better methods that can leverage diversity when evaluating risk, not only from the standpoint of genetics but also by integrating multimodal data that may interact with genetic or epigenetic factors as part of comprehensive risk assessment and risk prediction.
Both coding and non-coding rare/structural variants are thought to be important contributors to missing heritability in dementia, and their association with dementia risk needs to be further pursued through short- and long-read sequencing technologies.27 Under the hood, long-read sequencing is powered by DL, using GPU-powered alignment algorithms to better characterize the genome. Other potential reasons for missing heritability include unmeasured interactions between genes (epistasis) and failing to account for correlations between genetic variants due to population structure, dynastic effects, assortative mating or functional relationships.28
2.1.3 Possible solutions
Perhaps the simplest way to enhance future GWAS is to further increase sample sizes and the diversity of these samples. This has been the main strategy so far, and has been reasonably successful in identifying additional genetic variants and, to a lesser degree, improving the phenotypic variance explained. It is reasonable to assume that by further increasing sample sizes (essentially more of the same) further discoveries will be made. Increasing sample sizes considerably will involve enhancing existing research studies or establishing new studies. It is also important to consider the existence of different dementia subtypes and how to distinguish them. It may be possible to take advantage of existing well characterized samples that have not previously been genotyped due to resource limitations, such as gold standard post mortem brain bank material with linked clinical data. That said, the cost of new studies which include clinical characterization is likely to remain high, and the number of existing samples is finite, raising practical concerns. Although there is no theoretical upper limit, in practice a predictive accuracy plateau in part limited by heritability is often reached, beyond which additional training data is not helpful. Given the large amount of missing heritability remaining, it is likely that increasing sample sizes may be needed but will not be sufficient in future GWAS, and alternative approaches will be required.29
Leveraging population diversity, rather than omitting it, can both improve statistical power and better detect causal variants. For example, a transfer learning approach was used to enhance the findings from a modestly sized GWAS in a Japanese population using summary statistics from a larger European ancestry GWAS.21 Conversely, trans-ancestry cohorts can also be used to improve genetic variant discovery and localization in European ancestry GWAS. Transfer learning heuristics can also potentially be employed with different rates across global and local admixture levels in some populations for higher accuracy.
As an alternative to the standard linear approaches employed in traditional GWAS, advanced ML approaches may offer various benefits30 (Table 1), including the ability to: (1) capture main genetic effects more accurately; (2) capture multi-scale, non-linear epistatic interactions overlooked when investigating genetic variants individually; (3) better handle trans-ethnic variation; (4) flexibly integrate multimodal (e.g., neuroimaging, clinical biomarkers) and/or multi-omics data; and (5) accurately predict multiple outcomes, such as subtraits, symptoms, and endophenotypes, at once. For example, a gradient tree boosting method followed by an adaptive iterative genetic variant search was used to capture complex non-linear epistatic interactions and select interacting genetic variants with high predictiveness for breast cancer.31 Similarly, improvements have been observed by applying DL to predict survival in age-related macular degeneration32 and reduce multiple testing burden.33 The tool DeepWAS34 was used to identify genetic variants associated with multiple sclerosis and major depressive disorder while simultaneously predicting their cell-type-specific regulatory effects using multi-omics data integration. DeepNull35 is a DL-based tool that models non-linear associations between the phenotype and non-genetic covariates. This improved GWAS hits detection by 6% and phenotypic prediction by 23% on average across 10 different UK Biobank traits, while also substantially reducing the false positive rate. Despite these advances, few attempts have so far been made to apply these techniques to dementia. While early attempts to apply ML-based methods to improve AD risk variant prediction have yet to find substantial improvements over traditional GWAS, the cohorts in which these models have been applied are extremely underpowered,36, 37 leaving ample opportunities to fully leverage ML-based methods on large-scale genomic data.38
Challenge | Use of AI/ML/DL |
---|---|
Multi-scale or non-linear epistatic interactions are overlooked when investigating genetic variants individually through GWAS | ML accurately predicts multiple outcomes at a time/Tree-based methods can be used to capture complex non-linear epistatic interactions and select interacting genetic variants |
GWAS are limited by genetic detection of genome-wide hits | DL models can deal with non-linear associations between the phenotype and non-genetic covariates to improve GWAS hits detection |
GWAS are limited by European ancestry based research | ML models in some cases are better to incorporate trans-ethnic variation and implement transfer learning |
Cell-type effects and specific pathologies are difficult to reproducibly categorize | DL can predict cell-type-specific regulatory effects using multi-omics data integration substantially reducing the false positive rate/DL and computer vision can be used for generating harmonized digital pathology datasets |
PRS are limited by predictive accuracy and hampered by heritability | Novel DL-based models that do not rely only on the additive effect of risk SNPs may outperform more traditional PRS models across a variety of disease phenotypes |
Causal inferences are often underpowered and limited in scope | DeepMR125 approaches integrate ML with MR by using multi-task DL models to learn the relationship between different sets of genomic marks associated with a pathway or phenotype of interest and then use MR to examine causal relationships between them |
- Abbreviations: AI, artificial intelligence; DeepMR, deep Mendelian randomization; DL, deep learning; GWAS, genome-wide association studies; ML, machine learning; MR, Mendelian randomization; PRS, polygenic risk score.
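To make the first row of the table concrete, the toy sketch below shows why testing variants one at a time can miss purely epistatic signal. The data are synthetic and deliberately extreme (an XOR-style interaction between two hypothetical loci with no main effects); this is an assumption chosen for clarity, not a realistic genetic architecture. Tree-based learners can recover this kind of structure because they split on one variant conditional on another.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Synthetic, balanced toy data: carrier status (0/1) at two hypothetical loci.
# Assumption: the phenotype is 1 only when exactly one locus carries the allele.
samples = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2) for _ in range(25)]


def marginal_effect(locus):
    """Difference in case rate between carriers and non-carriers at one locus."""
    carriers = [s[2] for s in samples if s[locus] == 1]
    non_carriers = [s[2] for s in samples if s[locus] == 0]
    return mean(carriers) - mean(non_carriers)


def joint_case_rates():
    """Case rate within each two-locus genotype combination."""
    groups = defaultdict(list)
    for a, b, y in samples:
        groups[(a, b)].append(y)
    return {genotype: mean(ys) for genotype, ys in sorted(groups.items())}


print("marginal effects:", marginal_effect(0), marginal_effect(1))
print("joint case rates:", joint_case_rates())
```

In this constructed example each locus shows no difference in case rate on its own, while the joint genotype determines the phenotype completely. Real interactions are far weaker and noisier, so large samples remain necessary to detect them.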
These ML approaches may provide the key to the development of PRS with greater predictive accuracy and specificity.39 However, the degree of improvement offered by ML methods may be partly dependent on the complexity and inter-individual heterogeneity of the genetic architecture underlying the disease of interest. For instance, DeepPRS,40 a novel DL-based model that does not only rely on the additive effect of risk SNPs, outperformed more traditional PRS models across a variety of disease phenotypes, including AD. Thus, we anticipate further improvements in these approaches will unlock some of the unexplained heritability observed in prior GWAS, enhancing future research, trials, and clinical practice.
2.1.4 Examples of best practice
The Global Parkinson's Genetics Program (GP2)41 is in the process of collecting 100,000 European Parkinson's Disease cases, and a further 50,000 cases from under-represented populations around the world. They are primarily achieving this through collaborations and partnerships with researchers and organizations in other countries across the world, highlighting that large collaborative efforts are crucial for success.
Recent work in multi-ancestry PRS is a good first step in the right direction,42 but with larger sample sizes of participant level data, an ML approach could perform well. Lake and colleagues leverage genetically quantified admixture and random effects models in a population with complex substructures using both random-effects derived risk scores and a risk heuristic that leverages the rates of genetic admixture to build a better predictive model.22
2.2 Can we find reproducible omics brain signatures that differentiate between dementia subtypes?
2.2.1 State of the science
Omics technologies have been increasingly applied to human brain samples from individuals with dementia and related neurodegenerative conditions.43-46 Similarly to the GWAS described in the previous section, the largest brain omics studies have focused exclusively on AD. For example, a meta-analysis of the AD human brain transcriptome,47 using gene expression data from over 2000 samples, identified 30 coexpression modules as the major source of AD transcriptional perturbations. Additionally, a meta-analysis of AD epigenome-wide association studies,48 using deoxyribonucleic acid (DNA) methylation data from over 2000 individuals, identified 334 differentially methylated positions associated with AD neuropathology across cortical regions. Yet, robust disease-specific omics signatures or signatures shared across diseases are lacking. Neurodegenerative diseases are heterogeneous entities and there is extensive clinical, pathological, and genetic overlap.49 Co-pathologies alongside a dominant condition are frequent (e.g., presence of Lewy bodies in AD patients).50 Cross disease/pathology studies are starting to emerge, for example, addressing epigenetic changes across neurodegenerative diseases,51, 52 and disentangling amyloid-β and tau-pathology-associated transcriptomic profiles in AD.53 However, to find distinguishing molecular signatures we require large well-powered trans-diagnostic cohorts, with a range of primary co-pathologies, and to develop powerful unsupervised ML methods to cluster omics data.54 Although the increasing availability of single-disease datasets has opened the way to meta-analysis and multiple-cohort reanalysis,55-60 much more is needed to assess which mechanisms are conserved across pathologies and which are disease-specific.
2.2.2 What problems need addressing?
It is yet to be understood how and why selective vulnerability occurs in different brain regions and cell types across different neurodegenerative diseases. However, findings from omics studies are often not replicable at the gene/effect level even within a single disease. How then can replicability be enhanced? Several issues need to be addressed: First, studies are often undertaken in small cohorts, which lack statistical power to detect significant molecular changes, and may reflect sampling bias and disease heterogeneity.59 Availability of brain tissue, especially for rare diseases and for matched cognitively normal controls,61 is a limiting factor. Second, phenotype definitions are not unified. The dominant pathology (e.g., AD or Parkinson's disease) is often used as the label, but variable degrees of co-pathologies impact molecular signatures. Instead, multiple pathologies could be combined as a quantitative “polypathology score.” Third, hemispheric asymmetry in neuronal processes is a fundamental feature of the human brain and drives symptom lateralization (e.g., Parkinson's disease and FTD), which is reflected molecularly.62, 63 This interferes with histopathology to omics comparisons, mostly investigated in opposite hemispheres.62 Fourth, genetic variability between individuals is often not accounted for in omics studies. Fifth, there is considerable heterogeneity across studies including differences in brain regions, brain cell type compositions, protocols and platforms to generate the molecular data, and analytic pipelines used. Sixth, the influence of confounding factors, such as batch effects, post mortem interval, or ribonucleic acid (RNA)/DNA quality, can vary substantially between brain banks due to distinct standard procedures.64-66
2.2.3 Possible solutions
Achieving well-powered cohorts will require an escalation in brain donations, especially for control brains. With appropriate funding of brain banks, or through encouraging and funding brain collection in large-scale population studies, this could be achieved. The adoption of standardized procedures across brain banks is crucial to ensure preservation of appropriate and comparable quality tissue for molecular analyses, and allow seamless integration of samples from different banks. Furthermore, omics studies require deep clinical and pathological phenotyping to reduce heterogeneity and to account for covariates in subsequent data analyses.
The ML paradigm may be useful in multiple ways for the identification of reliable and discriminatory brain omics signatures. There is a clear need to integrate omics data generated for samples both from different brain regions and different cohorts, thus enabling the latent space modeling of multimodal brain omics,67 different brain regions, different cell types,68, 69 and different neurodegenerative phenotypes or diseases. This latent space will allow the uniform treatment of samples and a seamless creation of ML models for downstream tasks, such as diagnosis or interpretation.
Multi-omics data in well characterized pathology samples will allow us to refine dementia subtyping, and AI can play a major role in this. DL and computer vision can be used for generating harmonized digital pathology datasets.70 These datasets and samples can then be input into the pipeline for omics characterization. Data from such pathology-based omics studies will be harmonized across sites using a number of unsupervised learning methods. At its core, single-cell resolution analysis using tools like scVI71 relies on ML to annotate and quantify cellular components of multi-omics datasets, which can then be used for multimodal subtyping at the intersection of genomics and pathology.
2.2.4 Examples of best practice
ML approaches applied to dementia brain omics data, such as epigenomics, transcriptomics, and proteomics data, have started to emerge and illustrate the promise of using such methods to maximize findings from existing data. Huang and colleagues have recently developed EWASplus, a computational method that uses a supervised ML strategy to extend EWAS coverage to the entire genome,38 and implicates additional epigenetic loci for AD that are not found using array-based AD EWASs. Wang and colleagues implemented a DL method that analyzes RNA-seq data from brain donors to characterize post mortem brain transcriptome signatures associated with amyloid-β plaques, tau neurofibrillary tangles and clinical severity in multiple AD and related dementia populations.58 In the proteomics space, Tasaki and colleagues applied a deep neural network approach to predict protein abundance from mRNA expression, in an attempt to track the early protein drivers of AD and related dementia subtypes.72 These approaches demonstrate how such methodologies can be used to identify potential early protein drivers and possible drug targets for preventing or treating AD and related dementias.
2.3 Can high-dimensional omics data identify improved molecular biomarkers for dementia compared to single marker approaches?
2.3.1 State of the science
Technological advances and large, shared, international datasets allow a new approach to understanding diseases including biomarker identification. Single molecule assays, such as Simoa, allow accurate measurement of plasma proteins.73 Notably, plasma neurofilament light (NfL) has been comprehensively shown by many research groups to be substantially increased in a diverse array of neurological brain conditions when compared with age-matched controls, leading to the proposal of NfL being the first established blood-biomarker for neurological and cognitive decline.74 Targeted biomarkers such as NfL have begun to be translated into clinical settings but the use of multi-omics data has so far been limited. However, omics modalities present opportunities for the identification and application of new biomarkers. For example, most dementias appear to have a considerable polygenic component, which presents potential for multi-assay risk biomarkers. Genome sequences comprising petabytes of data can be resolved to common single nucleotide variation, rare variants, and structural variants, all with potential as markers of disease risk. RNA expression data are currently used in biomarker discovery, though they have not yet achieved the accuracy of blood proteins in disease prediction.75, 76
DNA methylation data can provide a route to identify non-recorded environmental exposures through imputation of these risk factors from published predictors.77 This strategy could help validate epidemiological reports of environmental risk factors and help stratify patients across diagnostic boundaries, which may provide stimuli for additional analyses and clinical follow-up.78 Genes where DNA methylation is altered by specific environmental factors could identify molecular pathways of relevance across dementias. In addition to serving as markers of aging, DNA methylation profiles have also been used as predictors of cognitive function.79 However, before these markers can be translated to the clinic, they would need to demonstrate stringent accuracy in independent validation cohorts.
While these multimodal datasets described above can contribute to biomarker discovery, many diagnostics companies and regulatory bodies prefer a single readout approach. This is contrary to the basic concept that multimodal data can more accurately reflect complex biological systems.
2.3.2 What problems need addressing?
The development of large harmonized omics datasets is challenging. The first challenge relates to the issue of data quality: high dimensional omics data are acquired from different sources, in distinct formats and over multiple sites, and accompanied by patient medical records. As errors may occur during measurement or processing (e.g., batch effects), they risk compromising the reproducibility and the usability of the generated data. The second challenge is of a computational nature: the preliminary analyses of multi-omics data require a data harmonization process and the development of integration, clustering, functional characterization, and visualization tools. Beyond this step, one of the goals of biomarker studies is inference about, and prediction of, biological systems.80 The statistical methods traditionally deployed for inference require explicit assumptions, which are not necessarily intuitive in large omics datasets.81 In addition, given dimensionality constraints posed by integrating large multiple omics datasets, the computational burden and storage space requirements can be limiting. The last challenge is to make these datasets sharable and accessible to a large community.82 The development of a large omics dataset therefore requires establishing standardized protocols for the acquisition, transfer, and analysis of clinical and omics data that can be used by the scientific research community.
At its core, the issue with the multimodal datasets needed for building the next generation of complex biomarkers is both a wide data and a sparsity problem. Studies are simply not large enough or similar enough, and their data are not accessible enough, to identify better biomarkers which have clinical relevance.
2.3.3 Possible solutions
Recently, ML approaches have made considerable advances in genomics, multi-omics, biomedicine, and data-driven therapeutics discovery.83-85, 39 Application of DL approaches to large scale omics datasets allows researchers to detect new disease relationships within the data. Translating these discoveries into multi-panel tests will be key in applying potential biomarkers. As the costs of omics assays continue to drop, the standard use of high-throughput DNA, RNA, protein, and metabolomics biomarkers in the clinic needs to become a reality. Large-scale sequencing initiatives that focus on the genomic underpinnings of neurodegenerative diseases41, 86-90 will aid in the development of more targeted and cost-effective tests such as PRSs and metabolite panels.91 Collectively, these initiatives will enable many opportunities for biomarker identification and for validation in both diagnosis and early disease detection, and will also raise important ethical and technical challenges.
In its simplest terms, information theory dictates that adding impactful and independent features to a model should improve its predictive performance, although limiting analyses to such features may be difficult due to wide data issues in genomics. In ML, it is relatively common to face high-dimensionality problems in which the number of features is much greater than the number of samples, which is why feature selection has become an increasingly pressing problem in recent decades.92, 93 In addition, techniques such as federated learning94 are likely to be useful in analyzing biomarkers across datasets that cannot safely be combined for ethical or practical reasons.
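As a rough illustration of the federated idea mentioned above, the sketch below trains a one-parameter linear model across three hypothetical sites using a federated averaging (FedAvg-style) loop. Only model weights and sample counts are exchanged; individual-level records stay at each site. The site names, data values, learning rate, and number of rounds are all assumptions made for the example, and a real deployment would add secure aggregation, privacy safeguards, and a more realistic model.

```python
# Hypothetical per-site data: (biomarker level, outcome) pairs that never leave the site.
SITE_DATA = {
    "site_a": [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)],
    "site_b": [(1.5, 3.0), (2.5, 5.1)],
    "site_c": [(0.5, 1.0), (1.0, 1.9), (2.0, 4.1), (4.0, 8.0)],
}


def local_update(weight, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps for y ~ weight * x on one site's data."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_average(site_data, rounds=50):
    """FedAvg-style loop: only model weights and sample counts are shared."""
    global_weight = 0.0
    total = sum(len(d) for d in site_data.values())
    for _ in range(rounds):
        # Each site starts from the current global weight and trains locally.
        updates = [(local_update(global_weight, d), len(d)) for d in site_data.values()]
        # The coordinator averages site weights in proportion to sample size.
        global_weight = sum(w * n for w, n in updates) / total
    return global_weight


slope = federated_average(SITE_DATA)
print(f"federated slope estimate: {slope:.2f}")
```

The averaged model is not guaranteed to match what pooled training would give when sites differ systematically, which is one reason harmonized protocols remain important even under federated designs.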
2.3.4 Examples of best practice
Analyzing datasets from independent cohorts and then combining them in a meta-analysis can improve statistical power and the ability to detect significant associations. For example, a meta-analysis of 569 lipidomics species measured in the Australian Imaging, Biomarkers and Lifestyle (AIBL) cohort and the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort identified multiple lipids from several species predictive of prevalent and incident AD.95 Within cohort integration of data modalities can also yield novel disease markers, for example, coexpression networks of metabolite and gene expression data from the ADNI cohort identified new metabolite candidate markers.96 The European Medical Information Framework Alzheimer's Disease (EMIF-AD) project (http://www.emif.eu/emif-ad-2/), set up a pan-European platform for large-scale research on biomarkers and risk factors for neurodegenerative disorders. The EMIF-AD Multimodal Biomarker Discovery study harmonized and pooled clinical data from 11 cohort studies and samples from cerebrospinal fluid (CSF), plasma, DNA, and magnetic resonance imaging (MRI) scans were centrally analyzed using different omics techniques (proteomics, metabolomics, and genomics) and integrated analysis has demonstrated the power of such approaches. The Accelerating Medicines Partnership—Alzheimer's Disease (AMP-AD) (https://www.nia.nih.gov/research/amp-ad) allows researchers to access multiple cohorts via a single platform. It is a partnership between government, industry, and nonprofit organizations to transform the current model for developing new diagnostics and treatments for AD. The sharing of multi-omics datasets through this centralized data infrastructure, the AD Knowledge Portal, enables integrative and collaborative analyses to more easily and effectively advance biomarker identification and replication. Improved standardization and harmonization of multi-omics data across silos will benefit the field in the future. 
In addition, combining multi-omics and clinical data with wearable or other streaming data may yield exciting results, as has been seen in the Parkinson's disease field with Rune Labs’ Apple Watch app (https://www.accessdata.fda.gov/cdrh_docs/pdf21/K213519.pdf).
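As an illustration of how per-cohort results can be pooled in a meta-analysis, the following is a minimal sketch of a fixed-effect, inverse-variance weighted combination. It is not the procedure used in the cited lipidomics study; the function name and the two effect estimates are hypothetical and chosen only to show the arithmetic.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Pool per-cohort effect estimates using inverse-variance weights."""
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical log-odds ratios for one analyte in two independent cohorts.
pooled, pooled_se = fixed_effect_meta([0.20, 0.40], [0.10, 0.10])
print(f"pooled effect = {pooled:.3f}, SE = {pooled_se:.3f}")
```

The pooled standard error is smaller than either cohort's own standard error, which is the gain in statistical power that motivates combining cohorts.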
2.4 How do we use genetics to inform our understanding of causal risk factors?
2.4.1 State of the science
It was recently estimated that reducing modifiable risk factors could prevent around 40% of all-cause dementia cases.97 However, the evidence base for most hypothesized risk factors being causal is weak, with conflicting findings across studies depending on study design, time of risk factor measurement, type of outcome, sample size and study population.97, 98 Many studies are prone to bias by unmeasured or residual confounding, reverse causation due to dementia's long latency period, and survival bias. Traditionally, randomized controlled trials (RCTs) have been necessary to confirm causal pathways between a risk factor and an outcome. However, these are notoriously challenging for dementia research because they would require monitoring participants over many decades due to the long and ill-defined prodromal period of dementia. In addition, it would be impractical or unethical to conduct an RCT of harmful risk factors such as air pollution and traumatic brain injury. These limitations make it difficult to ascertain which risk factors would be the most useful to target in interventions, and at what point in life such interventions would be most efficacious.
Mendelian randomization (MR) gives us a strong foundation to interrogate the causal status of risk factors. MR overcomes several limitations inherent to observational research, while utilizing more easily accessible cross-sectional rather than prospective data.99 MR uses genetic variants as instrumental variables (IVs) for risk factors in what has been dubbed a natural RCT. Because an individual's genome is assigned randomly at conception, it is largely independent of confounding factors that often cause bias in observational research. The genome also cannot be modified by subsequent disease, making bias due to reverse causation unlikely. MR is a widely used method and can be a useful tool for understanding the etiology of risk factors,100-104 but it also has limitations that should be carefully considered.105, 106 Despite the clear advantages of MR studies, few other methods have been developed that can explore the causal relationships between risk factors and dementia-related outcomes.
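To make the instrumental-variable logic concrete, the sketch below computes per-variant Wald ratios and an inverse-variance weighted (IVW) estimate from summary statistics. The IVW estimator is a commonly used choice that is assumed here for illustration; the text above does not prescribe a specific estimator, and all effect sizes are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted MR estimate from per-variant summary statistics.

    Equivalent to a weighted regression of outcome effects on exposure
    effects through the origin, with weights 1 / se_outcome**2.
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical variant effects on a risk factor (bx) and on dementia (by).
bx = [0.10, 0.20, 0.15]
by = [0.05, 0.10, 0.075]
se = [0.02, 0.02, 0.02]

# Wald ratio: the causal estimate implied by each variant on its own.
wald_ratios = [b_y / b_x for b_x, b_y in zip(bx, by)]
causal_effect, causal_se = ivw_estimate(bx, by, se)
print(wald_ratios, causal_effect, causal_se)
```

When all variants are valid instruments, the individual Wald ratios estimate the same quantity and the IVW estimate combines them with more precision than any single variant provides.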
2.4.2 What problems need addressing?
There are several common problems that can impact causal inference if they are not duly addressed and can lead to unreliable conclusions being made. Power is problematic in many MR studies examining causality of risk factors on dementia.100 Confidence intervals are often wide, so meaningful effects in either direction cannot be excluded. This is often the case for risk factors that are difficult to measure (e.g., sleep disturbance and physical inactivity).107, 108 Weak instruments (i.e., those with an F-statistic <10) can introduce bias.109 Examples of strong instruments that have been used in MR of dementia risk include plasma glucose,110 educational attainment and intelligence,111 type-2 diabetes mellitus and glycated hemoglobin (HbA1c),112 but these only represent a small fraction of dementia risk factors.
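The F-statistic threshold mentioned above can be screened for directly from GWAS summary statistics. The sketch below uses the single-variant approximation F ≈ (beta / SE)², which is an assumption made here for illustration; the variant names and values are hypothetical.

```python
def approx_f_statistic(beta_exposure, se_exposure):
    """Approximate per-variant F-statistic from an effect size and its SE."""
    return (beta_exposure / se_exposure) ** 2


# Hypothetical instruments: (effect on exposure, standard error).
candidates = {
    "variant_a": (0.10, 0.02),
    "variant_b": (0.03, 0.02),
}

f_stats = {
    name: approx_f_statistic(beta, se) for name, (beta, se) in candidates.items()
}
# Conventional rule of thumb: F < 10 flags a weak instrument.
strong = [name for name, f in f_stats.items() if f >= 10.0]
weak = [name for name, f in f_stats.items() if f < 10.0]
print(f_stats, strong, weak)
```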
Collider bias can also be introduced into causal analyses when an included sample suffers from selection bias, for example, due to differential patterns of survival associated with the risk factor of interest.113 Individuals need to live long enough to obtain a dementia diagnosis so observed causal effects of any risk factor associated with premature mortality (e.g., smoking) on dementia risk are likely biased.114 Very few studies attempt to identify and, if necessary, correct for survival bias, despite it being demonstrated to produce spurious protective effects in MR studies of causal risk factors for AD and Parkinson's disease.115, 116 Causal analyses may also be biased by population effects that confound the relationship between the genetic instrument and outcome variable (violating the ‘independence’ MR assumption117). Certain dementia risk factors, such as educational attainment, have been shown to be highly influenced by assortative mating (i.e., non-random mating) within populations,117 but this has not yet been systematically assessed in studies of dementia risk factors, so we do not know the extent to which current causal estimates are being biased by these population effects.
Confounding due to horizontal pleiotropy is especially problematic in MR studies that measure the causal association between a complex risk factor (i.e., a phenotype which is highly polygenic) and an outcome. It is becoming increasingly apparent that many SNPs in the genome causally influence multiple traits, making the “exclusion restriction” MR assumption (i.e., that the only path between the genetic instrument and the outcome is via the exposure) less likely to be upheld. In addition, even though many dementia risk factors are genetically inter-correlated118 and co-occurrence of multiple risk factors within an individual increases dementia risk more than being exposed to a single risk factor,119 most studies only measure the causality of one risk factor on dementia. By only measuring bivariate relationships, we are likely overlooking synergistic effects or overlapping causal pathways between dementia risk factors, reducing our ability to identify shared biological pathways that are especially central in raising dementia risk and to characterize the patterns of pleiotropic effects between risk factors. There are methods to disentangle this, such as genomic or transcriptomic structural equation modeling (SEM),120, 121 but they require well-powered GWAS, which are not available for all risk factors.
Aside from MR, few causal modeling methods have been developed for use with genetic data. Even in cases where new causal methods have been proposed, such as Bayesian network analysis (BN),122 latent causal variable analysis (LCV),123 and the multi-SNP mediation intersection-union test (SMUT),124 these have not yet been applied in dementia risk factor research and there is a noticeable lack of causal ML modeling in the genomics field.
2.4.3 Possible solutions
One of the key ways that AI methods could be harnessed to improve causal analyses in dementia research is to use ML/DL to strengthen genetic instruments for MR. Traditionally, instruments are created from GWAS summary statistics that are measured using logistic regression and defined p-value thresholds, whereas COMBI28 and DeepCOMBI33 use Support Vector Machines (SVM) and deep neural networks, respectively, to identify SNPs related to a phenotype. Particularly, DeepCOMBI has been shown to replicate known disease loci, as well as identify novel ones. DeepMR integrates ML with MR by using multi-task DL models to initially learn the relationship between different sets of genomic marks (e.g., chromatin marks) associated with a pathway or phenotype of interest and then uses MR to examine causal relationships between them,125 which could help to identify more functionally relevant SNPs for inclusion in the exposure instrumental variable.
Existing methods that quantify and correct for known sources of bias should also be routinely implemented. Automated AI methods could help support this. One example is MR-MoE (MR-Mixture of Experts), an ML framework that applies random forest learning algorithms to MR results to identify the method that is least likely to be biased by horizontal pleiotropy for a given analysis.126
Several of the associations between dementia and its risk factors are likely non-linear. For example, the association between sleep duration and dementia is likely to be U-shaped: both too little and too much sleep have been associated with increased dementia risk.97, 127, 128 In this instance, sleep duration is a discrete categorical rather than a truly continuous phenotype, and its genetic instruments are weak in comparison with other risk factors.110 Non-linear MR accounts for non-linearity between continuous exposures and outcomes,129 but it has scarcely been applied to MR studies of dementia risk. One recent study used non-linear MR to assess the causal influence of sleep duration on dementia-related cognitive outcomes.130 Thus, to use MR to understand non-linear relationships between risk factors and dementia, we should focus future GWAS efforts on improving the modeling of continuous risk factors in situations where observational evidence suggests that there is a non-linear causal relationship with dementia.
Room for future improvement includes the potential leveraging of tree-based, boosted, bagged, or other ML algorithms to create interpretable model cascades of causal risk. This could increase the value of previous MR studies while at the same time addressing their shortcoming of generally focusing on only a single exposure at a time. AI has the power to model multiple potentially connected causal risk factors at scale.
2.4.4 Examples of best practice
Recently, a multivariate GWAS was performed using random forest regression to predict causal SNPs for 56 neuroimaging phenotypes, which identified the APOE SNP rs429358 as the top locus as well as additional lead SNPs that mapped to genes relevant to brain disorders, which were not identified by traditional linear regression methods.131 Another study introduced the MR-based Structure Learning (MRSL) algorithm, which used graph theory combined with multivariable MR to uncover causal and mediating pathways between 44 diseases and 26 biomarkers using publicly available GWAS summary statistics.132 Together, these results highlight the potential benefits of utilizing ML-based multivariate approaches to model the genetics underlying inter-correlated risk factor traits when performing causal analyses in dementia research.
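For readers unfamiliar with multivariable MR, the following is a minimal sketch of a two-exposure inverse-variance weighted estimator, fitted as weighted least squares through the origin. It is not the MRSL algorithm and omits its graph-learning step; the restriction to two exposures, the function name, and all summary statistics are hypothetical simplifications.

```python
def multivariable_ivw(bx1, bx2, by, se_y):
    """Jointly estimate the direct effects of two exposures on an outcome.

    Solves the 2x2 weighted normal equations for
    by ~ theta1 * bx1 + theta2 * bx2 (no intercept), weights 1 / se_y**2.
    """
    w = [1.0 / s ** 2 for s in se_y]
    s11 = sum(wi * a * a for wi, a in zip(w, bx1))
    s22 = sum(wi * b * b for wi, b in zip(w, bx2))
    s12 = sum(wi * a * b for wi, a, b in zip(w, bx1, bx2))
    s1y = sum(wi * a * y for wi, a, y in zip(w, bx1, by))
    s2y = sum(wi * b * y for wi, b, y in zip(w, bx2, by))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("exposure effects are collinear")
    theta1 = (s1y * s22 - s2y * s12) / det
    theta2 = (s2y * s11 - s1y * s12) / det
    return theta1, theta2


# Hypothetical variant effects on two correlated risk factors.
bx1 = [0.10, 0.20, 0.05, 0.15]
bx2 = [0.05, 0.02, 0.20, 0.10]
# Outcome effects generated so that only the first exposure matters.
by = [0.3 * a for a in bx1]
se_y = [0.02, 0.02, 0.02, 0.02]

theta1, theta2 = multivariable_ivw(bx1, bx2, by, se_y)
print(theta1, theta2)
```

Fitting both exposures jointly attributes the outcome effect to the first exposure alone, whereas a separate univariable analysis of the second exposure could pick up part of that signal through shared variants.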
Noyce and colleagues previously assessed the impact of survival bias on estimates of the causal effect of body mass index (BMI) on Parkinson's disease.116 They performed simulations to estimate the effect that their MR analysis would likely show if survival bias were present, assuming that BMI was not truly related to Parkinson's disease. The objective was to see if the likely magnitude of the survival bias was large enough to explain the MR results estimated from the real data. They demonstrated that the seemingly protective effect of higher BMI on Parkinson's disease risk was likely due to survival bias related to increased frailty in people with lower BMI, rather than reflecting a true causal effect. Since effects from survival bias are likely to be especially important for causal analysis of risk factors in dementia research, it is crucial that we start to consistently test for this and other common forms of bias in future studies to minimize the impact of spurious findings within our field.
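The general logic of such a simulation can be sketched as follows. This is not the simulation performed by Noyce and colleagues; it is a generic toy model in which an exposure has no effect on disease liability, but both the exposure and an unmeasured frailty factor reduce survival. All parameter values, variable names, and the logistic survival model are assumptions chosen for illustration.

```python
import math
import random


def slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=200_000, seed=1):
    """Return rows of (genotype, exposure, disease liability, survived)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = (rng.random() < 0.3) + (rng.random() < 0.3)  # allele count 0, 1, or 2
        u = rng.gauss(0.0, 1.0)                # unmeasured frailty
        x = 0.5 * g + rng.gauss(0.0, 1.0)      # exposure, raised by the variant
        y = u + rng.gauss(0.0, 1.0)            # liability: NO effect of x
        # Both the exposure and frailty lower the chance of surviving to old age.
        survived = rng.random() < 1.0 / (1.0 + math.exp(x + u))
        rows.append((g, x, y, survived))
    return rows


def wald_ratio(rows):
    """MR estimate: genotype-outcome slope over genotype-exposure slope."""
    g = [r[0] for r in rows]
    x = [r[1] for r in rows]
    y = [r[2] for r in rows]
    return slope(g, y) / slope(g, x)


rows = simulate()
everyone = wald_ratio(rows)
survivors = wald_ratio([r for r in rows if r[3]])
print(f"MR estimate, full cohort:    {everyone:.3f}")
print(f"MR estimate, survivors only: {survivors:.3f}")
```

Because survivors who carry exposure-raising alleles must on average have lower frailty to have survived, restricting the analysis to survivors induces a spurious protective estimate even though the true effect in this toy model is zero. Comparing the size of such a simulated bias with the estimate from real data is the kind of check described above.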
2.5 Which biological processes are altered by genetic risk for dementia-related diseases?
2.5.1 State of the science
Highly penetrant variants in APP, PSEN1, or PSEN2 have pointed to a central role of amyloid-β in early-onset AD.133 Separately, GWAS for late-onset AD identified several biological processes enriched for genes associated with disease risk, including amyloid-β processing, lipid metabolism, and immune responses.134, 135 Although most AD GWAS associations are non-coding, rare coding variants have implicated key microglial genes such as TREM2 and PLCG2.135, 136 Follow-up experiments in cellular and animal models confirmed the effects of these genes on microglial activation and lipid processing.137, 138 Epigenomic maps from purified cell populations139 or single cells140 have localized non-coding AD risk variants to microglia-specific enhancers, regulating genes including BIN1 and RIN3. An alternative way of linking risk variants to genes is to identify quantitative trait loci (QTLs) that influence gene expression, followed by a test for statistical colocalization with nearby GWAS loci. A variation on MR, summary-data-based MR (SMR), is often used to support causal inferences about the role of these QTLs in disease risk at the level of individual genes. Recent studies in purified microglia from living141 or post mortem142, 143 donors have nominated some AD and Parkinson's disease risk genes, but so far they are underpowered relative to bulk brain datasets. Thus, while genetic studies of AD indicate a clear role of microglia,144, 135, 136, 141, 145 the roles of specific cell types are still being discovered in other neurodegenerative conditions, such as Parkinson's disease139, 146 and amyotrophic lateral sclerosis.147
2.5.2 What problems need addressing?
GWAS for different dementias have so far mainly used a case-control framework to identify genetic loci associated with a clinical diagnosis. However, this approach ignores the complexity of neuropathological changes that occur in patients, which usually predate clinical symptoms by years or decades, and which may involve multiple distinct pathologies.54, 148 The decoupling of genetic associations from specific pathologies makes it difficult to identify the most relevant cellular model for a given locus. In the absence of this link, most cellular models have focused on a single cell type, and thereby fail to elucidate the probable interplay between different cell types that leads to neurodegeneration. Furthermore, identifying and validating the causal genes at GWAS loci remains challenging, due to uncertainty in both the specific causal variants and the cell types through which they act.149 Additionally, GWAS loci may arise only in a specific cellular state, such as response to a pathology, as has been recently shown for the UNC13A amyotrophic lateral sclerosis/FTD locus.150, 151 As a result, the genes and biological processes that are identified as relevant have depended largely upon the prior hypotheses of investigators and on the cellular models and analysis methods that were used. Although the scale and resolution of single-cell transcriptomic and epigenomic datasets are increasing, there is not yet a robust and reproducible catalog of all cell types and cell states relevant to brain function and disease processes. Additionally, curated resources cataloging genes involved in many biological processes are often subject to publication, funding, and reporting biases.
2.5.3 Possible solutions
New technologies have the potential to improve our understanding of neurodegenerative diseases, if applied systematically and at scale. Single-cell technologies are beginning to reveal the cell type diversity of the human brain,152 and to identify cell type-specific gene expression changes in disease.140, 153 The GTEx project154 was transformative in describing gene regulation across human tissues, enabling others to link these genetic effects to human disease risks. However, its sampling of bulk tissues limits its use for understanding biological mechanisms. Single-cell technologies now make it possible to envision a cell type-specific gene regulatory atlas of the human brain. Such an atlas should be built in a robust way across multiple labs, and include both healthy and diseased donors of different ages.
We must also seek to recapitulate the spatial dimension of cell type localization and gene expression. Only by probing gene expression directly in a tissue section can we reliably establish organ-wide patterns of gene expression, reconstruct cell-cell interactions and assess how neuropathology affects local gene expression. Mouse models have highlighted how amyloid plaques influence oligodendrocyte and microglia gene expression across disease stages.155 Going forward, a brain-wide, spatially-resolved gene expression atlas, possibly integrating splicing information,156 would be a rich complement to a standard gene regulatory atlas.
To understand the molecular mechanisms of neurodegenerative disease genetic associations, we need to perturb the function of candidate genes and measure their effects in relevant cellular models. However, an ad hoc approach in the most accessible cell types will not lead to robust conclusions. With CRISPR-based tools, these perturbations can be done at genome-wide scale, in specific cell types derived from human induced pluripotent stem cells (iPSCs), and with high-throughput phenotyping assays. As a community, we should coordinate to systematically investigate a broad set of candidate genes, across multiple cellular phenotypes and in a range of cellular models. Additionally, as part of therapeutic development, these perturbation screens will likely need to be carried out across networks upstream of known targets.
2.5.4 Examples of best practice
For psychiatric disease, the PsychENCODE project set an example by collecting multiple types of omic data from over a thousand post mortem brains across three diseases and three brain regions.157, 46, 158 Crucially, integrative analyses need to leverage these multiple omic layers to generate novel insights, as demonstrated in previous studies of bulk brain.46, 159 Recent studies have used scRNA-seq methods to examine specific brain regions in disease and control individuals for AD,153, 160 amyotrophic lateral sclerosis and FTD,161 revealing cell type-specific effects of disease pathology. For all of these datasets and analyses to be most useful, robust ML methods are needed to integrate distinct omics modalities and to ensure reproducible results. Promising approaches in this direction have recently been applied to large-scale single-cell data from mouse motor cortex,162 and the human immune system.163
As genetic studies of dementias increase in size, so does the need to identify the causal genes at associated loci. New methods enable enhanced fine-mapping using functional genomic data (e.g., PolyFun164), and better prediction of enhancer-promoter connections (e.g., activity-by-contact score). One such example is the identification of USP6NL as the putative causal gene within the AD GWAS locus “ECHDC3” by linking a functionally fine-mapped variant within a microglia enhancer with the USP6NL promoter.142 This finding was further supported by strong colocalization between the GWAS and eQTL signals. This methodology has also been applied to Parkinson's disease.165 DL models have also shown dramatic improvements in predicting the effects of genetic variants on splicing, pathogenicity (coding variants), and gene expression. Along with experimental data, both variant effect predictions and fine-mapping data can be used as input to ML methods that directly predict the most likely causal genes at GWAS loci.
Beyond cellular maps and genetic associations, a systematic approach to model systems is needed. A National Institutes of Health (NIH)-funded project, the iPSC Neurodegenerative Disease Initiative (iNDI),166 is creating more than 100 isogenic iPSC lines with mutations associated with dementias. How these are used to model neurodegeneration in specific derived cell types will be up to the creativity and vision of the research community.
Clustered regularly interspaced short palindromic repeats (CRISPR)-based studies and methods such as Perturb-seq and CROP-seq have pushed the boundaries of what can be assayed rapidly with edited cell lines.167 These techniques are already sought after by biotechnology companies looking to quantify the upstream and downstream effects of genetic and genomic therapeutic targets. Enough data of this type, combined with DL to recognize patterns of functionally connected genes or with graph-based network models, could identify communities of risk factors that are functionally connected to disease risk.168 These new communities could serve as less biased pathways derived from the appropriate tissues and cell types.
3 LIMITATIONS OF AI AND ML IN THE DEMENTIA OMICS FIELD
High-throughput methods, such as the full suite of omics platforms, including genomic, transcriptomic, epigenomic, proteomic, metabolomic, and related technologies, have inaugurated a new era of systems biology. They provide abundant and detailed data, which conventional analytical and statistical approaches are often not capable of dealing with. AI and ML algorithms, which are designed to automatically mine data for insights into complex relationships in these massive datasets, are still in their infancy in dementia genetics and omics research, and far from being explored to their full capacity. Despite major strengths and achievements so far, it is worth keeping in mind possible caveats of AI models in the omics field, including the following examples: (1) Interpretation (the black box): the complexity of certain models often makes it difficult to understand the learned patterns, and consequently it is challenging to infer the causal relationship between the data and an outcome; (2) “Curse” of dimensionality: omics datasets comprise a huge number of variables and often a small number of samples, as mentioned in multiple sections of this paper; (3) Imbalanced classes: most models applied to omics data deal with disease classification problems (e.g., use of major pathology labels in the presence of co-pathologies, as mentioned in section 2.2); and (4) Heterogeneity and sparsity: data from omics applications are often heterogeneous and sparse since they come from subgroups of the population (e.g., as highlighted in section 2.1), different platforms (e.g., multiple array and sequencing based platforms), and multiple omics modalities (e.g., transcriptomics, epigenomics, proteomics), and are often resource intensive to generate. Many of these limitations, however, can be overcome with improvements to data generation (e.g., larger, more diverse, harmonizable studies) and analysis (e.g., using dimensionality reduction strategies and interpretable ML approaches).
4 CONCLUDING REMARKS
In conclusion, omics technologies, including genomics, epigenomics, transcriptomics, proteomics, and metabolomics, can provide increasingly comprehensive high-dimensional insights into the biological system of each individual when combined with AI approaches. This in turn can contribute immensely to a better understanding of AD and other forms of dementia, and to the development of personalized medicines. However, a number of thorny issues hamper the use of omics technologies and AI in dementia research. These include the need for better, more comprehensive, and less biased dementia-related genetics and omics data resources, the development of improved AI algorithms, and the need for more multidisciplinary collaboration. Increased funding, a more coordinated global effort, and a greater number of diverse and deeply phenotyped cohorts, together with innovative AI methods, have the potential to overcome these challenges and to increase the pace of discovery. Ultimately, this would have a major impact on our understanding of the underlying disease processes and help to improve the prevention, diagnosis, and treatment of dementia.
AUTHOR CONTRIBUTIONS
Conceicao Bettencourt, Nathan Skene, and Sara Bandres-Ciga contributed to the conception of the work, drafting and revision of the manuscript for intellectual content. Conceicao Bettencourt, Emma Anderson, Laura M. Winchester, Isabelle F. Foote, Jeremy Schwartzentruber, and Juan A. Botia contributed to coordinating the writing team, drafting and revision of the manuscript for intellectual content. Mike Nalls, Andrew Singleton, Brian M. Schilder, Jack Humphrey, Sarah J. Marzi, Christina E. Toomey, Ahmad Al Khleifat, Eric L. Harshfield, Victoria Garfield, Cynthia Sandor, Samuel Keat, Stefano Tamburin, and Carlo Sala Frigerio contributed to drafting and revision of the manuscript for intellectual content. Janice M. Ranson and David J. Llewellyn contributed to the conception of the work, conceived and organized the symposium from which this paper and others in the series originated, revised the manuscript for intellectual content, and harmonized the manuscript with other papers in the series. Ilianna Lourida revised the manuscript for intellectual content and harmonized the manuscript with other papers in the series. All authors read and approved the final manuscript.
ACKNOWLEDGMENTS
With thanks to the Deep Dementia Phenotyping (DEMON) Network State of the Science symposium participants (in alphabetical order): Peter Bagshaw, Robin Borchert, Magda Bucholc, James Duce, Charlotte James, David Llewellyn, Donald Lyall, Sarah Marzi, Danielle Newby, Neil Oxtoby, Janice Ranson, Tim Rittman, Nathan Skene, Eugene Tang, Michele Veldsman, Laura Winchester, Zhi Yao. This paper was the product of a DEMON Network state of the science symposium entitled “Harnessing Data Science and AI in Dementia Research” funded by Alzheimer's Research UK. C.B. is supported by Alzheimer's Research UK (ARUK-RF2019B-005) and Multiple System Atrophy Trust. N.S. is supported by the UK Dementia Research Institute which receives its funding from UK DRI Ltd., funded by the UK Medical Research Council, Alzheimer's Society and Alzheimer's Research UK. N.S. also received funding from a UKRI Future Leaders Fellowship (MR/T04327X/1). E.A. is supported by MRC Skills Development Fellowship (MR/W011581/1) and UKRI Future Leaders Fellowship (MR/W011581/1). L.W. is supported by Alzheimer's Research UK. I.F.F. is supported by the National Institute on Aging (RF1AG073593). M.A.N.’s participation in this project was part of a competitive contract awarded to Data Tecnica International LLC by the National Institutes of Health to support open science research. J.H. is supported by the NIH National Institute of Neurological Disorders and Stroke (U54NS123743). S.J.M. is funded by the Edmond and Lily Safra Early Career Fellowship Program and the UK Dementia Research Institute, which receives its funding from UK DRI Ltd., funded by the UK Medical Research Council, Alzheimer's Society and Alzheimer's Research UK. A.A.K. is funded by ALS Association Milton Safenowitz Research Fellowship (grant number 22-PDF-609; DOI: 10.52546/pc.gr.150909), The Motor Neurone Disease Association (MNDA) Fellowship (Al Khleifat/Oct21/975-799), The Darby Rimmer Foundation, and The NIHR Maudsley Biomedical Research Centre. E.L.H. is supported by the Alzheimer's Society (AS-RF-21-017) and the Cambridge British Heart Foundation Centre of Research Excellence (RE/18/1/34212). V.G. is supported by Diabetes UK (15/0005250), British Heart Foundation (SP/16/6/32726) and Professor David Matthews Non-Clinical Fellowship from the Diabetes Research and Wellness Foundation (SCA/01/NCF/22). C.S. is supported by the UK Dementia Research Institute (UK DRI) funded by the Medical Research Council (MRC), Alzheimer's Society and Alzheimer's Research UK, and by the Ser Cymru II programme which is part-funded by Cardiff University and the European Regional Development Fund through the Welsh Government. S.K. is supported by a PhD studentship award from Alzheimer's Society, UK (AS-PhD-19b-014) and the Ser Cymru II programme. J.M.R. and D.J.L. are supported by Alzheimer's Research UK and the Alan Turing Institute/Engineering and Physical Sciences Research Council (EP/N510129/1). D.J.L. also receives funding from the Medical Research Council (MR/X005674/1), National Institute for Health Research (NIHR) Applied Research Collaboration South West Peninsula, National Health and Medical Research Council (NHMRC), and National Institute on Aging/National Institutes of Health (RF1AG055654). This research was supported in part by the Intramural Research Program of the NIH, National Institute on Aging (NIA), National Institutes of Health, Department of Health and Human Services; project number ZO1 AG000535 and ZIA AG000949, as well as the National Institute of Neurological Disorders and Stroke (NINDS). The views expressed in this publication are those of the authors and not necessarily those of the NIHR, NHS, or UK Department of Health and Social Care. This manuscript was facilitated by the Alzheimer's Association International Society to Advance Alzheimer's Research and Treatment (ISTAART), through the AI for Precision Dementia Medicine Professional Interest Area (PIA).
The views and opinions expressed by authors in this publication represent those of the authors and do not necessarily reflect those of the PIA membership, ISTAART or the Alzheimer's Association.
CONFLICT OF INTEREST STATEMENT
J.S. is an employee of Illumina Inc. M.A.N. currently serves on the scientific advisory board for Character Biosciences Inc. and Neuron 23 Inc. All other authors declare no competing interests.