Artificial intelligence for diagnostic and prognostic neuroimaging in dementia: A systematic review
Abstract
Introduction
Artificial intelligence (AI) and neuroimaging offer new opportunities for diagnosis and prognosis of dementia.
Methods
We systematically reviewed studies reporting AI for neuroimaging in diagnosis and/or prognosis of cognitive neurodegenerative diseases.
Results
A total of 255 studies were identified. Most studies relied on the Alzheimer's Disease Neuroimaging Initiative dataset. Algorithmic classifiers were the most commonly used AI method (48%) and discriminative models performed best for differentiating Alzheimer's disease from controls. The accuracy of algorithms varied with the patient cohort, imaging modalities, and stratifiers used. Few studies performed validation in an independent cohort.
Discussion
The literature has several methodological limitations including lack of sufficient algorithm development descriptions and standard definitions. We make recommendations to improve model validation including addressing key clinical questions, providing sufficient description of AI methods and validating findings in independent datasets. Collaborative approaches between experts in AI and medicine will help achieve the promising potential of AI tools in practice.
Highlights
- There has been a rapid expansion in the use of machine learning for diagnosis and prognosis in neurodegenerative disease
- Most studies (71%) relied on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset with no other individual dataset used more than five times
- There has been a recent rise in the use of more complex discriminative models (e.g., neural networks) that performed better than other classifiers for classification of AD vs healthy controls
- We make recommendations to address methodological considerations, addressing key clinical questions, and validation
- We also make recommendations for the field more broadly to standardize outcome measures, address gaps in the literature, and monitor sources of bias
1 INTRODUCTION
There is a pressing need to improve diagnosis and prognosis for people with dementia. Up to 20% of people may receive the wrong diagnosis,1 and differentiating between early symptoms in dementia based on clinical information and neuropsychological testing alone is subjective and prone to error. There is large geographic variability in the likelihood of receiving a diagnosis, even within a single country.2 Diagnostic investigations such as neuroimaging and cerebrospinal fluid (CSF) tests can support clinical diagnosis; however, it can take years to receive a diagnosis from the initial onset of symptoms.3 Receiving a timely and accurate diagnosis is critical for people with dementia, their carers, and families:4, 5 it provides the opportunity for forward planning and, with the advent of disease-modifying treatments, an early and accurate diagnosis will guide treatment selection, working toward a precision medicine approach.6
Neuroimaging is a non-invasive investigation used in routine clinical practice to support the diagnosis of dementia.7, 8 A range of neuroimaging methods are used in dementia, and magnetic resonance imaging (MRI) is one of the most widely used to examine brain structure,9, 10 longitudinal patterns of atrophy,11 and changes in brain function.12-14 Positron emission tomography (PET) is more expensive and is available mainly in specialist centers; it is used to measure metabolic activity or, with protein-specific ligands, to identify underlying pathologies.15-17
Human clinical judgment has traditionally been used to interpret clinical neuroimaging.9 Visual rating scales may support this assessment using features such as medial temporal lobe atrophy18 and white matter hyperintensity load.19, 20 However, the development of more sophisticated approaches and richer data may mean that the most informative features are not amenable to human measurement or observation. For example, resting-state functional MRI can be used to derive a variety of connectivity metrics between thousands of nodes that are amenable to machine learning (ML) approaches.21 Deep learning methods have also demonstrated superiority to human neuroimaging interpretation.22, 23
ML algorithms facilitate the automation of neuroimaging interpretation and have the potential to reduce bias and improve clinical decision making.24-26 Neuroimaging data are well suited to analysis using ML, particularly deep learning, given their high dimensionality, non-linear nature, and high covariance. A large and growing number of ML studies have investigated how neuroimaging features can be used to predict cognitive diagnoses and conversion to dementia, fueled by the availability of large datasets, such as the Alzheimer's Disease Neuroimaging Initiative (ADNI).27 However, uncertainty remains about which ML approaches have the greatest potential to inform clinical decision making and how their performance compares to human decision making.
We therefore conducted a systematic review to establish: (1) the extent to which ML approaches for neuroimaging have been used for the diagnosis and/or prognosis of neurodegenerative diseases; (2) how this field has progressed over time; (3) methodological challenges; and (4) the future directions to facilitate the translation of ML methods for patient benefit in dementia.
This review is part of a Special Issue on “Artificial Intelligence for Alzheimer's Disease and Related Dementias” published in Alzheimer's & Dementia. Together, this series provides a comprehensive overview of current applications of artificial intelligence (AI) to dementia, and future opportunities for innovation to accelerate research. Each review focuses on a different area of dementia research, including experimental models28, drug discovery and trials optimization29, genetics and omics30, biomarkers31, neuroimaging (this article), prevention32, applied models and digital health33, and methods optimization34.
2 METHODS
We conducted a systematic review to investigate the use of ML methods for diagnosis and/or prognosis in cognitive disorders including Alzheimer's disease (AD), mild cognitive impairment (MCI), Parkinson's disease (PD), vascular dementia, Lewy body dementia (LBD), frontotemporal dementia (FTD), progressive supranuclear palsy (PSP), Huntington's disease (HD) and corticobasal degeneration (CBD). The review is reported according to PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines,35 and the protocol was registered with PROSPERO (ID: CRD42021232249) prior to the screening of abstracts.
2.1 Search strategy
The databases MEDLINE (via Ovid), Embase (via Ovid), Cochrane Library, BNI (via ProQuest), PsycINFO (via EBSCOhost), CINAHL (via EBSCOhost), and Emcare (via Ovid) were searched using the title, abstract, keyword, and MeSH term fields from inception to January 8, 2021, with the support of the Cambridge University Clinical School Library. Results were limited to English language studies. Full search terms for each database can be found in Supplementary Material 1. Studies which were known to the authors and met the inclusion/exclusion criteria of the review, but were not initially identified using the search strategy, were also included.
2.2 PICOS framework
- Participants: Patients with cognitive disorders due to neurodegenerative diseases.
- Index: Neuroimaging data assessed with ML for diagnosis and/or prognosis.
- Comparator: Traditional manual/subjective diagnostic/prognostic assessment.
- Outcome: Accuracy of diagnosis and/or prognosis.
- Study design: Controlled study.
RESEARCH IN CONTEXT
- Systematic Review: We conducted comprehensive searches of MEDLINE, Embase, Cochrane Library, BNI, PsycINFO, CINAHL, and Emcare to identify studies that examine the potential of artificial intelligence (AI) and machine learning methods applied to neuroimaging to inform clinical diagnosis and prognosis in dementia and other neurodegenerative diseases.
- Interpretation: The use of AI in neuroimaging is expanding rapidly with the evidence base being dominated by studies conducted using the ADNI dataset, algorithmic classifiers, and structural MRI focusing on Alzheimer's disease. Improved diagnostic accuracy was observed when a combination of neuroimaging modalities was used, e.g., PET and structural MRI. Findings also suggest superior performance of discriminative models compared to algorithmic and generative classifiers for the classification of Alzheimer's disease vs healthy controls.
- Future Directions: We highlight gaps in knowledge, current challenges, and issues to be addressed in future research around reproducibility and reporting, relevant clinical questions, and validation of results. We advocate wider collaboration between clinical, neuroimaging, and data science teams, and present recommendations to move toward clinically useful machine learning methods applied to neuroimaging for dementia.
2.3 Inclusion & exclusion criteria
The inclusion and exclusion criteria used during the screening process to determine which studies would be included in the systematic review can be found below:
Inclusion criteria:
- Primary research studies only.
- Patient population consisting of AD, MCI, PD, vascular dementia, LBD, FTD, PSP, CBD, HD, and/or all-cause dementia.
- Involving at least one of the following neuroimaging or neurophysiological modalities: structural or functional MRI, PET, single-photon emission computed tomography (SPECT), electroencephalogram (EEG), magnetoencephalography (MEG), or ultrasound.
- Used ML methods to investigate diagnosis and/or prognosis of cognitive neurodegenerative disease(s).
Exclusion criteria:
- Studies which did not include human participants.
- Studies published in languages other than English.
- Conference abstracts and book chapters.
- Articles which did not include primary research, for example, reviews.
- Studies where access to the full text was not available despite attempts from multiple individuals involved in the screening process.
- Studies which did not use ML methods or only used simple logistic or linear regression methods for classification.
- Studies which combined neuroimaging with other biomarkers, including CSF markers and/or genetics data, in the ML algorithms without reporting model performance for neuroimaging features alone.
- Studies which focused on automated segmentation techniques which did not directly relate to diagnosis/prognosis of neurodegenerative diseases.
- Studies which used AI methods for feature extraction but not classification.
2.4 Study selection
The initial records were identified using the search criteria. These records underwent de-duplication using a Zotero (https://zotero.com) automation tool, which flagged possible duplicate studies, and were manually screened by a reviewer to merge genuine duplicates. Following de-duplication, all studies were screened across two stages. During the first stage, each abstract was independently reviewed by two reviewers to determine its eligibility for inclusion based on the outlined criteria using the screening tool Rayyan (https://www.rayyan.ai/). Once both reviewers had screened their allocated abstracts, inclusion/exclusion decisions were unblinded. For abstracts where there was disagreement between screeners, a third independent reviewer assessed the abstract and made the final decision as to (1) progression to the full-text screening stage or (2) exclusion.
The second stage involved full-text screening of all included studies by one reviewer per paper. For studies where the reviewer was unsure if the study met the outlined criteria, a second opinion was sought and a joint decision made after discussion with the second reviewer.
2.5 Data extraction
- Article information: First author, year, journal, country of first author's affiliated institution.
- Study method: Patient population(s), neuroimaging modality, source of data. For studies using different datasets relating to a study, information regarding which specific dataset was used was extracted where possible. For example, for ADNI studies, the specific dataset used (ADNI-1, ADNI-2, ADNI-GO, J-ADNI) was identified and recorded where available.
- ML methods, extracted neuroimaging features.
- Receiver operating characteristic (ROC) curve analysis results from the ML algorithm used to predict diagnosis/prognosis in the patient population, including accuracy (ACC), sensitivity (SEN), specificity (SPE), area under the curve (AUC), positive predictive value (PPV), and/or negative predictive value (NPV).
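To make the relationship between the extracted performance metrics concrete, the following minimal Python sketch computes ACC, SEN, SPE, PPV, and NPV from a confusion matrix at a fixed decision threshold, and AUC from its rank-based (Mann-Whitney) formulation. The function names, the 0.5 threshold, and the toy labels and scores are hypothetical illustrations and are not taken from any reviewed study.

```python
from typing import Dict, Sequence


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def threshold_metrics(y_true: Sequence[int], scores: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Confusion-matrix metrics for binary labels (1 = patient, 0 = control)."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 0 and predicted == 1:
            fp += 1
        elif truth == 0 and predicted == 0:
            tn += 1
        else:  # truth == 1 and predicted == 0
            fn += 1
    return {
        "ACC": safe_ratio(tp + tn, tp + tn + fp + fn),
        "SEN": safe_ratio(tp, tp + fn),  # sensitivity (recall)
        "SPE": safe_ratio(tn, tn + fp),  # specificity
        "PPV": safe_ratio(tp, tp + fp),  # positive predictive value
        "NPV": safe_ratio(tn, tn + fn),  # negative predictive value
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random patient scores above a random control.

    Ties count as half a win. This is threshold-free, unlike the metrics above.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one patient and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical data): four patients followed by four controls.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
print(threshold_metrics(labels, model_scores))
print("AUC:", auc(labels, model_scores))
```

Note that ACC, PPV, and NPV depend on the proportion of patients in the sample, so the same model can yield different values in cohorts with different case mixes, whereas SEN, SPE, and AUC do not depend on prevalence in this way.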
2.6 Risk of bias assessment
Following the second stage of screening, all included studies were assessed for risk of bias by one reviewer using a hybrid version of the Joanna Briggs Institute (JBI) Critical Appraisal checklist covering the areas we deemed most relevant to this area of research.36 The specific questions used for risk of bias assessment and their outcome for each study can be found in Supplementary Material 2. We only excluded studies exhibiting clear methodological concerns, such as lack of reporting of basic participant demographics, in order to accurately depict and identify current barriers in the literature limiting translation to clinical practice.
2.7 Data synthesis and approaches to classification
We used descriptive statistics to determine the following characteristics of the extracted dataset: source of neuroimaging data, type of neuroimaging used, ML methods, focus on diagnosis and/or prognosis, accuracy of diagnostic/prognostic classifications, and global distribution of first authors’ institutions. Studies using MRI were labeled according to the types of features used for the classification task including volumetric structural, non-volumetric structural, and functional MRI. Volumetric structural imaging was defined as MRI methods measuring the volume of specific regions using voxel-based segmentation techniques. Studies were classified as using non-volumetric structural MRI if the features used for classification were related to cortical thickness, texture, or surface area using T1- or T2-weighted images and/or diffusion tensor imaging (DTI) data. The type of AI algorithm used for the diagnostic/prognostic classification task was extracted. Studies which used AI methods for feature extraction but not classification were excluded.
We grouped classifiers into three types:
- Generative classifiers learn the joint distribution of the features and labels.37 Examples include naïve Bayes and linear/quadratic discriminant analysis. After training, it is possible to generate (hence the name) new pairs of features and labels by sampling from the learned joint distribution.
- Discriminative classifiers learn the conditional distribution of the labels given the features.38 Examples include logistic and Gaussian process regression with potential regularization, k-nearest neighbors, and most ensemble methods (such as random forests).
- Non-probabilistic, algorithmic classifiers directly learn the decision boundary in feature space.39 Examples include maximum margin classifiers and support vector machines.
We note that some non-probabilistic classifiers can be reframed in a probabilistic light.40, 41 For this reason, some authors consider these methods to be discriminative in nature and draw less of a distinction between our discriminative and algorithmic types.
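The three types can be summarized in symbols. In the sketch below, the notation is introduced here for illustration and does not appear in the original text: x denotes the feature vector, y the label, and the linear form of f is an assumption corresponding to the simplest maximum margin case.

```latex
\begin{align*}
\text{Generative:} \quad
  & \text{learn } p(x, y) = p(x \mid y)\, p(y),
  && \hat{y} = \arg\max_{y}\; p(x \mid y)\, p(y) \\
\text{Discriminative:} \quad
  & \text{learn } p(y \mid x) \text{ directly},
  && \hat{y} = \arg\max_{y}\; p(y \mid x) \\
\text{Algorithmic:} \quad
  & \text{learn } f(x) = w^{\top} x + b,
  && \hat{y} = \operatorname{sign} f(x)
\end{align*}
```

One common way to give an algorithmic classifier a probabilistic reading is to pass its decision value f(x) through a fitted sigmoid, which yields an estimate of p(y | x); this is offered as an example of such a reframing and is not necessarily the approach of the works cited above.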
In order to determine how well a classifier generalizes to new data, models are typically evaluated using a validation set consisting of labeled data withheld from the training process. The model's predictions in the validation data can be compared to known labels using a variety of different metrics; precision, recall, accuracy, AUC, and F-scores are all estimated in this way. If a classifier performs much better on training data than on validation data, this can indicate overfitting. In such a case, the model may be refitted with regularization terms or priors that penalize model complexity.
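The gap between training and validation performance can be illustrated with a small, self-contained Python sketch. It uses purely synthetic two-class data (not imaging data) and a 1-nearest-neighbor classifier, chosen here only because it memorizes the training set and therefore makes overfitting easy to see. All names, the sample sizes, and the random seed are hypothetical.

```python
import random
from typing import List, Tuple

Sample = Tuple[Tuple[float, float], int]  # ((feature_1, feature_2), label)


def make_synthetic_data(n: int, rng: random.Random) -> List[Sample]:
    """Two overlapping Gaussian classes in 2D (synthetic, for illustration only)."""
    data = []
    for i in range(n):
        label = i % 2
        centre = 1.0 if label else 0.0
        features = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
        data.append((features, label))
    return data


def predict_1nn(train: List[Sample], x: Tuple[float, float]) -> int:
    """Return the label of the closest training sample (squared Euclidean distance)."""
    nearest = min(train, key=lambda s: (s[0][0] - x[0]) ** 2 + (s[0][1] - x[1]) ** 2)
    return nearest[1]


def accuracy(train: List[Sample], evaluation: List[Sample]) -> float:
    """Fraction of evaluation samples whose predicted label matches the known label."""
    correct = sum(predict_1nn(train, x) == y for x, y in evaluation)
    return correct / len(evaluation)


rng = random.Random(0)
data = make_synthetic_data(200, rng)
rng.shuffle(data)
train, validation = data[:150], data[150:]  # validation labels are withheld from fitting

train_acc = accuracy(train, train)       # each point is its own nearest neighbor
val_acc = accuracy(train, validation)    # estimate of performance on unseen data
print(f"training accuracy:   {train_acc:.2f}")
print(f"validation accuracy: {val_acc:.2f}")
```

Because each training point is its own nearest neighbor, training accuracy is perfect by construction, while validation accuracy reflects the overlap between the two classes. A large gap of this kind is the signature of overfitting described above; here it could be reduced by averaging over more neighbors, which plays the role of a complexity penalty.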
Following data extraction, we conducted a meta-analysis. Considering the large number of studies from a single cohort and significant overlap of datasets, there is a risk of identifying spurious associations and false-positive findings when running a comprehensive meta-analysis.42-44 We attempted to overcome these barriers by running a focused evaluation of the performance of ML algorithms, measured with AUC values, for a specific task: classification of AD versus healthy controls. This was achieved using a Stratified Weighted Random Method (SWRM) approach by assigning weights to the datasets and features (see further methodological details in Supplementary Material 1).
3 RESULTS
The initial search strategy yielded 2709 studies, which underwent abstract screening following de-duplication. Three additional studies which were not picked up in the initial search strategy but met the inclusion criteria were identified by experts in the field and underwent full-text screening. The studies were consolidated to 255 studies after full-text screening (full list of references in Supplementary Material 3). A flow chart of the screening process reported according to the PRISMA 2020 guidelines35 is shown in Figure 1. The publication time period ranged from 2005 to 2021. The included studies were classified by country based on the institutional affiliation of the first author. The most common countries included China (26%), USA (17%), Italy (7%), France (6%), and South Korea (6%).

Risk of bias assessment resulted in exclusion of three studies which exhibited clear methodological concerns, such as lack of reporting of basic participant demographics (Supplementary Material 2). The majority of studies used clearly defined inclusion criteria (95%) with detailed descriptions of participants and settings (91%). Only 41% of studies explicitly identified potential confounding factors.
3.1 Datasets
Few studies used more than a single dataset: 233 studies used one dataset, 18 used two datasets, and the remainder used three or more datasets. The most commonly used dataset was ADNI (see Figure 2). In the majority of the studies using data from ADNI, the specific cohort used (ADNI-1, ADNI-2, ADNI-GO, J-ADNI) was not stated (129 of 181) (Table 2 in Supplementary Material 1). Where the cohort was available (n = 52), 36 (69.2%) studies used a single cohort, 8 (15.4%) used two cohorts, and 8 (15.4%) used three cohorts. Of those that used ADNI-2 and ADNI-GO (n = 11), a majority (n = 9) also used ADNI-1. Apart from using the ADNI dataset alone, 19 studies used data from ADNI combined with other datasets including the UK Biobank and AIBL. The majority (n = 11) of these combination studies used a local dataset in addition to the ADNI dataset.

3.2 AI methods
The classifier type most frequently used was a non-probabilistic algorithmic approach (48%), an example of which is support vector machines (SVM), followed by discriminative classifiers (32%), which include most neural networks. Generative classifiers and “other” methods each constituted 10% of the literature; the latter mainly consisted of studies which combined multiple AI algorithms to generate novel or complex classification tools and were therefore difficult to categorize. Most of these studies focused heavily on computational methods which are not easily accessible to a clinical audience.
The number of studies which used algorithmic classifiers (mainly SVM) increased considerably between 2013 and 2015, after which their use stabilized. In contrast, there was a sharp rise in the number of studies using discriminative approaches (e.g., neural networks) starting in 2017, with discriminative studies outnumbering algorithmic studies for the first time in 2019 (Figure 3).

To identify potential differences in performance between ML methods, we examined AUC values for classifying AD versus healthy controls across studies (Figure 4). Of note, only 13% (11 of 84) of these studies reported a confidence interval for the AUC value. Of these 11 studies, 5 did not report the confidence level of the interval (e.g., 90% or 95%).
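For readers unfamiliar with how an AUC confidence interval can be obtained and reported together with its level, the following Python sketch uses a percentile bootstrap. This is one common approach, shown under stated assumptions, and is not necessarily the method used by the studies that did report intervals. The function names, the default of 2000 resamples, and the toy data are hypothetical.

```python
import random
from typing import Sequence, Tuple


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: probability that a patient (1) scores above a control (0)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one patient and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(y_true: Sequence[int], scores: Sequence[float],
                     level: float = 0.95, n_resamples: int = 2000,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the AUC at the stated level."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(y_true)
    estimates = []
    while len(estimates) < n_resamples:
        # Resample participants with replacement, keeping label and score paired.
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if len(set(labels)) < 2:
            continue  # AUC is undefined when a resample contains a single class
        estimates.append(auc(labels, [scores[i] for i in idx]))
    estimates.sort()
    alpha = (1.0 - level) / 2.0
    lower = estimates[int(alpha * (n_resamples - 1))]
    upper = estimates[int((1.0 - alpha) * (n_resamples - 1))]
    return lower, upper


# Toy example (hypothetical data): report the estimate, the interval, and its level.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
low, high = bootstrap_auc_ci(labels, model_scores, level=0.95)
print(f"AUC = {auc(labels, model_scores):.3f} (95% CI {low:.3f} to {high:.3f})")
```

Passing `level` explicitly and printing it alongside the interval addresses the reporting gap noted above: an interval without its confidence level cannot be compared across studies.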

We employed a meta-analytic approach using the stratified weighted random method (SWRM) to weight results based on the dataset, imaging modality, and type of ML method used (methodological details in Supplementary Material 1). We found that, for classification of AD versus healthy controls, (i) discriminative models (SWRM = 3.39, RSD = 0.948, Heterogeneity = Considerable) performed better than algorithmic (SWRM = 2.42, RSD = 0.758, Heterogeneity = Substantial) and generative (SWRM = 2.14, RSD = 0.784, Heterogeneity = Moderate) classifiers; and (ii) each R table was expected to have 49 rows but had only 6 to 8, which indicates that most of the literature was limited to a few datasets and imaging modalities.
We identified four studies which used transfer learning for classification.45-48 The models were trained on ImageNet,45 ADNI (normal controls and AD),46 Human Connectome Project (HCP),47 and generic images,48 and were transferred to ADNI,45 ADNI (stable and progressive MCI),46 ADNI,47 and ADNI (sMRI).48 Transfer learning was typically used for fine-tuning neural networks, particularly when the authors felt the dataset was not large enough to properly train the neural network algorithm. Accuracy varied between these studies, including for the following classification tasks: AD versus healthy controls (90.4–99.1), MCI versus healthy controls (83.2–99.2), and MCI converters versus non-converters (70.6–81.6).
3.3 MRI
The number of imaging modalities used across the included studies can be found in Figure 5. Structural MRI and PET/SPECT were the most frequently used imaging modalities for diagnosis and prognosis of dementia, being used in approximately 71% and 25% of studies respectively. Around half of studies leveraged structural MRI alone (134 of 255) and those making use of multiple modalities (49 of 255) often used sMRI and PET (35 of 49) together. It is only since 2020 that studies incorporating three or more different modalities have begun to appear.49-51

In total, 68.6% (175 of 255) of studies relied on volumetric structural MRI measurements. In the few studies that tested traditional and AI approaches head-to-head, AI methods outperformed raw volumetric measurements, for example, against hippocampal volume for diagnosis52, 53 and for predicting conversion of MCI to AD.54 The reported accuracy of AI methods for the diagnosis of AD varied between 60.2% and 99.3%. Of note, estimates in the lower range were found when using a multi-class classifier (i.e., AD vs. MCI vs. healthy controls, rather than AD vs. healthy controls)55, 56 or where an independent validation group was used.57
Contributing to heterogeneity, the aim of “diagnosis” differed between studies using structural MRI. For example, there were 17 studies specifically targeting early diagnosis in which “early” disease was variably defined by: MMSE score < 24 [58-60]; CDR 0.5-1 [48, 61-63]; progression from MCI to AD within 18 months,64, 65 2 years,66 or 3 years67, 68; conversion more than 12 months after imaging69; or was not clearly defined.70-72
Studies using longitudinal structural MRI measures (n = 6)69, 73-77 suggest that multiple timepoints may be more accurate than baseline measures alone for the diagnosis of AD,62 and were particularly useful when applied to the prediction of MCI to AD conversion.69, 75, 77 Of interest, longitudinal changes in volumetric MRI may need to be considered in the context of baseline volumetry to be meaningful.74
Twenty-eight studies investigated the use of non-volumetric structural imaging features for diagnosis (n = 24) and/or prognosis/conversion (n = 7). The input consisted of T1- or T2-weighted images, DTI data, or a combination thereof, to estimate non-volumetric features such as cortical thickness, texture, and surface area. These studies focused on (i) optimization of image pre-processing techniques, (ii) investigation of feature selection methods, and (iii) optimization of classifiers and subsequent validation of the developed method. The accuracy for differentiating between AD patients and healthy controls ranged from 79.2% to 99.1%. Promising developments were noted for differential diagnosis (e.g., vascular dementia vs. AD)78 and early diagnosis distinguishing MCI and healthy controls.79-82 As expected, differentiating MCI subtypes and between MCI and AD cohorts was a more difficult task, which is also often the case in clinical practice. We found that performance was lower when predicting MCI conversion to AD, or conversion of stable MCI to progressive MCI.83-85
Twenty-six studies (the first published in 2012) used resting-state MRI (rsMRI); we did not identify any studies using task-based MRI. All but 4 studies51, 52, 86, 87 focused on diagnosis and the majority (20 of 26) used ADNI data, either as the primary dataset or as a replication dataset. Graph measures were often used to summarize network characteristics. Overall, the accuracy of discriminating between AD and controls ranged between 85% and 97%, but dropped when discriminating between MCI and controls (70-88%). Most studies reported the nodes which contribute most to discrimination between AD and controls: there was some heterogeneity, but most often components of the default mode network (DMN) were identified.88-91
3.4 Neurophysiological imaging
We identified 24 studies which used neurophysiological imaging methods, only three of which investigated non-AD neurodegenerative diseases including PD and FTD.92-94 The majority of the studies (n = 21) used quantitative EEG, while the remainder used either MEG,95 event-related potential EEG,96 or combined EEG with SPECT.97 Although half (n = 12) of these studies have been published since 2018, this cohort of publications also included some of the earliest studies identified in this review, starting in 2005.98, 99 All neurophysiological studies used data from their local institution, the largest of which included EEG recordings from 272 participants,100 although most studies (n = 13) included fewer than 50 participants. In a manner similar to other imaging modalities, SVM was the most common (n = 12) ML tool used and no other algorithm was used in more than three studies. Accuracy of discrimination between AD and healthy controls varied from 69% in the single MEG study95 up to 100% in one study using four EEG features.101
3.5 PET/SPECT imaging
Sixty-five studies using PET imaging were identified, aiming to improve early diagnosis (n = 46), prognosis (n = 13), or both (n = 6) using ML approaches. The most commonly used approach was SVM (n = 27), which, when applied to fluorodeoxyglucose (FDG) PET, demonstrated an accuracy of over 85% in studies for detecting AD hypometabolic patterns102-104 and outperformed structural MRI when compared head-to-head.105, 106 SVM applied to FDG PET data distinguished AD (>86% accuracy) and MCI (>78.8% accuracy) from controls and predicted MCI conversion within 12 months and up to 5 years with accuracies ranging from 72% to 80%.107-118 The same approach applied to amyloid PET also demonstrated accuracies of >85% for predicting MCI conversion and diagnosing AD.115, 117, 119-121 Non-SVM approaches, such as convolutional neural networks and deep learning, on FDG PET and amyloid PET showed variable performance in predicting a final diagnosis of AD, cognitive decline, or MCI conversion,46, 122-132 with accuracy between 75% and 100%. Model accuracy was lower in multicenter studies (>70%) than in studies relying on local datasets (>78%).
Compared to ML methods which used PET alone, those which combined imaging modalities (i.e., FDG PET, amyloid PET, and/or MRI) were more accurate in terms of diagnosis of both MCI and AD (min accuracy: 56% for PET alone vs. 72% for PET and other modalities).109, 115-117, 120, 133-135 An additional approach used PET and structural MRI data in combination with other markers (i.e., apolipoprotein E4 [APOE4] status and cognitive scores) to train a classifier, then selected neuroimaging features for classification, showing better performance when neuroimaging data (gray matter density, amyloid burden, APOE4 status; r = −0.68) were used to predict individualized rate of cognitive decline in MCI, compared to cognitive predictors (depression, memory and executive function scores; r = −0.4).136 Similarly, three studies showed that SPECT is able to classify MCI and AD, but its predictive value for MCI conversion improved when combined with other imaging modalities or cognitive assessments.97, 137, 138
3.6 Approaches to prognosis in AD
Fifty-four studies investigated either prognosis or a combination of diagnosis and prognosis. The majority were retrospective designs (51 of 54). Of 54 studies, 47 (87%) looked at prognosis in terms of MCI to AD conversion. Of these studies, two approaches were used to evaluate the performance of prognostic predictions; some exclusively used baseline data (fixed), while others used multiple imaging time points (continuous) and related these to time to conversion.
MRI alone was the main imaging modality used (36 of 54 studies) with an additional six studies combining MRI and PET. Nine studies used only PET data,102, 111-113, 116, 122, 126, 139 one used SPECT,137 and two used EEG data.93, 95 The main outcome measure for these studies was conversion to AD from MCI over a prespecified period of time (47 of 54 studies). A smaller proportion of studies (n = 4) used cognitive decline as an outcome measure. Similar to the diagnostic studies discussed in this review, the majority of the neuroimaging data came from the ADNI database (78%, 42 of 54 studies). An additional three studies combined local datasets with ADNI.
Thirty-eight studies used only baseline imaging data to predict a future diagnosis with a range of accuracy between 65% and 96% (mean AUC 0.79, standard deviation 0.09). Seven used multiple imaging time-points to make predictions with accuracies between 73% and 92% (mean AUC 0.81, standard deviation 0.10). One paper found a substantial improvement with longitudinal data (AUC 0.93) compared to baseline data alone (AUC 0.54),111 and a second paper achieved a high level of accuracy using baseline neuroimaging information with longitudinal cognitive scores (AUC = 0.90).70
Time to conversion was divided into two categories: conversion within a fixed timeframe (42 of 47), or a continuous measure of time to conversion (5 of 47). Of those that used a fixed timeframe, 5 studies considered conversion within 1 year (AUC range: 0.72-0.90), 8 studies within 18 months (AUC range: 0.68-0.79), 5 studies within 2 years (AUC range: 0.74-0.96), 17 studies within 3 years (AUC range: 0.65-0.93), and 7 studies predicted conversion over periods longer than 3 years, up to a maximum of 10 years (AUC range: 0.54-0.91).
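The fixed-timeframe framing turns a time-to-event outcome into a binary label, which can be sketched in Python as follows. The function name, its arguments, and in particular the rule for participants whose follow-up ends before the window closes are assumptions made for illustration; the reviewed studies may have handled incomplete follow-up differently.

```python
from typing import Optional


def fixed_window_label(months_to_conversion: Optional[float],
                       months_followed: float,
                       window_months: float) -> Optional[int]:
    """Binary conversion label for a fixed prediction window.

    Returns 1 for conversion within the window, 0 for a participant observed
    to remain unconverted through the whole window, and None when follow-up
    ended before the window closed without conversion (status unknown).
    """
    if months_to_conversion is not None and months_to_conversion <= window_months:
        return 1
    if months_followed >= window_months:
        return 0
    return None  # assumption: such participants are excluded from the analysis


# Hypothetical participants evaluated against a 36-month window.
print(fixed_window_label(12, 12, 36))     # converted at 12 months
print(fixed_window_label(None, 40, 36))   # stable through 40 months of follow-up
print(fixed_window_label(None, 20, 36))   # lost to follow-up at 20 months
print(fixed_window_label(48, 48, 36))     # converted, but only after the window
```

The last case shows why results from different windows are not directly comparable: the same participant counts as a non-converter at 3 years and as a converter at 5 years. A continuous time-to-conversion outcome avoids this dependence on the chosen window.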
The main outcome of the remaining studies that did not focus on MCI to AD progression (7 of 54) varied; 2 of 7 predicted cognitive scores (Alzheimer's Disease Assessment Scale-Cognitive Subscale [ADAS-Cog]) over time using longitudinal MRI,140, 141 while 2 other studies predicted both cognitive scores (Mini Mental State Examination [MMSE]) and MCI to AD conversion within 24 months.77, 142 Additionally, two of seven studies predicted conversion from cognitively normal to AD in 7 years [59] and 2 years.64, 143 Finally, only one paper examined prognosis in non-AD neurodegenerative diseases, namely PD and dementia with Lewy bodies (DLB),93 with an AUC of 0.87.
3.7 Non-Alzheimer's dementias
The majority of studies that included patients with non-Alzheimer's dementia used neuroimaging features to improve the differential diagnosis between different dementia diagnoses. In total, 17 studies included a non-AD dementia group: 14 featured a non-AD dementia as the diagnosis of interest, and the remaining 3 used the non-AD groups as a control group. FTD or behavioral variant FTD (bvFTD) was the most commonly investigated non-AD dementia, with seven studies having FTD or bvFTD as their main focus.92, 93, 144-148 These studies attempted differential diagnosis of FTD (from AD and/or LBD) most often using neuropsychological data and structural imaging (four of seven studies), with two studies using EEG92, 94 and one using structural MRI for classification based on post-mortem pathology.147 Five studies used data routinely collected in clinics (for example, from memory clinics) to attempt differential diagnosis between patient groups based on imaging features and typically included FTD, LBD, PSP, CBD, PD dementia, and vascular dementia.53, 149-152
Structural MRI was the most frequently used imaging modality (11 of 17 studies). Two studies focused on the differential diagnosis between PD and LBD,93, 153 and only two on vascular dementia.78, 154 The majority of studies used data from local hospitals or memory clinics (14 of 17 studies); one paper used local data combined with ADNI,57 and three studies used multi-center or cohort data.144, 148, 150 Since the majority of studies utilized prospective or retrospective data from local clinics, datasets were relatively small compared to multi-center studies like ADNI with most studies including 60 to 100 patients and some as low as 15 patients in a single diagnostic category.78 The studies with larger patient numbers tended to come from multi-center studies144, 150 or used retrospective data over a long period of time.147
4 DISCUSSION
In this systematic review, we examined 255 published studies using neuroimaging alone for the diagnosis or prognosis of neurodegenerative disease. The vast majority of studies (71%) used the ADNI dataset which primarily uses MRI and focuses on the conversion from MCI to AD. The dominance of ADNI means that this emphasis is reflected in the published literature, with the majority of studies using structural MRI alone or in combination with another MRI modality or PET, almost all of which focused on AD. The size of the ADNI data has led to a rapid rise since 2017 in the use of more complex discriminative AI methodologies, including deep learning models. These more complex models have in general outperformed simpler algorithmic and generative models, although comparison between studies is challenging given differences in diagnostic criteria and outcome measures. Most studies of diagnosis published ROC curve analysis results; however, there were marked differences between studies in definitions such as “early” dementia, and in the outcome measures used in prognostic studies. There remain significant gaps in the literature including non-Alzheimer's neurodegenerative diseases (most strikingly vascular dementia with only two studies), the limited application of promising neurophysiology methods, and validation in clinically relevant populations.
ML methods have been successfully applied to almost every aspect of neurodegenerative disease.155 A previous review of ML for neuroimaging in dementia included studies up to 2016,42 and the field has expanded rapidly since then. Approximately 60% of the studies we included (n = 152) have been published since 2016. Some progress has been made on the concerns raised by Pellegrini and colleagues, including the overreliance on SVM classifiers and MRI. SVM was still the most frequently used classifier in our cohort, which is unsurprising given that it was one of the first widely adopted methods. However, the overreliance on SVM classifiers has reduced, reflecting the rapid growth of this field and a move toward a wider range of ML methodologies, as well as PET and/or multimodal approaches. Despite this surge in studies, several barriers prevent the integration of these novel methods into everyday clinical practice. Below we discuss three critical issues identified from this systematic review: (1) reporting and reproducibility of methodology, (2) addressing clinically relevant questions, and (3) validation of results.
4.1 Methodological considerations
While it is encouraging to see a wide range of methods applied to neuroimaging data, the multiplicity of approaches creates a challenge in assessing the validity of each method, comparing between differing models, and independently reproducing the results. Although we did not systematically review reproducibility, in general we found limited descriptions of many models, and only a minority of studies reported the availability of code to enable replication.
Reproducibility and transparency in neuroimaging research are increasingly prominent issues, most clearly outlined by Poldrack and colleagues.156 The neuroimaging field has led the way in open science efforts, such as the large data sharing platforms pioneered by the Human Connectome Project157 and the introduction of best practice for analysis and data sharing through the COBIDAS guidelines.158, 159 To increase the reliability of results, pre-registering analysis through platforms such as the Open Science Framework160 has been advocated for in both neuroimaging studies161 and ML methodologies.162 More generally, staged approaches to model validation in ML are available to improve confidence in model performance.25
We found that the combination of multiple imaging modalities, such as MRI and PET, improved the performance of ML models for classification tasks related to AD. We speculate that using features from multiple modalities enables the models to train on several different biomarkers which provide a more holistic representation of the underlying disease mechanisms, such as changes in structure (volumetric MRI), network-connectivity metrics (resting-state fMRI), and metabolic physiology (PET). Although the results suggest this approach may be beneficial, the limited number of studies identified here using this method means that it is difficult to suggest which combinations of modalities will be best at improving the performance of ML models.
4.2 Addressing key clinical questions
Relevant clinical questions can be split into early diagnosis, differential diagnosis, prognosis and predicting response to treatment. There were no studies investigating the response to treatment, perhaps unsurprisingly given that the currently widely available treatments for dementia are symptomatic rather than disease modifying. The majority of studies considered the diagnosis of AD, or the prognostic prediction of MCI conversion to early AD. However, variability in definitions such as “early Alzheimer's disease” limited comparison between studies. This partly reflects the wider field where, for example, a clear definition of MCI has remained elusive despite recent efforts to reach such a consensus.163
We found no studies that assessed the common clinical challenge of differential diagnosis from among multiple (>2) possible diagnoses. This is a much harder problem to solve for ML algorithms because it requires a multi-class classifier which is computationally more challenging and typically yields lower accuracy than a binary classifier. The lack of appropriate multiclass data is a major limitation, particularly given the reliance on the ADNI dataset that consists almost exclusively of amnestic MCI or AD patients. The National Alzheimer's Coordinating Center dataset has Alzheimer's and non-Alzheimer's dementia patients from a real-world setting,164 but is much more variable in scanning sequences (including MRI field strengths), and reports clinically defined diagnoses rather than research diagnostic criteria.
ROC curve analysis was widely used to characterize diagnostic classification performance. In particular, we found the AUC is often reported as the main measure of classification between groups, usually accompanied by the PPV and NPV. The PPV and NPV are more relevant to clinical practice, providing interpretation of the proportion of correct positive and negative results for a classification. The outcome measure for prognostic studies is more challenging. We found that studies predicting prognosis usually grouped outcomes and applied ROC curve analysis. This is particularly relevant for predicting MCI to AD conversion; however, it is not applicable to other situations, such as predicting the rate of cognitive decline in established dementia.
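To make concrete why PPV and NPV matter in clinical practice, the short Python sketch below derives both from sensitivity, specificity, and disease prevalence using Bayes' rule. The function name and the input values are hypothetical and chosen only for illustration; they are not results from any reviewed study.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a binary classifier via Bayes' rule.

    All three inputs are proportions strictly between 0 and 1.
    """
    for value in (sensitivity, specificity, prevalence):
        if not 0.0 < value < 1.0:
            raise ValueError("inputs must be proportions strictly between 0 and 1")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical classifier with 90% sensitivity and 90% specificity,
# evaluated at two assumed prevalences.
ppv_balanced, npv_balanced = predictive_values(0.9, 0.9, 0.5)  # balanced research sample
ppv_clinic, npv_clinic = predictive_values(0.9, 0.9, 0.1)      # lower-prevalence setting
```

With these assumed inputs the PPV falls from about 0.90 in the balanced sample to about 0.50 at 10% prevalence, even though sensitivity and specificity are unchanged. PPV and NPV reported from a balanced research cohort may therefore not carry over to a clinical population with a different case mix.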
4.3 Validation of results
We found that studies using an independent dataset for validation, as opposed to cross-validation or other similar methods, reported much lower accuracy, particularly when a community-based population was used. For instance, an SVM classifier trained on ADNI and then applied to memory clinic data showed markedly reduced accuracy in the clinical setting (AUC = 0.76 for AD diagnosis) compared to that in the training dataset (AUC = 0.96).57 A few recent studies have addressed the risk of overfitting by assessing generalizability in unseen independent research datasets,104, 165, 166 collectively demonstrating the value of this approach in identifying methodological issues relevant to the overall model performance. Therefore, validation studies are critical, particularly those in a memory clinic setting where the tools are ultimately to be used.
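The principle of external validation can be sketched in a few lines of Python: the model is frozen after development and then scored, without any refitting, on a cohort it has never seen. Everything below is a toy illustration under stated assumptions. The cohorts are synthetic, the single-feature "model" is a stand-in for a trained classifier, and the names `auc`, `evaluate_frozen_model`, and `frozen_model` are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_frozen_model(score_fn, cohort):
    """Score a cohort of (feature, label) pairs with an already-fitted model."""
    labels = [label for _, label in cohort]
    scores = [score_fn(feature) for feature, _ in cohort]
    return auc(labels, scores)


def frozen_model(feature):
    """Stand-in for a trained classifier: the feature itself is the score."""
    return feature


# Synthetic (feature, label) pairs; label 1 = patient, 0 = control.
development_cohort = [(0.9, 1), (0.8, 1), (0.7, 1), (0.3, 0), (0.2, 0), (0.1, 0)]
external_cohort = [(0.9, 1), (0.4, 1), (0.6, 1), (0.5, 0), (0.2, 0), (0.7, 0)]

auc_development = evaluate_frozen_model(frozen_model, development_cohort)
auc_external = evaluate_frozen_model(frozen_model, external_cohort)
```

The key design choice is that nothing about the model changes between the two calls, so any drop in AUC reflects the difference between cohorts rather than a difference in fitting.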
The over-reliance on a single dataset such as ADNI introduces potential ethnic and socio-economic biases to models that may hamper generalization, an issue that has been specifically raised in the ADNI dataset.167 Concerns have been raised more generally about bias in ML models,168 including in the context of health applications.169 This is of particular concern in marginalized ethnic groups who have poorer health indicators in general,170 and who may miss out on access to health services due to socio-demographic factors or cultural or religious beliefs,171 including dementia services.172, 173 More representative datasets are critical for models to translate reliably to all parts of the population, to inform risk prediction models, and to work toward closing gaps in health inequality related to dementia. Bias in collected datasets, differences in model performance between genetic or ethnic groups, and applicability to different socio-economic populations will all be critical to address in ongoing data collection. It is unlikely that a single study or a single dataset can properly address these challenging issues, so collaboration between studies and between countries is required. This is happening to some extent in initiatives such as J-ADNI in Japan, which follows a protocol almost identical to the North American one and has been used to compare diagnosis and progression in dementia between both cohorts.174 Other examples include the Longitudinal Aging Study in India (LASI-DAD)175 and the Genetic Frontotemporal dementia Initiative (GENFI),176 which recruits multi-nationally. Federated learning may also help address this issue by providing broader accessibility to datasets from diverse backgrounds and international sources.
A number of methodological approaches are available for measuring or mitigating bias.177 Examples include the geometric solution to learn fair representations (He et al. 2020),178 which removes correlations between the data and specified protected features, as well as IBM's AI Fairness 360 toolkit (Bellamy et al. 2019),179 which provides an accessible set of fairness metrics for a model and accompanying explanations to help mitigate bias. We did not find the issue of bias to be discussed or addressed in the studies we reviewed.
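As a simple illustration of what measuring bias can look like in practice, the Python sketch below compares a classifier's sensitivity across demographic groups and reports the largest gap. This is a generic check written for illustration only; it does not reproduce the AI Fairness 360 API or the method of He et al., and the function names, group labels, and data are hypothetical.

```python
from collections import defaultdict


def sensitivity_by_group(y_true, y_pred, groups):
    """Sensitivity (true positive rate) computed separately for each group.

    Groups with no positive cases are omitted from the result.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 1:
                true_pos[group] += 1
    return {group: true_pos[group] / positives[group] for group in positives}


def sensitivity_gap(per_group):
    """Largest difference in sensitivity between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Synthetic example: 1 = dementia, 0 = control; groups "A" and "B" are placeholders.
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]
groups = ["A", "A", "A", "A", "A", "A", "B", "B"]

per_group = sensitivity_by_group(y_true, y_pred, groups)
gap = sensitivity_gap(per_group)
```

A gap near zero suggests that patients in each group are detected at a similar rate; a large gap signals that the model misses cases more often in one group and warrants further investigation, ideally with confidence intervals given how small subgroup samples often are.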
4.4 Challenges for the field
Some of the issues we have highlighted can be addressed by individual researchers, but others require engagement from the neuroimaging, ML, and clinical communities more generally. This kind of collaboration has proven successful in initiatives such as ADNI. Although ADNI is a powerful dataset and has facilitated the use of more complex methodologies, similar collaborations for data collection and curation are required to help address ML for non-Alzheimer's neurodegenerative disease, and for EEG data.
Given the challenges of comparisons between studies using different methodologies and definitions, we suggest the field move toward consensus on outcome measures. Diagnostic criteria exist for the major neurodegenerative disorders, but better definitions of ‘early’ disease, and standard methods to assess prognosis would facilitate model selection. We outline our recommendations in Box 1.
BOX 1: Recommendations to move toward clinically useful, machine learning methods applied to neuroimaging for dementia
Recommendations for machine learning studies
Methodological considerations
- Provide sufficient description of the methods, with available code, to enable independent replication
- Use a staged approach to model validation
- Pre-register analysis
- Consider using multiple modalities
Addressing key clinical questions
- Clearly state the diagnostic criteria used
- For diagnosis, report performance in terms of ROC curve analysis, including PPV and NPV, and confidence intervals
- Clearly define measures of prognosis, and consider the use of odds ratios and survival analysis
Validation
- Independently validate models in at least one independent dataset
- Validate findings in a real-world dataset (e.g., memory clinics)
Recommendations for the field more broadly
- Work toward consensus on outcome measures for diagnosis and prognosis
- Establish large datasets of non-AD and/or multiple types of dementia
- Establish open datasets for EEG comparable to those with MRI
- Monitor ethnic and sociodemographic bias in data collection and encourage cross-study collaboration to address these biases
In addition to overcoming these barriers related to transparency, establishing large, diverse datasets, external validation and consensus definitions, we will also need to address translational challenges more broadly to implement AI into real-world clinical settings.180 Integrating AI will require overcoming technical obstacles, such as the different types of bias and artifacts that arise when data are pooled from various sources and institutions,181 while ensuring the security and privacy of sensitive health records during storage and sharing.182 Several factors currently limit the adoption of AI tools by clinicians including identity threat,183, 184 disruption of clinical workflow, and the uncertainty surrounding the basis of “black box” algorithms, particularly when the output disagrees with their own clinical judgement.185 By improving interpretability, explainable AI may be the most amenable approach to building trust and understanding in the medical profession.186 Furthermore, social and legal issues will require significant attention if implementation of AI into clinical practice is to be successful. For example, there remains uncertainty about which party is responsible when the use of AI tools results in harm from both legal187 and patient188 perspectives, while patients in general may prefer human supervision over AI.189
4.5 Limitations
This systematic review has three main limitations. First, although we aimed to provide an informed and broad overview of the existing literature on this subject, our exclusion of reports not written in English and those where the full text was not available meant that some studies which would have otherwise met the inclusion criteria may not have been covered in this review. Two key additional exclusion criteria were the decisions not to include studies using linear regression for classification, and studies combining neuroimaging with other biomarkers without reporting the model performance for the neuroimaging features in isolation. Our motivation was to focus specifically on neuroimaging, and specifically on recognized ML methods, but it is possible we excluded studies with high clinical value and translational potential.
Second, the heterogeneity in classification tasks, ML methods used and statistical reporting across studies may have introduced bias when trying to decipher which tasks and results to extract. More specifically, this was an issue with the more technical studies which compared multiple (often > 5) ML methods across three or more classification groups introducing a large number of comparisons and results to consolidate and extract. For this reason, we decided to run our meta-analysis on a very specific task from which we could extract the AUC value for classifying AD versus healthy controls. This heterogeneity in AI methods, imaging modalities, and patient cohorts also meant that we were unable to provide insight into which features performed best for specific classification tasks. We do not address significant ethical issues in big data analysis of data security, consent to data sharing, and the acceptability of AI methods to clinicians and the general public.
Third, we employed a risk of bias screening tool that depended on a subjective judgment for each paper's inclusion or exclusion, and there may have been heterogeneity in this assessment between screeners. We chose a low threshold for inclusion based on study quality in order to accurately depict and identify current barriers in the literature limiting translation to clinical practice. We only excluded studies exhibiting clear methodological concerns, such as lack of reporting of basic participant demographics. The screening tool had a binary outcome (inclusion/exclusion), and we were unable to investigate the potential relationship between study quality and ML performance.
5 CONCLUSIONS
In this systematic review, we generate a number of recommendations to facilitate translation of ML methods for patient benefit in the diagnosis and prognosis of dementia. We highlight issues of methodological heterogeneity, clinical relevance of results, and validation/replication of findings. We offer a set of recommendations to address key gaps in the literature including the importance of addressing key clinical questions, providing sufficient details of AI methods, and validating findings in independent datasets which are clinically relevant. Looking forward, the field is likely to move toward the establishment of real-world datasets, multimodal imaging methods, and complex ML algorithms, which emphasizes the importance of providing sufficient methodological details to enable independent replication. We are optimistic that addressing these concerns will accelerate the translation of ML methods for patient benefit in neurodegenerative disease.
AUTHOR CONTRIBUTIONS
Robin J. Borchert, Michele Veldsman, Timothy Rittman contributed to the conception of the work, drafting and revision of the manuscript for intellectual content. Robin J. Borchert, Michele Veldsman, Timothy Rittman, Jose Bernal, Eugene Tang contributed to the development of the protocol. Veronica Phillips conducted the literature search. Robin J. Borchert coordinated the screening process. Robin J. Borchert, Michele Veldsman, Timothy Rittman, Tiago Azevedo, AmanPreet Badhwar, Jose Bernal, Matthew Betts, Rose Bruffaerts, Helena M. Gellersen, Audrey Low, Christopher R. Madan, Maura Malpetti, Jhony Mejia, Sofia Michopoulou, Carlos Muñoz-Neira, Marion Peres, Siddharth Ramanan, Stefano Tamburin, Hanz M. Tantiangco, Lokendra Thakur, Alessandro Tomassini, Ashwati Vipin, Eugene Tang, Danielle Newby screened papers for inclusion in the review. Robin J. Borchert, Jose Bernal, Helena M. Gellersen, Audrey Low, Jhony Mejia, Carlos Muñoz-Neira, Marion Peres, Hanz M. Tantiangco extracted data from eligible papers. Robin J. Borchert, Michele Veldsman, Timothy Rittman, Jose Bernal, Lokendra Thakur contributed to analysis and interpretation of the data. Lokendra Thakur contributed to the meta-analytic approach. Robin J. Borchert, Michele Veldsman, Timothy Rittman, AmanPreet Badhwar, Jose Bernal, Matthew Betts, Rose Bruffaerts, Michael C. Burkhart, Ilse Dewachter, Audrey Low, Luiza Machado, Maura Malpetti, Jhony Mejia, Sofia Michopoulou, Jack Pepys, Stefano Tamburin, Lokendra Thakur, Ashwati Vipin contributed to the writing of the manuscript. Michele Veldsman and Timothy Rittman provided study supervision. Janice M. Ranson and David J. Llewellyn conceived and organized the symposium from which this paper and others in the series originated, obtained funding, contributed to the conception of the work, revised the manuscript for intellectual content, and harmonized the manuscript with other papers in the series. 
Ilianna Lourida revised the manuscript for intellectual content and harmonized the manuscript with other papers in the series. All authors read and approved the final manuscript.
ACKNOWLEDGMENTS
With thanks to the Deep Dementia Phenotyping (DEMON) Network State of the Science symposium participants (in alphabetical order): Peter Bagshaw, Robin Borchert, Magda Bucholc, James Duce, Charlotte James, David Llewellyn, Donald Lyall, Sarah Marzi, Danielle Newby, Neil Oxtoby, Janice Ranson, Tim Rittman, Nathan Skene, Eugene Tang, Michele Veldsman, Laura Winchester, Zhi Yao. This paper was the product of a DEMON Network state of the science symposium entitled “Harnessing Data Science and AI in Dementia Research” funded by Alzheimer's Research UK. Race against Dementia Alzheimer's Research UK (ARUK-RADF2021A-010). Jose Bernal is supported by the MRC Doctoral Training Programme in Precision Medicine (Award Reference No. 2096671). Amanpreet Badhwar is supported by Fonds de recherche du Québec Santé—Chercheur boursiers Junior 1 and Fondation Courtois. Matthew Betts is supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 425899996 – SFB 1436 Project A08 and by the German Federal Ministry of Education and Research (BMBF, funding code 01ED2102B) under the aegis of JPND. Eugene Tang, NIHR Clinical Lecturer, is funded by the National Institute for Health and Care Research (NIHR). The views expressed in this publication are those of the author(s) and not necessarily those of the NIHR, NHS or the UK Department of Health and Social Care. Sofia Michopoulou, NIHR Clinical Lecturer, was funded by the National Institute for Health Research (NIHR), the NIHR Applied Research Collaboration ARC Wessex, the Southampton Academy of Research and the Health Education England Topol Fellowship program. The views expressed in this publication are those of the author and not necessarily those of the funding bodies. 
Carlos Muñoz-Neira was supported by the Government of Chile through ‘Becas Chile’ and CONICYT—National Commission for Scientific and Technological Research [CONICYT—Comisión Nacional de Investigación Científica y Tecnológica], the University of Bristol (Grant Code G100030-150), and its Postdoctoral Research Associate position at the University of Sheffield. Janice Ranson and David Llewellyn are supported by Alzheimer's Research UK and the Alan Turing Institute/Engineering and Physical Sciences Research Council (EP/N510129/1). DJL also receives funding from the Medical Research Council (MR/X005674/1), National Institute for Health Research (NIHR) Applied Research Collaboration South West Peninsula, National Health and Medical Research Council (NHMRC), and National Institute on Aging/National Institutes of Health (RF1AG055654). Timothy Rittman is supported by the Cambridge Centre for Parkinson's Plus Disorders and the Cambridge Biomedical Research Centre. This manuscript was facilitated by the Alzheimer's Association International Society to Advance Alzheimer's Research and Treatment (ISTAART), through the Artificial Intelligence for Precision Dementia Medicine professional interest area. The views and opinions expressed by authors in this publication represent those of the authors and do not necessarily reflect those of the PIA membership, ISTAART or the Alzheimer's Association. [Correction added on 01 September 2023, after first online publication: The preceding two sentences were added.]
CONFLICT OF INTEREST STATEMENT
The authors declare that they have no conflicts of interest. Author disclosures are available in the supporting information.