Acoustic speech features are associated with late-life depression and apathy symptoms: Preliminary findings
Noham Wolpe and Eyal Bergmann contributed equally to this study as senior authors.
Abstract
BACKGROUND
Late-life depression (LLD) is a heterogeneous disorder related to cognitive decline and neurodegenerative processes, raising a need for the development of novel biomarkers. We sought to provide preliminary evidence for acoustic speech signatures sensitive to LLD and their relationship to depressive dimensions.
METHODS
Forty patients (24 female, aged 65–82 years) were assessed with the Geriatric Depression Scale (GDS). Vocal features were extracted from speech samples (reading a pre-written text) and tested as classifiers of LLD using random forest and XGBoost models. Post hoc analyses examined the relationship between these acoustic features and specific depressive dimensions.
RESULTS
The classification models demonstrated moderate discriminative ability for LLD, with an area under the receiver operating characteristic curve (auROC) of 0.78 for random forest and 0.84 for XGBoost in an out-of-sample test set. The top classifying features were most strongly associated with the apathy dimension (R2 = 0.43).
DISCUSSION
Acoustic vocal features that may support the diagnosis of LLD are preferentially associated with apathy.
Highlights
- The depressive dimensions in late-life depression (LLD) have different cognitive correlates, with apathy characterized by more pronounced cognitive impairment.
- Acoustic speech features can predict LLD. Using acoustic features, we were able to train random forest and XGBoost models to predict LLD in a held-out sample.
- Acoustic speech features that predict LLD are preferentially associated with apathy. These results indicate a predominance of apathy in the vocal signatures of LLD, and suggest that the clinical heterogeneity of LLD should be considered in development of acoustic markers.
1 INTRODUCTION
Major depressive disorder (MDD) is a highly disabling, common mental health condition worldwide, and has unique implications for individuals in older age.1 Up to 10% of individuals aged ≥ 65 receiving care in primary health settings show clinically notable depression, defined as late-life depression (LLD).2 However, LLD is frequently overlooked or inadequately addressed within primary care settings.3
The definition of LLD, similar to MDD, relies on categorical diagnoses and is based on a clinical assessment and questionnaires, as outlined in diagnostic manuals, such as the International Classification of Diseases 11th revision and the Diagnostic and Statistical Manual of Mental Disorders 5th Edition.4, 5 However, these diagnostic classifications often lack the depth needed to fully describe the unique clinical profile of each individual, thereby failing to capture the substantial heterogeneity within the disorder. This failure holds clinical significance, as the inability to differentiate between neurobiologically distinct clinical entities within depression6 can lead to ineffective treatment.7
LLD is predominantly characterized by apathy, or a lack of interest in activities, leading to reduced goal-directed behavior.8 This contrasts with early-life depression (ELD), which is more strongly associated with mood-related symptoms. In line with this, LLD is associated with distinct cognitive and brain changes compared to ELD.9 This supports the hypothesis that apathy may reflect neurodegeneration rather than a mood disorder per se.10-14
Apathy within LLD can be assessed using the Geriatric Depression Scale (GDS).15 Specifically, the withdrawal–apathy–vigor dimension of the GDS has been shown to be associated with cognitive decline, with a preferential impact on executive functions.16

In addition to diagnosis using clinical assessment and scales, objective markers have been increasingly used in recent years to support mental health diagnosis in general, and depression in particular.17 There is growing evidence that speech analysis can be used to support diagnosis.18 Spoken language contains crucial insights into the speaker's physical health and potential medical issues, and the distinctive qualities of a person's voice hold significant potential for identifying mood disorders.19 Studies have shown that depressed individuals exhibit reduced speech rate, increased pauses, monotone pitch, and lower vocal intensity, reflecting psychomotor retardation and emotional blunting.18 These vocal characteristics are not only markers of emotional state, but also reflect underlying cognitive and neurobiological changes associated with depression.20, 21 Depression alters various aspects of speech, and analyzing these characteristics can enhance traditional diagnostic methods.22 Speech features are influenced by content, emotional tone, and context, with negative situations increasing monotony and pauses.19 Expressiveness also varies with interpersonal dynamics, such as familiarity with the listener.23 Similar vocal patterns, such as reduced pitch variability and increased pauses, are also seen in conditions such as Parkinson's disease and Alzheimer's disease, linked to disruptions of motor and cognitive control.22, 24 Speech analysis thus offers a non-invasive tool for understanding neurobiological mechanisms across disorders.25, 26
RESEARCH IN CONTEXT
-
Systematic review: In this study, we sought to investigate the relationship among depressive symptoms, cognitive functions, and acoustic features of speech. Using acoustic features, we were able to train a random forest algorithm to predict late-life depression (LLD). Furthermore, examining the association of the acoustic features selected for the model with specific depressive symptoms, we found that they are related to apathy more than to other depressive dimensions or to cognitive performance.
-
Interpretation: These results indicate a predominance of apathy in the vocal signature of LLD and suggest that the clinical heterogeneity of LLD should be considered in development of acoustic markers.
-
Future directions: Focusing on heterogeneity in LLD involves in-depth clinical evaluation of different depressive dimensions and cognitive changes. Development of vocal markers and examination of their association with specific depressive dimensions could potentially identify different depressive subtypes and guide personalized diagnosis and treatment.
Recent studies have demonstrated strong classification capabilities for identifying MDD and depressive symptom severity using speech features, but the majority of patients studied were young adults.21, 23 Several studies have examined the use of machine learning to analyze speech patterns in older adults with LLD, yielding reasonable outcomes.25, 26 However, it remains unclear to which depressive symptom dimensions in LLD these acoustic features are sensitive. This study aims to bridge this gap by analyzing acoustic speech features in elderly patients. We hypothesized that, similar to previous studies, specific acoustic speech features would classify LLD. Importantly, we hypothesized that these features would be most sensitive to the apathy dimension.
2 METHODS
2.1 Participants
We recruited 40 participants from the old age psychiatry outpatient clinic at Rambam Health Care Campus. This clinic is in the largest tertiary medical center in Northern Israel, and patients typically come for cognitive and mental health assessments. Ethical approval was obtained from the hospital Helsinki Committee (reference number RMB-0443-22), and all patients provided informed consent. Inclusion criteria were individuals: (1) aged ≥ 65, (2) capable of understanding the course of the experiment, and (3) with intact reading abilities. Exclusion criteria were patients (1) presenting with active psychotic symptoms, (2) with severe cognitive decline precluding participation in the experiment, (3) with language comprehension difficulties, and (4) with a speech impediment.
2.2 Clinical assessments
All participants completed the recruitment phase and underwent assessment of depressive symptoms using the 15-item GDS,15 administered by a trained psychiatrist. The GDS is an instrument developed to assess depressive symptoms and screen for depression among older people.15 Here we used the 15-item version with the widely used cut-off score of 6/15 to categorize participants as either experiencing depression or not.27 Previous research has identified five separate dimensions within the GDS, comprising distinct subsets of items.28 Here, we examined these dimensions using the following binary thresholds: (1) subjective memory and (2) anxiety are already binary (0 or 1 score). For (3) withdrawal–apathy–vigor we used a threshold of 2/3, as previously described,29 and for (4) hopelessness we likewise used a threshold of 2/3. For the (5) dysphoric mood dimension, we did not identify a binary threshold in the literature; we therefore examined its distribution within our cohort and determined a data-driven cut-off of 3/7, which, on visual inspection, effectively separated depressed and non-depressed participants. In addition, cognitive function was assessed using the Montreal Cognitive Assessment (MoCA),30 which examines six domains: executive function, memory, visuospatial abilities, language, attention, and orientation. A cut-off score of 24/30 was used to categorize participants as cognitively intact or impaired.31
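The thresholding scheme above can be sketched as follows. This is a minimal sketch: the dictionary layout and function name are illustrative, and reading each "x/y" cut-off as "score ≥ x out of a maximum of y" is our assumption, not the study's code.

```python
# Binary cut-offs from the text; an "x/y" cut-off is read here as score >= x.
GDS_CUTOFFS = {
    "total_gds": 6,       # 6/15 -> classified as depressed
    "apathy": 2,          # withdrawal-apathy-vigor, 2/3
    "hopelessness": 2,    # 2/3
    "dysphoric_mood": 3,  # data-driven 3/7
}
MOCA_CUTOFF = 24          # score below 24/30 -> cognitively impaired

def binarize_scores(scores: dict) -> dict:
    """Map raw scale scores to the binary labels used in the analyses."""
    out = {k: int(scores[k] >= v) for k, v in GDS_CUTOFFS.items() if k in scores}
    # subjective memory and anxiety items are already 0/1
    for k in ("subjective_memory", "anxiety"):
        if k in scores:
            out[k] = int(scores[k])
    if "moca" in scores:
        out["cognitive_impairment"] = int(scores["moca"] < MOCA_CUTOFF)
    return out
```

For example, a participant with total GDS 9, apathy 2, and MoCA 20 would be labeled depressed, apathetic, and cognitively impaired.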
2.3 Voice recording and preprocessing
Each participant was asked to read a pre-written text of five lines with an "optimistic tone," as a positive context has previously been shown to yield more accurate depression classification,32 and reading is less sensitive but more specific than interview or picture description.33 The text described a pleasant scenario in basic language. The clinician who administered the text instructed participants to read toward the microphone, without further guidance. Participant speech was recorded using a condenser microphone with an internal noise cancellation filter (Tonor TC-777) at a sampling rate of 44.1 kHz.
The recordings were preprocessed as follows. First, noise reduction was applied using a noise reduction algorithm34 with default parameters, to mitigate background noise and enhance the clarity of the speech signals. Second, amplitude normalization was performed to ensure uniform scaling across recordings, rescaling each signal to a maximum absolute amplitude of 1.
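The normalization step can be sketched in a few lines of NumPy (the noise-reduction step, performed with a dedicated package in the study, is omitted here; the function name and synthetic signal are illustrative):

```python
import numpy as np

def normalize_amplitude(signal: np.ndarray) -> np.ndarray:
    """Rescale a waveform so its maximum absolute amplitude equals 1."""
    peak = np.max(np.abs(signal))
    if peak == 0:          # silent recording: nothing to rescale
        return signal
    return signal / peak

# Example on a synthetic 1-second signal at the study's 44.1 kHz sampling rate
t = np.linspace(0, 1, 44_100)
sig = 0.3 * np.sin(2 * np.pi * 220 * t)   # quiet 220 Hz tone
norm = normalize_amplitude(sig)            # peak amplitude is now 1
```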
We used the OpenSMILE toolkit35 to extract a total of 6374 acoustic features, focusing on both spectral and temporal attributes of the signal, as outlined in Table S1 in supporting information. The feature extraction process used the Voice Quality and ComParE 2016 feature sets, which have been shown to be effective in detecting depressive symptoms.36 For feature extraction, a sliding window algorithm was used, with a window size of 25 ms and a 10 ms overlap, parameters selected based on previous studies.37 The features were computed for each window, and the final features were obtained by calculating the mean, median, and standard deviation of each feature across windows. As our interest was in sex-invariant speech features, we only considered features that did not show sex differences; features showing sex-related differences (tested using an independent-samples t test) were excluded from further analyses (140 features excluded). All preprocessing steps and analyses were conducted in Python using Jupyter Notebook.38
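The sex-invariance filter can be sketched as below, using SciPy's independent-samples t test on a synthetic feature matrix. The alpha level of 0.05 is our assumption, as the paper does not state the significance threshold used.

```python
import numpy as np
from scipy.stats import ttest_ind

def sex_invariant_features(X: np.ndarray, is_female: np.ndarray, alpha: float = 0.05):
    """Return indices of features showing no detectable sex difference."""
    keep = []
    for j in range(X.shape[1]):
        _, p = ttest_ind(X[is_female, j], X[~is_female, j])
        if p >= alpha:               # no sex difference detected -> retain feature
            keep.append(j)
    return keep

# Synthetic example: 40 speakers, 3 features
rng = np.random.default_rng(0)
sex = np.array([True] * 20 + [False] * 20)
X = rng.normal(size=(40, 3))
X[20:, 1] = X[:20, 1]                # feature 1 identical across the sexes
X[20:, 2] = X[:20, 2]                # feature 2 identical across the sexes
X[sex, 0] += 5.0                     # feature 0 strongly sex-dependent
kept = sex_invariant_features(X, sex)   # feature 0 is excluded
```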
2.4 Classification model
We used two classification models to minimize model-specific biases. To optimize accuracy, and to consider non-linear relationships, we used random forest39 and XGBoost.40 Participant datasets were randomly split into a training set (60%, n = 24) and a test set (40%, n = 16). Importantly, feature selection and model training were conducted only using the training set, and the resulting model was tested independently on the held-out testing set.
Subsequently, for each feature, we constructed an individual receiver operating characteristic (ROC) curve by systematically varying cut-off values across the full range of feature values in the training set, and evaluated the ability of each feature to discriminate between depressed and non-depressed participants (per the GDS cut-off of 6/15) using sensitivity and specificity. The area under the ROC curve (auROC) was used as a measure of discriminability, and features with high discriminability (auROC > 0.85)41, 42 were selected for the classification model.
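The per-feature screening can be sketched in plain NumPy: sweep a cut-off over the observed feature values, trace out the ROC curve, and integrate it with the trapezoidal rule. The orientation-invariant `max(auc, 1 - auc)` step is our assumption, so that a feature is retained whichever direction it discriminates in.

```python
import numpy as np

def feature_auroc(x: np.ndarray, y: np.ndarray) -> float:
    """auROC of a single continuous feature against binary labels (threshold sweep)."""
    tpr, fpr = [1.0], [1.0]                      # cut-off below min: everyone "positive"
    for t in np.unique(x):                       # ascending thresholds
        pred = x >= t
        tpr.append((pred & (y == 1)).sum() / (y == 1).sum())
        fpr.append((pred & (y == 0)).sum() / (y == 0).sum())
    tpr.append(0.0)                              # cut-off above max: no one "positive"
    fpr.append(0.0)
    auc = sum((fpr[i - 1] - fpr[i]) * (tpr[i] + tpr[i - 1]) / 2
              for i in range(1, len(fpr)))       # trapezoidal rule (fpr descending)
    return max(auc, 1 - auc)                     # orientation-invariant

y = np.array([0, 0, 0, 1, 1, 1])
x_perfect = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])   # perfectly separating feature
x_mixed = np.array([0.1, 0.8, 0.3, 0.2, 0.9, 1.0])     # partially overlapping feature
```

With the pre-defined selection rule, `x_perfect` (auROC = 1.0) would pass the 0.85 threshold, while `x_mixed` (auROC ≈ 0.78) would not.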
Both the random forest and XGBoost models were implemented using the Python "scikit-learn" and "xgboost" packages.43 To train the random forest model, we used a hyperparameter optimization approach with a grid search method. Specifically, we explored various hyperparameters, including the number of estimators (100 or 500), maximum depths ranging from 1 to 3, and different criterion types ("gini" and "entropy"). Additionally, we varied the maximum features parameter, considering "sqrt," "log2," and none. For the XGBoost model, we optimized the learning rate, number of estimators, and tree depth. To accommodate the limited size of the training set, we used a leave-one-out cross-validation (LOOCV) approach to assess the performance of different hyperparameter combinations; LOOCV is particularly advantageous for small datasets, providing a minimally biased estimate of model performance.44 After evaluating all hyperparameter combinations, we selected the best-performing model based on its auROC score. Finally, the best-performing model in the training set was applied to the held-out data of the testing set, and its performance was evaluated using an ROC curve. In addition, we conducted a post-training analysis using SHapley Additive exPlanations (SHAP)45 to assess the importance of each feature in the best-performing model. SHAP values were calculated to provide further insight into the contribution of specific features to model predictions.
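The random forest search can be sketched with scikit-learn's `GridSearchCV` and `LeaveOneOut`. This is a sketch on synthetic data, with a deliberately reduced grid so it runs quickly (the study's full grid is given in the text), and with per-fold accuracy as the scoring function, since auROC is undefined on a single held-out sample; the study's auROC-based selection presumably aggregated across folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Synthetic stand-in for the 24-participant training set with 5 selected features
rng = np.random.default_rng(1)
X = rng.normal(size=(24, 5))
y = (X[:, 0] > 0).astype(int)

# Reduced grid; the study searched 100/500 estimators, depths 1-3,
# both split criteria, and max_features in {"sqrt", "log2", None}
grid = {
    "n_estimators": [50, 100],
    "max_depth": [1, 2],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    grid,
    cv=LeaveOneOut(),        # one fold per participant, as in the study
    scoring="accuracy",      # stand-in for the study's auROC criterion
)
search.fit(X, y)
best = search.best_estimator_     # model then applied to the held-out test set
preds = best.predict(X)
```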
2.5 Comparison between depressive dimensions
After establishing a classification model for LLD, we sought to examine whether features selected for the classification model were sensitive to specific depressive dimensions or cognitive performance. To address this question, we conducted a set of logistic regression models, each with one of the following dependent variables: (1) apathy, (2) dysphoric mood, (3) hopelessness, (4) subjective memory, (5) anxiety, and (6) cognitive impairment, based on the features selected for the LLD classification models. To compare among dimensions, we examined the variance explained by each model (R2) and the Akaike information criterion (AIC); because all models share the same input variables, both metrics allow direct comparison of model fit.
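The comparison can be sketched as below on synthetic data: one logistic regression per outcome on the same feature set, scored by pseudo-R2 and AIC. McFadden's pseudo-R2 and a near-unpenalized scikit-learn fit are our assumptions; the paper does not specify which R2 variant or fitting package was used for this step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_and_score(X: np.ndarray, y: np.ndarray):
    """McFadden pseudo-R2 and AIC of a (near-unpenalized) logistic regression."""
    model = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))    # model log-likelihood
    p0 = np.clip(y.mean(), 1e-12, 1 - 1e-12)                # intercept-only model
    ll0 = np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0))
    k = X.shape[1] + 1                                      # slopes + intercept
    return 1 - ll / ll0, 2 * k - 2 * ll                     # (McFadden R2, AIC)

# Synthetic example: 40 participants, 5 acoustic features
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
apathy = (X[:, 0] > 0).astype(int)       # outcome strongly tied to the features
anxiety = rng.integers(0, 2, size=40)    # outcome unrelated to the features

r2_apathy, aic_apathy = fit_and_score(X, apathy)
r2_anxiety, aic_anxiety = fit_and_score(X, anxiety)
```

The related outcome yields a higher R2 and lower AIC than the unrelated one, mirroring how the dimensions are compared in Table 3.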
3 RESULTS
3.1 Depressive dimensions and cognition
A total of 40 participants were included in the study, and their demographic and clinical characteristics are detailed in Table 1. Among the 40 participants, 21 met the cut-off for depression according to the GDS questionnaire. Depressed individuals showed higher scores for each of the five GDS dimensions. Depressed and non-depressed participants did not differ in age or sex. However, comparing cognitive function, depressed patients showed significantly lower MoCA scores. As found in previous research,16 cognitive performance was most strongly correlated with the apathy dimension of the GDS (Figure 1).
| | Depressed (n = 21) | Non-depressed (n = 19) | p value |
| --- | --- | --- | --- |
| Sex (female/male) | 11/10 | 13/6 | p = 0.301 |
| Age (mean ± SD) | 72.7 ± 4.19 | 74.1 ± 4.85 | p = 0.337 |
| MoCA score (mean ± SD) | 20.6 ± 5.13 | 24.0 ± 4.37 | p = 0.032 |
| Total GDS (mean ± SD) | 9.24 ± 2.59 | 1.84 ± 1.92 | p < 0.001 |
| Dysphoric mood (mean ± SD) | 4.24 ± 1.67 | 0.32 ± 0.58 | p < 0.001 |
| Apathy (mean ± SD) | 2.19 ± 0.93 | 0.90 ± 1.10 | p < 0.001 |
| Hopelessness (mean ± SD) | 1.33 ± 1.11 | 0.26 ± 0.65 | p < 0.001 |
| Anxiety (mean ± SD) | 0.57 ± 0.60 | 0.16 ± 0.38 | p = 0.006 |
| Subjective memory (mean ± SD) | 0.71 ± 0.46 | 0.21 ± 0.42 | p < 0.001 |
- Abbreviations: GDS, Geriatric Depression Scale; MoCA, Montreal Cognitive Assessment; SD, standard deviation.

3.2 Acoustic speech features can classify LLD
We identified five features that met the pre-defined auROC threshold of 0.85: filtered auditory spectrum range (auROC = 0.95), Mel-frequency cepstral coefficient (MFCC) for speech articulation (auROC = 0.90), MFCC for vocal tract dynamics (auROC = 0.90), spectral skewness in the power spectrum (auROC = 0.87), and logarithmic harmonics-to-noise ratio (Log HNR; auROC = 0.87).
Implementing these features in both the random forest and XGBoost models, we found that the optimized parameters for the random forest model included 100 estimators, a maximum depth of 1, “gini” criterion, and maximum features set to “sqrt.” For XGBoost, the optimized parameters included a learning rate of 0.05, 100 estimators, and a maximum depth of 2. Using these parameters, XGBoost achieved a test set auROC of 0.84, while random forest achieved a test set auROC of 0.78 (Figure 2). Despite the small sample size, these results demonstrate that both models can generalize to new data and classify depression based on speech acoustic features.

To further explore feature importance, we applied SHAP analysis to the best-performing model, which was the XGBoost model (Table 2). The SHAP values showed that Minimum Range of Auditory Spectrum Filter and MFCC Coefficient 2 had the strongest impact on classification. In the training set, the SHAP values for these features were 0.848 and 0.921, and in the test set, they were 0.736 and 0.779, respectively.
| Feature | Train SHAP value | Test SHAP value |
| --- | --- | --- |
| Filtered Auditory Spectrum | 0.848 | 0.736 |
| MFCC Coefficient 2 | 0.921 | 0.779 |
| MFCC Coefficient 4 | 0.443 | 0.291 |
| Spectral Skewness | 0.557 | 0.518 |
| Log HNR | 0.617 | 0.557 |
- Note: SHAP values for the most important features in the XGBoost model, demonstrating their contribution to the classification of LLD in both the training and test sets. SHAP values range from ≈ –1 to 1, with higher positive values representing greater contributions to the likelihood of classifying an individual as having LLD. Negative values indicate that a feature decreases the likelihood of LLD classification.
- Abbreviations: LLD, late-life depression; Log HNR, logarithmic harmonics-to-noise ratio; MFCC, Mel-frequency cepstral coefficient; SHAP, SHapley Additive exPlanations.
3.3 Acoustic speech features that classify LLD are preferentially associated with apathy
After establishing that acoustic features can classify LLD in an outpatient old-age population, we sought to examine the sensitivity of these top five acoustic features to the dimensions of LLD. Using a set of logistic regression analyses, we examined how models based on the identified acoustic features are associated with dysphoric mood, apathy, and cognitive impairment. The results are summarized in Table 3 and illustrated in Figure S1 in supporting information. These five acoustic speech features were specifically associated with the apathy dimension of the GDS, and to a lesser extent with subjective memory, but not with other dimensions or cognitive performance. The specific association between these acoustic features and apathy could result from the apathy dimension dominating the total GDS depression score, for example, if most of the variance in the depression score were attributable to apathy. However, subsidiary control analyses showed that the dimension most highly correlated with the total GDS score was dysphoric mood, not apathy (Figure 3).
| | Apathy | | Dysphoric mood | | Hopelessness | |
| --- | --- | --- | --- | --- | --- | --- |
| R2 (model p value) | 0.434** (p = 0.008) | | 0.269 (p = 0.111) | | 0.217 (p = 0.260) | |
| AIC | 51.4 | | 57.6 | | 52.5 | |
| Features | Estimate | SE | Estimate | SE | Estimate | SE |
| Intercept | −8.86 | 6.12 | −8.67 | 5.82 | −3.63 | 5.86 |
| Filtered Auditory Spectrum | 25.71* | 12.33 | 11.77 | 10.05 | −0.959 | 10.70 |
| MFCC Coefficient 2 | 1.64 | 17.84 | 14.69 | 17.52 | 8.14 | 18.59 |
| MFCC Coefficient 4 | 86.73 | 104.45 | 85.60 | 103.53 | 52.25 | 107.93 |
| Spectral Skewness | −292.13* | 137.77 | −61.38 | 106.83 | 111.35 | 109.54 |
| Log HNR | 120.62 | 125.42 | 22.95 | 75.16 | 21.75 | 83.54 |

| | Subjective memory | | Anxiety | | Cognitive impairment | |
| --- | --- | --- | --- | --- | --- | --- |
| R2 (model p value) | 0.326* (p = 0.047) | | 0.203 (p = 0.264) | | 0.101 (p = 0.509) | |
| AIC | 56.1 | | 58.5 | | 57.2 | |
| Features | Estimate | SE | Estimate | SE | Estimate | SE |
| Intercept | −4.21 | 5.61 | −2.93 | 5.28 | −0.407 | 1.665 |
| Filtered Auditory Spectrum | 7.26 | 10.14 | −0.386 | 9.14 | −0.407 | 1.665 |
| MFCC Coefficient 2 | 11.90 | 17.02 | 8.43 | 15.51 | 0.25 | 8.63 |
| MFCC Coefficient 4 | 101.03 | 107.01 | 92.58 | 84.78 | 92.58 | 84.78 |
| Spectral Skewness | −349.27* | 170.84 | 59.49 | 104.82 | 7.723 | 16.84 |
| Log HNR | 43.84 | 83.71 | 7.72 | 16.84 | 21.75 | 83.54 |
- Abbreviations: Log HNR, logarithmic harmonics-to-noise ratio; MFCC, Mel-frequency cepstral coefficient; SE, standard error.
- *p < 0.05; **p < 0.01.

4 DISCUSSION
Using vocal acoustic features, we were able to train random forest and XGBoost models to classify LLD in a held-out sample. Furthermore, we found that the acoustic features that could reliably classify LLD were more strongly related to apathy than other depressive dimensions, such as dysphoric mood or cognitive impairment. These results indicate a predominance of apathy in the vocal signatures of LLD and suggest that the clinical heterogeneity of LLD should be considered in the development of acoustic markers.
The speech analysis identified five features that significantly differentiated individuals with and without LLD according to the GDS. The vocal features that contributed most to the classification model were Minimum Range of Auditory Spectrum Filter, MFCC Coefficient 2, MFCC Coefficient 4, Spectral Skewness, and Log HNR. These features are commonly used in vocal analysis and speech processing tasks46 and have been shown to be valuable in the detection of depressive symptoms.20, 25 MFCC coefficients and spectral skewness have been associated with vocal tract characteristics and prosodic elements, which are often impaired in depression.23 Additionally, features like spectral bandwidth and Log HNR reflect vocal “energy” and clarity, which are key components often altered in depressive speech.20, 22 While some of these features, such as MFCC and spectral skewness, have been observed in studies of ELD,21 we identified others that appear to be more specific to LLD, namely the filtered auditory spectrum range and MFCC for speech articulation.
The preferential association between vocal feature characteristics of LLD and apathy could not be simply explained by apathy being the most dominant dimension in GDS, as dysphoric mood explained the most variance in GDS in our sample. Instead, our results suggest that the classification power of acoustic speech features in LLD originated in unique voice characteristics related to apathy. These features may be related to blunted vocal affect, as has been previously suggested.47
The acoustic features identified in this study have also been shown to be altered in other conditions; similar speech characteristics have been reported in other neurodegenerative and psychiatric conditions, including Alzheimer's disease, Parkinson's disease, and mild cognitive impairment.21, 26 Together, these findings support the clinical need for a more refined dimensional approach to depression, taking into account heterogeneity in dimensions such as dysphoria, apathy, and cognitive impairment, each characterized by unique yet overlapping clinical features.24, 48
The relationship among depressive symptoms, apathy, and cognitive performance in LLD has been previously described. Previous studies have demonstrated an association between LLD and cognitive impairment,12 showing that depression may be both a risk factor for, and a manifestation of, cognitive decline.49 In line with these findings, LLD is associated with the development of neurodegenerative diseases, including Alzheimer's disease and Parkinson's disease.13 Moreover, the specific association between the apathy dimension (as opposed to the dysphoric dimension, among others), cognitive changes, and LLD, but not ELD, may further hint at their unique pathophysiology.50 Consistent with these findings, our study demonstrated a significant correlation between cognitive performance and the apathy dimension, but not other depressive dimensions of the GDS.
The distinct cognitive and vocal correlates of apathy within LLD support the possibility that apathy in LLD reflects a distinct clinical entity. This would have significant clinical implications, as apathy constitutes a clinical syndrome in itself with a less favorable response to antidepressant treatment.8 A better understanding of LLD heterogeneity offers promising prospects for more tailored interventions in the future.
4.1 Strengths and limitations
The study used validated cognitive assessments and depression questionnaires assessing the different dimensions of LLD. All questionnaires were administered by trained clinicians who were familiar with the patient population. Speech analysis was conducted using advanced machine learning models. Furthermore, addressing separate dimensions of depressive symptoms in LLD is unique in the context of voice analysis and allows for insight into different components within this heterogeneous disorder. However, it is important to acknowledge the limitations of our study.
First, our study is cross-sectional in nature, which hinders our ability to establish causal relationships or evaluate changes in depressive symptoms, cognitive function, and acoustic feature patterns over time. Second, the present study examined overall cognitive changes without addressing specific cognitive domains, limiting inferences about the contribution of specific cognitive domains and acoustic features. Third, the limited number of participants prevented the execution of more complex models, and confirmatory analyses in larger samples are required in future research. Fourth, the explained variance (R2) of 0.43 for the relationship between acoustic features and the apathy dimension indicates a moderate level of classification power, and additional factors beyond acoustic features may contribute to the variability in depressive dimensions. Fifth, this study specifically focused on LLD and did not include other neuropsychiatric conditions or cognitive impairments in the analysis; therefore, the ability of these acoustic features to distinguish LLD from other disorders remains untested, and future research should assess these features in broader clinical populations to determine their diagnostic specificity. Sixth, we used a pre-written text, which may limit the generalizability of our findings to "free" conversational speech.
In conclusion, our study identified acoustic vocal features that can differentiate between older individuals with and without LLD. While some of these features have been reported previously for depression in young adults, others have not, which raises the intriguing hypothesis for distinct mechanisms for depression in young adults and LLD. The vocal features differentiating LLD are predominantly sensitive to apathy, which emphasizes the importance of apathy symptoms to LLD, and more generally, the age-related heterogeneity in depression symptoms.
ACKNOWLEDGMENTS
We thank the patients for their participation in the study. N.W. was supported by an Israel Science Foundation Personal Research Grant (No. 1603/22).
CONFLICT OF INTEREST STATEMENT
D.H., S.S., M.G., N.W., and E.B. declare no conflicts of interest in relation to the subject of this study.
CONSENT STATEMENT
All human subjects provided informed consent.