The need for a harmonized speech dataset for Alzheimer’s disease biomarker development

This commentary is the product of a concerted effort to understand the needs, barriers, and gaps in the field of speech and language biomarkers for Alzheimer’s disease (AD). It distills interviews, surveys, and extensive correspondence with global leaders in the areas of dementia research, clinical trials, linguistics, and data analytics into an idealized clinical-study design for the harmonized collection of voice recordings. The ultimate goal of the effort is to democratize the ongoing speech and language analytics efforts by making such rich datasets available to the wider research ecosystem.


Introduction
Successful drug development for Alzheimer's disease (AD) depends on clinicians' ability to diagnose the disease and monitor its progression, especially via clear, measurable biomarkers that can detect subtle changes in patients' pathologic neuronal decline long before other, more serious symptoms appear. Alterations in speech and language are showing promise as possible early biomarkers of AD [1].
Researchers can collect and analyze speech and language information using new and improved technology, hardware, and data analytics. Likewise, ubiquitous use of smart devices enables remote data collection, both active (prompted by the user) and passive (without user prompts). These tools can measure acoustic features such as pitch and amplitude, as well as lexical and syntactic aspects of speech and features of written language such as contextual or semantic information, all of which are associated with early AD and its progression [2,3].
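As a rough illustration of what measuring pitch and amplitude can involve, the sketch below computes a root-mean-square amplitude and an autocorrelation-based pitch estimate for a single frame of audio. It is a minimal example under stated assumptions, not a method taken from the cited studies: the function names, the 75 to 400 Hz search range, and the synthetic test tone are all hypothetical choices made for illustration.

```python
import math


def rms_amplitude(samples):
    """Root-mean-square amplitude of a mono signal (sequence of floats)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak.

    fmin and fmax are illustrative search bounds (an assumption), not
    values prescribed by the commentary.
    """
    n = len(samples)
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(n - 1, int(sample_rate / fmin))
    # A later lag must beat the current best by a small margin, so that
    # near-ties at multiples of the true period resolve to the shortest lag.
    margin = 1e-6 * sum(x * x for x in samples) / n
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        overlap = n - lag
        # Mean product of the signal with a copy of itself shifted by `lag`.
        score = sum(samples[i] * samples[i + lag] for i in range(overlap)) / overlap
        if score > best_score + margin:
            best_lag, best_score = lag, score
    if best_lag is None:
        raise ValueError("signal too short for the requested pitch range")
    return sample_rate / best_lag


# Synthetic stand-in for a voiced frame: a 200 Hz tone, 0.1 s at 8 kHz.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(800)]
print(rms_amplitude(tone), estimate_pitch_hz(tone, sample_rate))
```

Real recordings would need framing, voicing detection, and noise handling before features like these become meaningful; the point here is only that such features are simple numerical summaries of the raw waveform.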

Open Access Commentary
Yet researchers have not been able to fully take advantage of the opportunities these tools can offer. To optimize speech and language biomarker discovery, researchers need a comprehensive speech-sample repository that covers a large, diverse cohort of subjects representing different accents, languages, speech and language components, and disease stages. They also need state-of-the-art participant characterization along with harmonized protocols and standards that cover the types of speech and language samples. These activities are nearly impossible for most research groups or startups to achieve on their own due to the costs associated with participant characterization [such as repeated positron emission tomography (PET) scans, magnetic resonance imaging (MRI), and blood-based biomarkers in large longitudinal cohorts].
We believe a global partnership between clinicians, researchers, and data scientists can meet these challenges, facilitating further identification, development and validation of speech-based biomarkers to enable researchers to apply artificial-intelligence algorithms for AD screening, detection, prediction, diagnosis, and monitoring. Existing consortia in related fields demonstrate that global collaboration and data sharing can indeed produce meaningful results (Table 1).
This manuscript summarizes interviews, surveys, and extensive correspondence with global leaders in the areas of dementia research, clinical trials, linguistics, and data analytics, and outlines an ideal approach to generating a comprehensive, gold-standard set of speech- and language-based data. The end product of such an approach would be 1) a rich, diverse, longitudinal, repeatedly measured, high-quality set of speech samples and 2) participant-characterization labels (such as imaging, blood-based biomarkers, or neuropsychological testing and clinical diagnosis) that researchers around the world can use to generate new diagnostic and prognostic algorithms. Here we focus on three broad areas: cohort selection, study design, and data collection and dissemination.

Cohort/patient selection
To obtain a set of speech samples with the greatest utility for researchers, patients should range from healthy controls (HC) with no risk factors and HC at high risk [such as carriers of the apolipoprotein E (APOE) ε4 allele], through preclinical/suspected and prodromal/mild cognitive impairment (MCI) stages, to mild AD and eventually to AD. Including disease controls, such as patients with Parkinson's disease or frontotemporal degeneration, is also important.
To use speech and language biomarkers as a measure of disease progression, the cohorts selected should allow for repeated, longitudinal, preferably high-frequency measurements. The cohorts should also include characterization using digital or traditional neuropsychological tests, genetic testing, MRI or PET imaging,

Protocol and study design and data collection
Given the diversity of potential approaches to collecting, processing, and analyzing speech- and language-based data, study design for a gold-standard dataset must carefully consider the attributes outlined in Table 2.
Voice recordings can be constrained, in which the subject is prompted to perform a clearly defined task such as recalling a list of words; unconstrained, in which speech samples are collected while the user is performing basic communication tasks such as talking with someone on the telephone; or somewhere in between (Figure 1).
Each of these approaches carries a different cognitive load and highlights different aspects of speech or language, and likewise provides the ability to reveal changes in speech, language, and interaction patterns in addition to changes across multiple cognitive domains. A dataset that combines the raw data from these assessments will provide the largest variety of speech and language features for analysis.

Table 2. Key considerations for the collection and development of a harmonized speech and language dataset

Cohort/patient selection
• Which patients should be included (e.g., pre-clinical, SCD, mild AD, MCI)? Should disease controls be included (e.g., PD, FTD)?
• Should patients with known risk factors (e.g., APOE positive) be included?
• What is the appropriate balance of ethnicities, geographic diversity, and genders?
• What is the appropriate cohort size?
• What is the minimum level of characterization required (e.g., neuropsychological tests, PET imaging, blood, CSF biomarkers)?
• How should a diversity of languages and accents be incorporated?

Study design
• Which types of speech samples should be collected? Consider spanning cognitive domains and cognitive load levels.
• Are the tests active or passive?
• Which speech sample collection tests will be best to characterize a patient's disease progression? Per disease stage?
• Which tests will be most applicable to real-world settings?
• What is the appropriate frequency and duration of test administration?
• Will the setting of data collection (in-clinic or remote) impact patient compliance?
• Can tests be refined/adjusted over time if needed?

Data collection and dissemination
• How can annotation and collection be consistently ensured?
• How can broad data sharing and access be facilitated while ensuring patient privacy?
• How can speech sample collection be harmonized? How can researchers ensure that data coming from different cohorts can be aggregated to one database?

SCD: subjective cognitive decline; PD: Parkinson's disease; FTD: frontotemporal dementia

Researchers must consider which aspects of speech and communication they can reliably and consistently collect across different cohorts using different technology platforms. They should also develop standardized protocols for administering, recording, labeling, and annotating (where applicable) the voice samples. Such standardized protocols are what will permit meaningful comparisons across cohorts and platforms.
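One way to picture a standardized labeling protocol is as a shared metadata record that every contributing cohort fills in the same way. The sketch below is purely illustrative: the class and field names, the category values, and the 16 kHz minimum sample rate are hypothetical assumptions chosen to mirror the considerations discussed in this commentary, not an agreed standard.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):  # hypothetical labels following the cohort range described above
    HC_NO_RISK = "healthy control, no risk factors"
    HC_HIGH_RISK = "healthy control, high risk"
    PRECLINICAL = "preclinical/suspected"
    MCI = "prodromal/mild cognitive impairment"
    MILD_AD = "mild AD"
    AD = "AD"
    DISEASE_CONTROL = "disease control"


class Task(Enum):  # constrained, unconstrained, or somewhere in between
    CONSTRAINED = "constrained"
    INTERMEDIATE = "intermediate"
    UNCONSTRAINED = "unconstrained"


@dataclass(frozen=True)
class SpeechSampleRecord:
    """Hypothetical per-recording metadata shared by all cohorts."""
    participant_id: str   # pseudonymous identifier, never a name
    cohort: str
    stage: Stage
    task: Task
    active: bool          # True if prompted, False if passively collected
    in_clinic: bool       # False for remote collection
    language: str         # e.g. a language tag such as "en-GB"
    sample_rate_hz: int


def validation_problems(record, min_sample_rate_hz=16000):
    """Return a list of human-readable problems; an empty list means valid.

    The 16 kHz threshold is an assumed placeholder, not a recommendation.
    """
    problems = []
    if not record.participant_id.strip():
        problems.append("participant_id is empty")
    if not record.language.strip():
        problems.append("language is empty")
    if record.sample_rate_hz < min_sample_rate_hz:
        problems.append(
            f"sample rate {record.sample_rate_hz} Hz is below {min_sample_rate_hz} Hz"
        )
    return problems


example = SpeechSampleRecord(
    participant_id="P-0001", cohort="cohort-A", stage=Stage.MCI,
    task=Task.CONSTRAINED, active=True, in_clinic=False,
    language="en-GB", sample_rate_hz=44100,
)
print(validation_problems(example))
```

Running the same validation at every site before upload is one simple way to catch recordings that could not later be aggregated into a single database.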

Data dissemination and privacy
The utility of the speech and language dataset depends on researchers' ability to access and analyze it while still maintaining patient privacy and data security. An ideal data sharing platform should address aspects of access (open, limited, nested) and enable virtual processing of datasets within the repository to maintain patient privacy. Possible approaches include allowing researchers to process raw data, run their algorithms, and extract features on a remote, privacy-preserving server rather than downloading the data onto individual computers. Moreover, different levels of processing could be allowed for each interested party, such as limiting access to phonetic and acoustic features, thereby preserving subjects' privacy as much as possible. Approaches to maintaining privacy are evolving, and best practices should be implemented and updated as appropriate. Existing voice repositories, such as the Linguistic Data Consortium and DementiaBank, can serve as examples [4,5].
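The idea of granting different levels of processing to different parties can be sketched as a mapping from an access tier to the fields a user is allowed to see. This is a toy illustration only: the tier names, field names, and nesting order are hypothetical assumptions, and a real platform would enforce such rules on the server alongside authentication, auditing, and governance review.

```python
from enum import Enum


class AccessTier(Enum):  # hypothetical tiers, nested from least to most access
    FEATURES_ONLY = "features_only"  # derived acoustic/phonetic features
    TRANSCRIPTS = "transcripts"      # features plus transcripts
    REMOTE_RAW = "remote_raw"        # raw audio, processed on the server only


# Each tier includes everything visible to the tier before it.
_VISIBLE_FIELDS = {
    AccessTier.FEATURES_ONLY: {"sample_id", "acoustic_features"},
    AccessTier.TRANSCRIPTS: {"sample_id", "acoustic_features", "transcript"},
    AccessTier.REMOTE_RAW: {
        "sample_id", "acoustic_features", "transcript", "raw_audio",
    },
}


def view_for(sample, tier):
    """Return a copy of `sample` restricted to the fields allowed for `tier`."""
    allowed = _VISIBLE_FIELDS[tier]
    return {key: value for key, value in sample.items() if key in allowed}


sample = {
    "sample_id": "S-42",
    "participant_name": "never exposed at any tier",
    "acoustic_features": {"pitch_hz": 180.0, "rms": 0.12},
    "transcript": "the boy is reaching for the cookie jar",
    "raw_audio": b"\x00\x01",
}
print(sorted(view_for(sample, AccessTier.FEATURES_ONLY)))
```

Using an allow-list rather than a block-list means any field not explicitly granted, such as the participant's name here, stays hidden by default.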

Conclusion
A comprehensive, harmonized, open-access speech-sample repository covering large, diverse, well-characterized cohorts of subjects can enable the development of better biomarkers that characterize the onset and progression of AD (and other neurodegenerative diseases) in a minimally invasive, low-cost way. At the same time, democratizing speech and language analytics must be a joint effort: collaboration and cooperation are key at every step. Together, these can facilitate substantial shifts in neurodegeneration research.