Key considerations for the collection and development of a harmonized speech and language dataset

Cohort/patient selection | Study design | Data collection and dissemination
  • Which patients should be included (e.g., pre-clinical, SCD, mild AD, MCI)? Should disease controls be included (e.g., PD, FTD)?

  • Should patients with known risk factors (e.g., APOE ε4 carriers) be included?

  • What is the appropriate balance of ethnicities, geographic diversity, and genders?

  • What is the appropriate cohort size?

  • What is the minimum level of characterization required (e.g., neuropsychological tests, PET imaging, blood, CSF biomarkers)?

  • How should a diversity of languages and accents be incorporated?

  • Which types of speech samples should be collected? Consider spanning cognitive domains and cognitive load levels.

  • Are the tests active or passive?

  • How are the tests categorized (e.g., constrained, non-constrained)?

  • Which speech sample collection tests best characterize a patient’s disease progression, both overall and at each disease stage?

  • Which tests will be most applicable to real-world settings?

  • What is the appropriate frequency and duration of test administration?

  • Will the setting of data collection (in-clinic or remote) impact patient compliance?

  • Can tests be refined/adjusted over time if needed?

  • How can consistent annotation and collection be ensured?

  • How can broad data sharing and access be facilitated while ensuring patient privacy?

  • How can speech sample collection be harmonized? How can researchers ensure that data from different cohorts can be aggregated into a single database?

SCD: subjective cognitive decline; PD: Parkinson’s disease; FTD: frontotemporal dementia