Data science techniques to gain novel insights into quality of care: a scoping review of long-term care for older adults

Background: The increase in powerful computers and technological devices, as well as new forms of data analysis such as machine learning, has resulted in the widespread availability of data science in healthcare. However, its role in organizations providing long-term care (LTC) for older adults has yet to be systematically synthesized. This analysis provides a state-of-the-art overview of 1) the data science techniques that are used with data accumulated in LTC and for what specific purposes, and 2) the results of these techniques in researching the study objectives at hand. Methods: A scoping review based on the guidelines of the Joanna Briggs Institute. PubMed and the Cumulative Index to Nursing and Allied Health Literature (CINAHL) were searched using keywords related to data science techniques and LTC. The screening and selection process was carried out by two authors and was not limited by any research design or publication date. A narrative synthesis was conducted based on the two aims. Results: The search strategy yielded 1,488 studies: 27 studies were included, of which the majority were conducted in the US and in a nursing home setting. Text-mining/natural language processing (NLP) and support vector machines (SVMs) were the most deployed methods; accuracy was the most used metric. These techniques were primarily utilized for researching specific adverse outcomes, including the identification of risk factors for falls and the prediction of frailty. All studies concluded that these techniques are valuable for their specific purposes. Discussion: This review reveals the limited use of data science techniques on data accumulated in or by LTC facilities. The low number of included articles in this review indicates the need for strategies aimed at the effective utilization of data with data science techniques and evidence of their practical benefits. There is a need for a wider adoption of these techniques in order to exploit data to their full potential and, consequently, improve the quality of care in LTC by making data-informed decisions.


Introduction
Data science is a rapidly evolving field that offers many valuable applications for healthcare and may be defined as a set of fundamental principles that support and guide the extraction of information and knowledge from often vast amounts of data, also known as "big data". Big data refers to large amounts of data that often originate from different sources [e.g., websites, electronic health records (EHRs), questionnaires, and interviews], are collected quickly, and are often not only numerical in nature. Although no single widely accepted definition of big data appears to be available, the concept is often described using the four V's [1]: volume, variety, velocity, and veracity. Volume refers to large volumes of data; variety applies to the different forms and domains of data, which can be analyzed individually but can also be combined; velocity relates to the fast rate at which the data are collected and stored; and veracity refers to the quality of the data.
Examples of data science techniques often used for the analyses of vast amounts of healthcare data include data- and text-mining, machine learning (ML), pattern recognition, and neural networks [2]. Systematic reviews on the effectiveness of big data in healthcare have concluded that it may lead to positive changes in health behavior, as well as improved public health policy-making and overall decision-making [3][4][5]. In addition, these studies argued that the vast amounts of data have the potential to improve the quality of care while simultaneously reducing costs and lowering readmission rates. They may also support policy-makers and clinicians in developing public policy and service delivery, and assist hospital management in improving the efficiency of care services and the provision of personalized care to patients [2,6]. Despite these promising benefits, the use of these vast amounts of data and innovative data science methods in long-term care (LTC) for older adults seems to be lagging behind other healthcare areas such as hospitals [7,8]. Hence, LTC organizations are not currently using the growing amount of data they collect on a daily basis to gain novel insights and foster improvements.
LTC may be characterized as a "set of services delivered over a sustained period of time to people who lack some degree of functional capacity" and can be provided either at home or in LTC facilities such as nursing homes (NHs) or assisted living facilities [9][10][11]. In many countries, LTC is being confronted with significant demographic changes and staff shortages while trying to provide high levels of care and remain financially sustainable [12]. Emerging technological advances and the continuous implementation of digitalization have the potential to mitigate these challenges, at least partly. Information is of utmost importance: the more high-quality data there are, the better care can be organized [13]. As volumes of data continue to pile up and data science gradually penetrates all parts of healthcare, the possibilities of data science for providing novel information, and thus knowledge, related to quality of care for clients and quality of work for staff in LTC can be considered endless. However, the role of data and data science (techniques) in LTC remains unclear.
Published reviews regarding LTC have focused on specific individual smart technologies such as sensors or robotics, and merely examined the technology itself rather than the data it accumulated [14,15]. In addition, a recent review on LTC concentrated solely on the acceptability and effectiveness of artificial intelligence (AI) interventions such as smartphone applications, thereby excluding other types of data gathered for LTC [16]. Hence, the literature on the use of data science techniques on data accumulated in LTC has yet to be systematically synthesized. We therefore systematically reviewed the literature on the application of data science techniques to analyze (large amounts of) data collected in or by LTC organizations to gain novel insights. The aim of this review was twofold: 1) to assess what data science techniques are used on data accumulated in LTC and for what specific purposes, and 2) to assess the results of these techniques in researching study objectives.

Methods
A scoping review was conducted. Both the recently updated guidelines for scoping reviews by the Joanna Briggs Institute [17] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension for scoping reviews checklist [18] were followed.

Inclusion and exclusion criteria
Publications were included if: 1) they reported on a data science technique for obtaining information from data, which might include "rather novel" techniques such as deep learning and text-mining, but also more "traditional" techniques such as regression analyses (since there is considerable overlap between math, statistics, data science, and computer science [19] and this review is the first of its kind, a broad scope was chosen); 2) they were based on data accumulated in or by an LTC facility for older adults, with a facility being considered an LTC facility if it accorded with the following description by Sanford et al. (2015) [9]: "LTC occurs in a residential facility or NHs and is primarily intended for those who require assistance with activities of daily living and instrumental activities of daily living, and/or for those who have behavioral problems due to dementia"; and 3) they reported original research (e.g., letters to the editor or comments were excluded). Studies were also excluded if they were not published in English or if the full text was not available. The search was not limited by research design or publication date.

Selection process
The screening and selection process was carried out by two authors (AH and SA) (see Figure 1); the data were extracted in duplicate into separate Excel forms (available upon request). The studies yielded by the search strategy were first screened for eligibility based on their titles. Titles that did not comply with the pre-specified inclusion criteria were removed, while ambiguous ones were kept separate and further discussed among all co-authors. Afterward, the abstracts of titles that fit the pre-specified inclusion criteria were screened. Abstracts that did not meet the inclusion criteria were removed and the reasons for removal were noted. The remaining publications were assessed for eligibility based on their full texts. Those that did not meet the inclusion criteria based on their full text were assessed as ineligible and excluded from the current review. Again, the reasons for exclusion were noted.

Data extraction and analyses
The standardized form for data extraction in the Joanna Briggs Institute guidance was used as a basis and adapted to meet the needs of the current scoping review [17]. The study characteristics were described in tabulated form: author(s), year of publication, country of origin, objective, setting and study population, analyzing technique, metric used, conclusion, limitations, and whether ethical approval had been obtained (see Table 1). The overall findings were reported by means of narrative synthesis based on the two postulated aims. In order to provide a broad overview of this topic, a methodological quality assessment of the included works was not performed, consistent with the methodology of scoping reviews [17].

Results
The search strategy yielded 1,488 studies. After the screening of titles and abstracts, 71 studies were read and assessed for eligibility based on a detailed analysis of their full texts. In total, 27 studies fulfilled the pre-specified inclusion criteria and were assessed as eligible for the current scoping review. The main reasons for exclusion were a lack of data-analyzing techniques or being conducted in a setting other than LTC. The selection process is visualized in the flowchart shown in Figure 1.
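The counts reported above imply the number of records removed at each screening stage. The following is a minimal sketch of that arithmetic, assuming that no separate de-duplication step took place between the search and the title/abstract screening (this section does not report one); the variable names are illustrative only.

```python
# Counts reported in the Results section.
records_identified = 1488   # yielded by the search strategy
full_text_assessed = 71     # remaining after title and abstract screening
studies_included = 27       # fulfilled the inclusion criteria

# Assumption: no separate de-duplication step, so every record that was not
# carried forward is counted as excluded at that stage.
excluded_title_abstract = records_identified - full_text_assessed
excluded_full_text = full_text_assessed - studies_included
inclusion_rate = studies_included / records_identified

print(f"Excluded on title/abstract: {excluded_title_abstract}")
print(f"Excluded on full text: {excluded_full_text}")
print(f"Included share of all records: {inclusion_rate:.1%}")
```

Under this assumption, 1,417 records were removed during title and abstract screening and 44 after full-text assessment, so roughly 1.8% of the identified records were included.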

Characteristics of the included studies
A detailed overview of the characteristics of each included study is shown in Table 1. The majority of studies were published between 2020 and the end of 2022. The countries in which the studies were conducted were diverse: six studies were conducted in the US [20][21][22][23][24][25], four in Australia [26][27][28][29], three each in Japan [30][31][32] and China [33][34][35], two each in Korea [36,37], France [38,39], and Spain [40,41], and one each in the United Kingdom [42], the Netherlands [43], Ireland [44], Canada [45], and Belgium [46]. The number of included LTC facilities and the size of the study population varied greatly between publications. About half of the studies reported that they had obtained ethical approval from a review board.

Data science techniques used and purpose
A diverse set of data-analyzing techniques was used in the studies. The most frequently reported techniques were a form of regression (n = 8), text-mining/NLP (n = 8), random forest models (n = 5), and SVMs (n = 4); several studies used more than one method. In terms of metrics, accuracy was the most used (n = 5); seven studies did not report a metric. While some studies mentioned deploying an ML technique, other studies referred to AI or algorithms. ML is a part of AI, while algorithms can be considered part of ML, and thus of AI [47], indicating that different terms are used interchangeably to report on the data science techniques at hand. In addition, the terms text-mining and NLP are both used to refer to analyzing (large) amounts of text. Studies did not report on the use of supervised, unsupervised, or semi-supervised methods.
ML techniques were used to predict factors for pressure ulcers and falls, to identify and assess the risk of COVID-19 infection, as well as to develop a recommendation system for preference assessment and infectious diseases. A neural network based on deep learning was used to predict the risk and time of falling and to improve nurse-patient interaction. Text-mining was applied to EMR data in order to identify risk factors related to medication management. A likelihood-based pursuit data mining technique was employed to predict the likelihood of falls. AI software was used to analyze the facial emotion of residents with dementia. Several publications deployed algorithms. One study reported an AI algorithm utilized to identify frailty, while another study reported an algorithm to infer individualized visual models of human behavior. Modified immune algorithms were used to find the most favorable solutions for spatial optimization and, lastly, algorithms were also deployed to identify person-to-person transmission paths during an illness outbreak.

Outcomes of the included studies
All studies concluded that the data science technique used was "effective": each study reported that the data science technique was useful for the study objective at hand. Words and sentences such as "was useful to infer…", "was able to provide information on…", and "can be used to" were stated in the conclusion sections of the included studies.
Three studies compared various ML techniques (e.g., random forest, logistic regression, and naive Bayes) in terms of accuracy and predictive values related to pressure ulcers, falls, and infectious diseases in NHs, respectively; two of them concluded that a random forest model provided the greatest accuracy and prediction for these outcome measures. Other studies using ML techniques were able to quantify and predict the risk of COVID-19 infection in NHs, provide accurate recommendations on potential preferences for an NH resident, map spatial accessibility to high-quality NH care, and predict falls. A convolutional neural network (CNN) based on deep learning was found to be accurate in fall prediction among NH residents and to be able to predict the time of falling for those with Alzheimer's disease. In addition, another study deploying a CNN showed that real-time video analyses effectively improved the efficiency of nurse-patient interaction.
Studies using text-mining techniques displayed the ability to identify risk factors related to failed medication management in care homes. In addition, another study analyzing large amounts of text showed that NLP can be valuable in evaluating agitation in people with dementia, and that the identified behaviors can inform improvements in aged care and nursing. A likelihood-based pursuit technique was able to identify factors associated with falls and to make fall likelihood predictions based on these factors among LTC facility residents. Two studies using AI for the analyses of facial emotion showed that the AI technique was able to identify the beneficial effect of makeup therapy on the cognitive function of female patients. In addition, they reported that AI may be superior to self-reported scales because it does not depend on the verbal ability and cognition of the patient at hand. An SVM algorithm was found to be capable of accurately identifying frailty among residential care facility (RCF) residents based on data held in a routinely collected residential care administrative dataset. Moreover, a modified immune algorithm, using data from the geographic information system, was able to evaluate the current configuration of RCFs in a district of Shanghai. A computerized algorithm provided information on the dynamics of a person-to-person transmitted influenza outbreak in NHs, thereby making it possible to investigate such events. Studies using regression analyses, a more traditional analyzing method, showed that COVID-19 outbreaks led to adverse outcomes such as reductions in nursing staff levels and that COVID-19 vaccine mandates were associated with increased staff vaccinations.

Discussion
The current scoping review is the first to provide an overview regarding the use of data science techniques on data accumulated in LTC. The results show that, even with a very broad scope, only 27 articles were identified, highlighting the limited use of data science techniques in or by organizations providing LTC to analyze the data they accumulate on a day-to-day basis.
Although only a small number of publications were included in this review, and several of these studies included only a small number of participants, all of them concluded that the data science technique at hand was effective and useful for the study objective. All the studies discussed the usefulness of these techniques in qualitative terms and in terms of future potential. However, even with the potential benefits that (large amounts of) data and data science techniques seem to offer, LTC might struggle with the same problems that other healthcare sectors (e.g., hospitals) faced or are still facing: e.g., an absence of knowledge about which data to use and for which purpose, as well as the lack of an appropriate and comprehensive data infrastructure within organizations [48]. In addition, LTC organizations include a variety of data sources that all collect information in various forms: e.g., medical data in EHRs, unstructured textual data based on interviews regarding the experienced quality of care, or real-time data accumulated by sensors or wearables [7,49]. The integration of these (semi-)structured data, stemming from a large variety of sources, is a challenge in itself. Strategies for mitigating these challenges, including a sufficient data infrastructure and personnel with expertise in data and communication technology, are required in order to utilize the full potential of data accumulated in and by LTC [49,50]. Since the majority of studies included in this review were published in or after 2020 (with 10 articles being published in 2022), the popularity of data science within this care setting may rise in upcoming years. Increasing funding to support research on data accumulation and analyses in LTC, along with integrative collaborations between health scientists and computing experts (e.g., data scientists), may help to address the challenges within this specific care echelon.
Several different data-analyzing techniques were deployed in the included studies, of which text-mining/NLP, regression models, and random forest models were the most prevalent. These techniques have already been proven to be useful in other healthcare areas [2,6], and may therefore be more widely known and used. Interestingly, data science techniques such as text-mining/NLP, a process aimed at analyzing large amounts of natural language data [51], were primarily reported on in 2022. A review conducted in 2018 reported NLP to be among the most used big data techniques in clinical and operational healthcare [6]. In LTC, much quantitative and qualitative information is digitally recorded in EHRs: e.g., client characteristics (e.g., socio-demographic characteristics) and data on various quality indicators (e.g., pressure ulcers) are collected to map the quality of care as well as the quality of life. These data would be well suited to data exploration using text-mining/NLP. For example, text fields in EHRs can be analyzed: can certain words (e.g., "imbalance", "restlessness", or even specific types of medication) predict future falls in clients or future agitated behavior? These large amounts of text can thus be utilized to identify and predict critical behavior or symptoms and, in turn, initiate actions in a timelier manner. Interestingly, the terms ML, AI, and algorithms seem to be used interchangeably and for the same purpose: to describe the method that was deployed (e.g., some studies report deploying an "AI method", while others report using ML or "a powerful algorithm to predict"). While the terms AI, ML, and algorithms fall in the same domain as data science and are indeed interconnected, they each have specific applications and meanings [47]. When reviewing the metrics used, accuracy, which measures the proportion of correct predictions made by a model, was most prevalent. However, various studies did not specify their method(s) or metric(s). In order to be more transparent about the usefulness of the methods, more information regarding these measurements should be included in upcoming studies. Especially with the rise of emerging methods such as large language models (e.g., ChatGPT), which have the ability to speed up the use of data science techniques in LTC, information about the methods and metrics used is needed in order to indicate their usefulness for daily care practice.
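To make the two ideas in this paragraph concrete (screening EHR text fields for specific words, and reporting a metric alongside the method), the following minimal Python sketch flags notes that contain the example keywords and then computes accuracy, sensitivity, and specificity. It is an illustration under stated assumptions and not a method used by any of the included studies: the notes and fall labels are invented toy data, the keyword list is limited to the two words quoted above, and the function names are hypothetical.

```python
import re

# Hypothetical free-text notes with an invented label (1 = fall occurred later).
NOTES = [
    ("Resident showed imbalance when standing up.", 1),
    ("Slept well, no complaints.", 0),
    ("Restlessness during the night, wandered in hallway.", 1),
    ("Some restlessness after dinner, settled quickly.", 0),
    ("Ate breakfast independently.", 0),
    ("Tripped over rug during transfer.", 1),
]

# Keywords taken from the example in the text; the list is illustrative only.
KEYWORDS = ("imbalance", "restlessness")
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def flag_note(text):
    """Return 1 if the note mentions any keyword, else 0."""
    return int(PATTERN.search(text) is not None)


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


y_true = [label for _, label in NOTES]
y_pred = [flag_note(text) for text, _ in NOTES]
result = metrics(y_true, y_pred)

for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Reporting sensitivity and specificity next to accuracy matters for outcomes that are rare: a rule that never flags a fall can reach high accuracy while detecting none of the residents at risk.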
The majority of studies in this review were focused on the prediction of adverse health problems such as falls, pressure ulcers, and infectious diseases. Not surprisingly, these health problems are reported to be among the most prevalent in LTC organizations [12,52,53]. Hence, these studies underscore that novel data-analyzing techniques are used to predict the incidence of already well-known daily care problems in LTC.
Surprisingly, not all studies reported that they had obtained approval from an ethical review board or committee. Ethics forms a major concern due to the vulnerability of patients in LTC and due to the inherent sensitivity of health-related data [7,54]. With the increasing amount of data available in healthcare and, more specifically, in LTC, data ethics have become increasingly important in this sector. Ethical mistakes may lead to social rejection or imperfect policies and legislation, perhaps resulting in a diminished acceptance and progress of data science within the field of LTC [54].
The current study is the first to provide information on the use of data science techniques in LTC, potentially raising awareness about the variety of opportunities these techniques may provide to this specific care echelon. This review will provide researchers with a useful base for understanding the overall context of data science techniques deployed in LTC. However, the current results need to be viewed in the light of some possible limitations. Firstly, by focusing on PubMed and CINAHL, there is a possibility that work published in journals not covered by these databases has been omitted. However, PubMed alone already includes more than 33 million citations and is the most used database in the health domain, especially in LTC. Secondly, in accordance with the guidelines for scoping reviews, a methodological quality assessment of the included studies was not performed [17]. Hence, no conclusion regarding topics such as incomplete data, the effectiveness of the deployed methods (e.g., in terms of the small number of participants included in some of the studies), or the external validation of the included studies can be formulated.
Since the studies in this review discussed the usefulness of data science techniques in qualitative and potential terms, more quantitative and objective measures are needed. To make these techniques more widespread and integrated in LTC (as they are in, for example, hospital care), research should provide solid evidence that, based on the analyses of data by these types of techniques, health decisions and outcomes can indeed be improved for individual clients. Hence, in order to implement data-informed LTC, more thorough evidence regarding the usefulness of data science in directly or indirectly changing and improving daily care practice is needed. This could, for example, include the use of metrics such as accuracy and sensitivity/specificity. Future analyses could also focus on investigating the current state of evidence regarding the use of data science techniques with data accumulated in a home-based LTC environment. The application of these techniques in a home-based LTC environment (i.e., community-dwelling older adults receiving care) remains unexplored, and the findings of such a review may supplement those of this analysis. As LTC for older adults is also provided at home [10], the combined evidence of both reviews would produce an even more complete overview of the use of these techniques. However, given the small number of included studies in this review, the number of studies focused on data science techniques used for data accumulated in a home-based LTC environment might also be quite small.
In conclusion, this review presents a useful starting point for future applications of data science techniques in LTC by creating awareness of the ramifications of data and the corresponding analyzing techniques. Currently, in LTC, data science techniques are used for a variety of purposes and are advantageous for the specific study objective in each of the included studies. Although data science presents promising opportunities to reshape the use of data within this area (especially given the rise of new techniques such as ChatGPT) in order to improve the quality and efficiency of care, the low number of identified articles indicates the need for strategies aimed at the effective utilization of data with data science techniques and evidence of their practical benefits.
PubMed and the Cumulative Index to Nursing and Allied Health Literature (CINAHL) were searched for relevant studies. The search was conducted in December 2022. Medical Subject Headings (MeSH) terms, standardized keywords manually assigned by indexers of the National Library of Medicine, were used. The following search string was used: ("Big Data"[MeSH Terms] OR "Big Data analytics"[All Fields] OR "data analytics"[All Fields] OR "Data Science"[MeSH Terms] OR "Medical Informatics"[MeSH Terms] OR "Artificial Intelligence"[MeSH Terms] OR "Machine Learning"[MeSH Terms] OR "Deep Learning"[MeSH Terms] OR "Data Mining"[MeSH Terms] OR "text mining"[All Fields]) AND ("Residential Facilities"[MeSH Terms] OR "residential home*"[All Fields] OR "care home*"[All Fields] OR "Assisted Living Facilities"[MeSH Terms] OR "Homes for the Aged"[MeSH Terms] OR "Nursing Homes"[MeSH Terms]).
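The structure of this search string (one block of technique terms and one block of setting terms, each joined by OR and then combined with AND) can be reproduced programmatically, which may help when the search is rerun or adapted. The sketch below is only an illustration: the terms and field tags are copied from the string above, while the helper names `or_block` and `build_query` are hypothetical and not part of any PubMed tooling.

```python
# Terms and field tags copied from the search string reported above.
TECHNIQUE_TERMS = [
    ("Big Data", "MeSH Terms"),
    ("Big Data analytics", "All Fields"),
    ("data analytics", "All Fields"),
    ("Data Science", "MeSH Terms"),
    ("Medical Informatics", "MeSH Terms"),
    ("Artificial Intelligence", "MeSH Terms"),
    ("Machine Learning", "MeSH Terms"),
    ("Deep Learning", "MeSH Terms"),
    ("Data Mining", "MeSH Terms"),
    ("text mining", "All Fields"),
]
SETTING_TERMS = [
    ("Residential Facilities", "MeSH Terms"),
    ("residential home*", "All Fields"),
    ("care home*", "All Fields"),
    ("Assisted Living Facilities", "MeSH Terms"),
    ("Homes for the Aged", "MeSH Terms"),
    ("Nursing Homes", "MeSH Terms"),
]


def or_block(terms):
    """Join quoted, field-tagged terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"[{field}]' for term, field in terms) + ")"


def build_query(*blocks):
    """Combine several OR blocks with AND."""
    return " AND ".join(or_block(block) for block in blocks)


query = build_query(TECHNIQUE_TERMS, SETTING_TERMS)
print(query)
```

Keeping the two concept blocks as separate lists makes it straightforward to add a term to one block without disturbing the Boolean structure of the other.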

Figure 1. Flowchart displaying the selection process

Abbreviations
AI: artificial intelligence
CINAHL: Cumulative Index to Nursing and Allied Health Literature
CNN: convolutional neural network
COVID-19: coronavirus disease 2019
EHRs: electronic health records
EMR: electronic medical record
LTC: long-term care
MeSH: Medical Subject Headings
ML: machine learning
NH: nursing home

Table 1. Characteristics and conclusions of included studies