A machine learning approach to identify correlates of current e-cigarette use in Canada

Aim: The popularity of electronic cigarettes (e-cigarettes) is soaring in Canada. Understanding person-level correlates of current e-cigarette use (vaping) is crucial to guide tobacco policy, but prior studies have not fully identified these correlates due to model overfitting caused by multicollinearity. This study addressed this issue by using a classification tree, a machine learning algorithm.
Methods: This population-based cross-sectional study used the 2017 Canadian Tobacco, Alcohol, and Drugs Survey (CTADS), which targeted residents aged 15 or older. Forty-six person-level characteristics were first screened in a logistic mixed-effects regression procedure for their strength in predicting vaper type (current vs. former vaper) among people who reported having ever vaped. A 9:1 ratio was used to randomly split the data into a training set and a validation set. A classification tree model was developed on the training set using the selected predictors and cross-validation, and was assessed on the validation set using sensitivity, specificity and accuracy.
Results: Of the 3,059 people with an experience of vaping, the average age was 24.4 years (standard deviation = 11.0); 41.9% were female and 8.5% were aboriginal. There were 556 (18.2%) current vapers. The classification tree model performed relatively well and suggested that attraction to e-cigarette flavors was the most important correlate of current vaping, followed by young age (< 18) and believing vaping to be less harmful to oneself than cigarette smoking.
Conclusions: Vaping due to flavors is associated with a very high probability of being a current vaper. The findings of this study provide evidence that supports the ongoing ban on flavored vaping products in the US and suggest that a similar regulatory intervention may be effective in Canada.
Open Access Original Article © The Author(s) 2021.
This is an Open Access article licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Exploration of Medicine


Introduction
Canada and the US have recently witnessed exponential growth of e-cigarette use (vaping), raising worldwide public health concern of a new nicotine epidemic [1,2]. While some studies have shown the benefit of vaping in assisting with smoking cessation [3][4][5], evidence has also directly linked vaping to health conditions including respiratory irritations [6] and lung damage [7]. Indeed, the US Centers for Disease Control and Prevention reports 2,807 cases of vaping-related lung injury, including 68 deaths, as of February 2020 [8]. In September 2019, Canada confirmed its first case of severe pulmonary illness related to vaping, which involved a high school student [9].
The majority of people who have tried vaping do not continue to use the device in the long run [10,11]. It is therefore crucial to identify the small group of users who are likely to become long-term vapers, as this may indicate vaping dependency that could lead to chronic health effects. Prior studies have suggested a set of characteristics that may be unique to current vapers, including being female, younger age, use of a certain type of vaping device, and initiating vaping due to attraction to flavors and lower cost [11][12][13][14][15]. However, these results were obtained primarily through regression, where multicollinearity is a concern. As person-level variables are usually correlated, e.g., younger people are easily attracted to e-cigarette flavors and are more likely to perceive vaping to have lower risks [16], it is difficult to isolate a set of independent predictors of current vaping using regression alone. Hence, this issue warrants the use of more advanced statistical techniques.
Identifying current vapers among people with an experience of vaping represents a supervised binary classification task in machine learning, a discipline of computer science with increasing popularity in health research [17][18][19]. Compared with conventional regression, machine learning leverages computational power to reduce multicollinearity and improve overall model performance. Applications of machine learning in tobacco research have emerged in recent years [20][21][22][23][24][25], but so far, only one such application has addressed vaping behaviours. In this Holland-based study, a random forest model was used in conjunction with cross-sectional survey data to distinguish adult exclusive vapers from dual users of both cigarettes and e-cigarettes [25]. Here we present a simpler and more intuitive machine learning model, a classification tree, to identify and understand the importance of person-level correlates of current vaping. In other fields of tobacco research, classification trees have demonstrated good performance in predicting lab-verified smoking cessation status [20], adherence to nicotine replacement therapy [23] and use of tobacco within 30 min of waking up [21]. Hence, we aimed to verify the performance of the classification tree in vaping research and to provide actionable implications for policy interventions regarding e-cigarettes in a Canadian context.

Study design and sample
This population-based cross-sectional study used data from the 2017 Canadian Tobacco, Alcohol, and Drugs Survey (CTADS) that included 16,349 Canadian residents aged 15+ (excluding institutional residents) from ten provinces (excluding Yukon, Northwest Territories and Nunavut) [26]. The CTADS is a well-validated survey with a range of studies being published using the 2017 data [27][28][29][30]. We used the question, "Have you ever tried an electronic cigarette, also known as e-cigarette?" to identify all of the 3,059 respondents with an experience of vaping.

Outcome
A binary outcome variable was created to represent vaper type (current vs. former vaper). Respondents were defined as current vapers if they answered "every day" or "occasionally" to the question, "At the present time, do you use an electronic cigarette, also known as an e-cigarette every day, occasionally or not at all?" Respondents who answered "not at all" to the same question were defined as former vapers.

Candidate correlates
A wide range of person-level characteristics were explored as potential correlates of vaper type. These 46 variables were mostly categorical, except for age and years of smoking, which were continuous. These variables described demographics, socioeconomic factors, household information, health, vaping behaviours, substance use and perceived risk of vaping and smoking (see below).

Statistical analysis
We summarized the characteristics of current vs. former vapers and used two-sample tests (Fisher's exact test, t-test or the Mann-Whitney U-test) to compare their distributions.
Variable selection has been shown to be a necessary procedure prior to classification tree analysis to reduce the risk of model overfitting and spurious associations [20]. Hence, we used each candidate correlate to predict the odds of being a current vaper in a logistic mixed-effects model with a random intercept for province of residence. This model estimates an unadjusted fixed effect for each correlate across all individuals while allowing the baseline odds of current vaping to vary across provinces. Correlates associated with a significant fixed effect (2-sided P-value < 0.05) were selected for the machine learning analysis. This procedure was performed in R using the "lme4" package [31].
We followed the Classification and Regression Tree (CART) algorithm to develop a tree model to classify current and former vapers [32] using the R package "rpart" [33]. CART is a non-parametric method that develops a classification tree by recursive partitioning. In this tree, a node is where a splitting variable, X, and one of its values, c, divide the dataset into two regions, X ≤ c and X > c (or X = 0 and X = 1 for binary X), that correspond to the predicted classes of current and former vapers. The optimal splitting variable X and value c minimize the Gini index at a node, an impurity criterion that measures how well a split separates true current vapers from true former vapers (i.e. how "pure" the separation is). The splitting procedure terminates when a node contains fewer than 5 observations. "Pruning" of a tree is often necessary, as a full tree may be so large and complicated that it overfits the data. Hence, a cross-validation method is used to identify the optimal number of splits that minimizes a cost-complexity function, defined as the total number of misclassified cases plus a penalty term for larger tree size. This procedure yields a tree with manageable size and interpretable structure while maintaining its performance.
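To make the splitting criterion concrete, the following is a minimal Python sketch of how a Gini-based threshold search could work for a single continuous variable. It is illustrative only and is not the "rpart" implementation used in the analysis; the function names and the toy data (`ages`, `current`) are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, c):
    """Size-weighted Gini impurity of the two regions X <= c and X > c."""
    left = [y for x, y in zip(values, labels) if x <= c]
    right = [y for x, y in zip(values, labels) if x > c]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Return the observed value c that gives the purest split."""
    # The largest observed value is skipped: it would leave X > c empty.
    candidates = sorted(set(values))[:-1]
    return min(candidates, key=lambda c: split_impurity(values, labels, c))

# Hypothetical toy data (not from the CTADS): age and vaper type
# (1 = current vaper, 0 = former vaper).
ages = [15, 16, 17, 17, 19, 22, 30, 45]
current = [1, 1, 1, 1, 0, 0, 1, 0]

threshold = best_threshold(ages, current)
impurity = split_impurity(ages, current, threshold)
```

On this toy data the search selects c = 17, i.e. the regions age ≤ 17 and age > 17, with a weighted impurity of 0.1875. A full CART implementation repeats this search over every candidate variable at every node and then prunes the resulting tree.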
We used a ratio of 9:1 to randomly split the dataset into a training set (n = 2,753) and a validation set (n = 306). The training set was used to develop and to internally validate the model, while the validation set was used to establish model performance externally, as it comprised independent data not used in model construction. Using the data from the training set, we first performed oversampling because the number of former vapers substantially exceeded that of current vapers (a ratio of about 4:1), raising concerns about outcome imbalance that may impair model performance. Hence, random oversampling with replacement was conducted on current vapers in the training set so that their number matched that of former vapers. We then developed a full tree using data from the oversampled training set to classify current and former vapers with the set of correlates selected from the logistic mixed-effects regression analysis. After that, the pruning procedure was performed using a ten-fold cross-validation method. Classification accuracy, specificity and sensitivity of the pruned tree were calculated during the cross-validation process to establish performance of the model on the training set. Finally, the pruned tree was applied to data from the validation set, and accuracy, specificity and sensitivity were computed to demonstrate the external performance of the model.
We also assessed the performance of two parsimonious trees: one that had only the first split of the full tree and another with the first two splits. This procedure was adopted from a recent machine learning paper that also used a classification tree to predict smoking cessation status, in order to quantify the significance of the top predictors in the model [20]. For both parsimonious tree models, we calculated accuracy, sensitivity and specificity using data from the oversampled training set and from the validation set separately.
Multiple imputation by chained equations [34] was used to address the very small portion of data missing from the dataset (1.0% in total). After visually confirming the assumption of missing at random, five imputed data copies were generated independently, and all analytical procedures were repeated on each of these data copies to compare results with our primary findings (Supplementary Material 1). Analyses were performed in R (version 3.5.1).
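The data-preparation steps described above can be sketched in Python as follows. This is a simplified illustration rather than the code used in the study: the function names, the `seed` arguments and the synthetic `respondents` list are assumptions, and the sketch keeps all original current vapers and adds random draws with replacement until the two classes are equal in size, which is one common way to implement random oversampling.

```python
import random

def split_train_validation(rows, train_fraction=0.9, seed=0):
    """Shuffle the rows and cut them into a training and a validation set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def oversample_current_vapers(train_rows, seed=0):
    """Add random draws (with replacement) of current vapers until the
    number of current vapers equals the number of former vapers."""
    rng = random.Random(seed)
    current = [r for r in train_rows if r["current"] == 1]
    former = [r for r in train_rows if r["current"] == 0]
    shortfall = len(former) - len(current)
    if not current or shortfall <= 0:
        return list(train_rows)  # nothing to oversample
    return list(train_rows) + rng.choices(current, k=shortfall)

# Synthetic stand-in for the analytic sample: 3,059 respondents,
# 556 of whom are current vapers (current = 1). No real data are used.
respondents = [{"id": i, "current": 1 if i < 556 else 0} for i in range(3059)]

train, validation = split_train_validation(respondents)
balanced_train = oversample_current_vapers(train)
```

With 3,059 rows, the 9:1 cut yields 2,753 training rows and 306 validation rows, matching the set sizes reported above. Only the training set is oversampled, so the validation set keeps the original class balance.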
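For reference, the three performance measures can be computed from predicted and observed classes as in the Python sketch below. The function name and the toy vectors `y_true` and `y_pred` are hypothetical and do not reproduce any result from this study; current vapers are treated as the positive class (coded 1).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = current vaper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # current, predicted current
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # former, predicted former
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # former, predicted current
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # current, predicted former

    def ratio(numerator, denominator):
        # Undefined when a class is absent; report NaN instead of failing.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }

# Hypothetical toy example: 3 current and 7 former vapers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example, accuracy is 8/10, sensitivity is 2/3 and specificity is 6/7. The same function can be applied to each tree (1-split, 2-split and full) to compare them on a common footing.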

Sample characteristics
Of the 3,059 Canadians aged 15+ who reported having tried vaping, the average age was 24.4 [standard deviation (SD) 11.0] years; 41.9% were female, 8.5% were aboriginal and 74.3% resided in urban areas (Table 1). A total of 2,503 (81.8%) were former vapers and 556 (18.2%) reported being current vapers. The two groups of vapers differed significantly in their characteristics. Notably, current vapers were on average about 2 years younger (mean age = 22.5 vs. 24.9 years) and had lower education (high school or above: 60.4% vs. 79.9%). They were less likely to be female (37.2% vs. 42.9%), currently working (61.2% vs.

Performance of the classification tree
Twenty-nine correlates were identified by the logistic mixed-effects regression analysis to be potentially important (Table 2; for full results, see Supplementary Material 2).
Table 2 note: The mixed-effects models used province of residence as a random intercept; OR: odds ratio; CI: confidence interval.
Using these correlates, a classification tree (Figure 1) was developed and pruned, with a final form comprising just three predictors: attraction to vaping flavor (yes/no), age (with 18 years chosen as the optimal threshold; age < 18 or age ≥ 18) and believing vaping was less harmful than smoking to oneself (yes/no). Using cross-validation, the accuracy, sensitivity and specificity of this tree model on the training set were 0.71 (95% CI 0.65-0.77), 0.70 (95% CI 0.61-0.76) and 0.71 (95% CI 0.66-0.73), respectively. Applying this model to data from the validation set yielded accuracy, sensitivity and specificity of 0.72, 0.65 and 0.73, respectively.
Figure 1. Classification tree. r_flav = 0: if the reason for using e-cigarette is not attraction to flavor, otherwise = 1; r_hs = 0: if the reason for using e-cigarette is not due to a belief that vaping is less harmful to oneself than cigarette smoking, otherwise = 1

Correlates of current vaping and importance
The tree model used attraction to flavor as the first splitting variable, followed by age as the second splitting variable and believing vaping to be less harmful than smoking to oneself as the third splitting variable. People who reported vaping due to flavors were predicted to have the highest probability (0.64) of being a current vaper. Among those who vaped for other reasons, minors aged < 18 had the second highest probability (0.63) of vaping currently. For adults who did not vape for flavors, the probability of current vaping was as high as 0.60 if they believed vaping to be less harmful to users than cigarette smoking, and as low as 0.26 if they did not hold this belief.
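The pruned tree can be read as a short sequence of nested rules. The Python sketch below encodes the three splits and the leaf probabilities reported above; the function and parameter names are hypothetical, and the 0.5 cut-off used to turn a probability into a predicted class is an assumption rather than a value stated in this article.

```python
def probability_current_vaper(vapes_for_flavor, age, believes_less_harmful):
    """Leaf probability of being a current vaper under the pruned tree."""
    if vapes_for_flavor:        # first split: attraction to flavor (r_flav = 1)
        return 0.64
    if age < 18:                # second split: age threshold of 18 years
        return 0.63
    if believes_less_harmful:   # third split: less harmful to oneself (r_hs = 1)
        return 0.60
    return 0.26

def predicted_vaper_type(vapes_for_flavor, age, believes_less_harmful, cutoff=0.5):
    """Classify as current or former vaper; the 0.5 cut-off is an assumption."""
    p = probability_current_vaper(vapes_for_flavor, age, believes_less_harmful)
    return "current" if p >= cutoff else "former"

# Example: a 25-year-old who does not vape for flavors and does not
# believe vaping is less harmful than smoking.
example = predicted_vaper_type(False, 25, False)
```

Under these rules, only adults who neither vape for flavors nor believe vaping to be less harmful fall into the low-probability leaf (0.26) and are classified as former vapers.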
In order to understand the importance of the top two correlates (attraction to flavor and age), we compared the performance of the full tree to two parsimonious trees that comprised only the first split (attraction to flavor) or the first two splits (attraction to flavor and age; Table 3). Using data from the oversampled training set, classification accuracy and specificity were the highest in the full tree, but sensitivity was the highest in the 2-split tree that used only attraction to flavor and age for prediction (sensitivity = 0.74 vs. 0.70). Similar results were observed on the validation set where the 2-split tree exceeded the full tree in terms of sensitivity (sensitivity = 0.67 vs. 0.65). However, in general, improvement of model performance from a 1-split tree to a 2-split tree to the full tree was minor (Table 3).

Sensitivity analysis
Five imputed data copies were generated using the multiple imputation by chained equation method after visual inspection confirmed the assumption of missing at random (Supplementary Material 1). All analytical procedures were repeated on the five imputed datasets and the same classification tree involving attraction to flavor (first-split), age (second-split) and believing vaping to be less harmful than smoking to oneself (third-split) was reached at each iteration. Hence, we conclude the tree model is generally insensitive to missing data.

Discussion
We applied machine learning to data collected from a nationally representative sample of Canadians aged 15+ with vaping experience to understand correlates of the current use of the device. A classification tree model was developed and validated with reasonably good performance (accuracy of 0.72 on the validation set). This model identified vaping due to attraction to flavors to be the most important correlate of current vaping, followed by young age (< 18) and vaping with a belief that it was less harmful to oneself than cigarette smoking. Furthermore, we found strong predictive power of the first two correlates, as the 2-split tree demonstrated comparable performance with the full tree.
Our findings confirmed the vital role of e-cigarette flavors in vaping behaviours. In our sample, attraction to flavors was the second most commonly reported reason for vaping (36.4%), following curiosity (77.0%). This observation is consistent with a large body of literature suggesting that flavors recruit people, especially young people, to start vaping [12,15,35-38]. Furthermore, we found some evidence that the attraction to flavors may motivate long-term vaping, which adds to the findings of a recent US study that established flavors to be a key part of vaping addiction [15]. These results provide support for bans on flavored e-cigarettes, as seen in some states in the US. In comparison, Canada has been slow to act on curbing the epidemic of flavored vaping. Federal-level regulations have prohibited certain e-cigarette flavors, including those with nondescriptive names (e.g., "Miami Heat") and those suggestive of health benefits (e.g., vitamin flavor) [39]. A few provinces, including Ontario and Prince Edward Island, have announced plans to ban flavored e-cigarettes in 2021, but at this moment the sale of these products is largely legal in Canada [40,41]. Our findings suggest that a similar ban on all flavored vaping products may be effective in Canada at reducing the uptake and continued use of e-cigarettes.
The classification tree algorithm determined 18 years to be the optimal cut-off value for the age variable and suggested that young people aged < 18 had a high probability of being current vapers. There are two possible explanations for this finding. First, some of these young people may have just started vaping and were thereby more likely to be captured as current vapers in the survey. This speculation could be tested by controlling for a variable that measures the history of vaping, such as the age at vaping initiation. However, only 17.5% of our sample reported the age at vaping initiation, which prevented us from conducting any additional analysis. Second, it is possible that young age, or in our case, being a youth (aged 15-17), is indeed an important risk factor for long-term vaping. If so, our findings provide new insights into the profile of chronic vapers, who were previously deemed to be older, heavy smokers who wished to quit [42]. As youth may also continue vaping in the long run, effective programs that help youth vapers to quit at an early phase of e-cigarette use are warranted to reduce their risk of progression to established vaping.

Limitations
Due to the cross-sectional nature of the survey data, we were unable to identify true predictors of current vaping, but rather important correlates. Future studies with access to longitudinal data could overcome this limitation. Next, our analysis depends entirely on self-reported measures, which may introduce recall bias. However, we believe that the CTADS survey mechanism has been carefully constructed to ensure sufficient answer time for interviewees to adequately recall long-term memory. Third, there are other factors that we did not have access to, such as household income and neighborhood characteristics, that may influence the pattern of vaping. Future research with more comprehensive tracking of people with vaping experience, preferably through the use of linked administrative datasets, may provide additional insights. Finally, the data used for this study were collected in 2017, before the enactment of the Tobacco and Vaping Products Act (in May 2018) [43] and the legalization of recreational cannabis (in October 2018) [44] in Canada. Future researchers may leverage more recent data to explore the impact of these new regulations on vaping behaviours.
In conclusion, by using a classification tree, we identified attraction to flavors to be one of the most important correlates of current vaping. This finding is relevant to future development of regulations on e-cigarettes as most flavored vaping products are still legal in Canada. Furthermore, interventions that target youths are needed to prevent their e-cigarette uptake and help those who have already initiated vaping to quit.

Supplementary materials
The supplementary materials for this article are available at: https://www.explorationpub.com/uploads/Article/file/100133_sup_1.pdf.

Declarations

Author contributions
RF and MC contributed to the conception and design of the study; RF obtained access to the dataset, performed the statistical analysis and wrote the first draft of the manuscript; NM contributed to methodology. All authors contributed to manuscript revision, read and approved the submitted version.

Conflicts of interest
The authors declare that they have no conflicts of interest.

Ethical approval
Not applicable.

Consent to participate
Not applicable.

Consent to publication
Not applicable.

Availability of data and materials
The dataset analyzed for this study is maintained by Computing in the Humanities and Social Sciences (CHASS) at the University of Toronto and is available from: http://datacentre.chass.utoronto.ca/

Funding
This work was supported by the Canadian Institutes of Health Research, Catalyst Grant #172898. The funder had no role in the study design, collection, analysis or interpretation of the data, writing the manuscript, or the decision to submit the paper for publication.