<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Explor Drug Sci</journal-id>
<journal-id journal-id-type="publisher-id">EDS</journal-id>
<journal-title-group>
<journal-title>Exploration of Drug Science</journal-title>
</journal-title-group>
<issn pub-type="epub">2836-7677</issn>
<publisher>
<publisher-name>Open Exploration Publishing</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.37349/eds.2023.00026</article-id>
<article-id pub-id-type="manuscript">100826</article-id>
<article-categories>
<subj-group>
<subject>Original Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Creation and interpretation of machine learning models for aqueous solubility prediction</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-5830-059X</contrib-id>
<name>
<surname>Su</surname>
<given-names>Minyi</given-names>
</name>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/">Conceptualization</role>
<role content-type="https://credit.niso.org/contributor-roles/investigation/">Investigation</role>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/">Data curation</role>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/">Writing—original draft</role>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing—review &amp; editing</role>
<xref ref-type="aff" rid="I1" />
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-7837-3593</contrib-id>
<name>
<surname>Herrero</surname>
<given-names>Enric</given-names>
</name>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/">Conceptualization</role>
<role content-type="https://credit.niso.org/contributor-roles/supervision/">Supervision</role>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing—review &amp; editing</role>
<xref ref-type="aff" rid="I1" />
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="editor">
<name>
<surname>de Azevedo Jr.</surname>
<given-names>Walter Filgueira</given-names>
</name>
<role>Academic Editor</role>
<aff>Pontifical Catholic University of Rio Grande do Sul, Brazil</aff>
</contrib>
</contrib-group>
<aff id="I1">Pharmacelera, 08028 Barcelona, Spain</aff>
<author-notes>
<corresp id="cor1">
<bold>
<sup>*</sup>Correspondence:</bold> Enric Herrero, Pharmacelera, 08028 Barcelona, Spain. <email>enric.herrero@pharmacelera.com</email></corresp>
</author-notes>
<pub-date pub-type="ppub">
<year>2023</year>
</pub-date>
<pub-date pub-type="epub">
<day>30</day>
<month>10</month>
<year>2023</year>
</pub-date>
<volume>1</volume>
<issue>5</issue>
<fpage>388</fpage>
<lpage>404</lpage>
<history>
<date date-type="received">
<day>14</day>
<month>02</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>06</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2023.</copyright-statement>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This is an Open Access article licensed under a Creative Commons Attribution 4.0 International License (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.</license-p>
</license>
</permissions>
<abstract>
<sec>
<title>Aim:</title>
<p>Solubility prediction is essential in rational drug design, and many models have been developed with machine learning (ML) methods to improve predictive ability. However, most ML models are hard to interpret, which limits the insight they can provide during lead optimization. Here, an approach to construct and interpret solubility models with a combination of physicochemical properties and ML algorithms is presented.</p>
</sec>
<sec><title>Methods:</title>
<p>The models were trained, optimized, and tested on a dataset containing 12,983 compounds from two public datasets and further evaluated on two external test sets. In addition, the SHapley Additive exPlanations (SHAP) and heat map coloring approaches were used to explain the predictive models and assess their suitability to guide compound optimization.</p>
</sec>
<sec><title>Results:</title>
<p>Among the different ML methods, random forest (RF) models obtained the best performance on the different test sets. From the interpretability perspective, fragment-based coloring offers a more robust interpretation than atom-based coloring, and normalizing the values further improves it.</p>
</sec>
<sec>
<title>Conclusions:</title>
<p>Overall, for certain applications simple ML algorithms such as RF work well and can outperform more complex methods, and combining them with fragment-coloring can guide chemists in modifying a structure toward a desired property. This interpretation strategy is publicly available at <ext-link ext-link-type="uri" xlink:href="https://github.com/Pharmacelera/predictive-model-coloring">https://github.com/Pharmacelera/predictive-model-coloring</ext-link> and could be applied to other property predictions to improve the interpretability of ML models.</p>
</sec>
</abstract>
<kwd-group>
<kwd>Aqueous solubility</kwd>
<kwd>machine learning</kwd>
<kwd>fragment-coloring</kwd>
<kwd>property prediction</kwd>
</kwd-group></article-meta>
</front>
<body>
<sec id="s1"><title>Introduction</title>
<p>Aqueous solubility is a key molecular property for the discovery and optimization of new drugs. In the early stage of drug discovery, low molecular solubility is a relevant attrition factor in screening assays. Moreover, solubility has an important impact on the Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) properties of drugs, such as oral absorption and bioavailability &#x0005B;<xref ref-type="bibr" rid="B1">1</xref>&#x0005D;. With the advent of novel machine learning (ML) algorithms and libraries, the performance of solubility predictors has increased significantly, but solubility prediction remains an open field of research &#x0005B;<xref ref-type="bibr" rid="B2">2</xref>&#x0005D;.</p>
<p>However, there are several open questions in the generation of ML models that go beyond the predictive performance of the models themselves. One of them is model interpretability, which could provide helpful information to researchers in the lead optimization process. ML models often operate as a black box, which, combined with the use of many descriptors (&#x0003E; 100), makes them difficult to interpret &#x0005B;<xref ref-type="bibr" rid="B2">2</xref>&#x02013;<xref ref-type="bibr" rid="B4">4</xref>&#x0005D;. While many efforts have been made to improve the accuracy of ML models, model interpretation is still under investigation. In the field of model interpretation, there are many model-dependent or -independent strategies, such as feature-based, atom-based, fragment-based, compound-based, or graph-based approaches &#x0005B;<xref ref-type="bibr" rid="B5">5</xref>&#x0005D;. These approaches help researchers understand how a change in the descriptors or the chemical structure could affect the prediction. Since solubility changes can be understood in most cases by the addition or deletion of polar or non-polar atoms, solubility models are a good benchmark for validating interpretation methods.</p>
<p>Another important aspect when building ML models is the selection of the most appropriate descriptors and algorithms, since the most complex and novel methods are not always the most suitable for every application scenario. Any increase in complexity should be justified by a significant increase in performance that compensates for the penalty it introduces in usability and interpretability.</p>
<p>In this work, we assess which descriptors and ML algorithms yield an accurate aqueous solubility predictor and which methods are best suited to interpreting it. The performance of different ML models is compared to that of existing models on different test sets. Three interpretation approaches (feature-based, atom-based, and fragment-based) are then employed to interpret the solubility model.</p>
</sec>
<sec id="s2"><title>Materials and methods</title>
<sec><title>Dataset preparation</title>
<p>To compile a large and diverse dataset for model building, two datasets with experimental aqueous solubility values (LogS) were used. The first, AqSolDB &#x0005B;<xref ref-type="bibr" rid="B6">6</xref>&#x0005D;, consists of 9,982 compounds and was generated by merging nine different aqueous solubility datasets. The second, collected by Cui et al. &#x0005B;<xref ref-type="bibr" rid="B7">7</xref>&#x0005D;, includes 9,943 compounds from the ChemIDplus database and a PubMed search. These two datasets were merged as the source of the training, validation, and test sets used in this study. In addition, two external test sets were used to further evaluate model performance &#x0005B;<xref ref-type="bibr" rid="B7">7</xref>&#x02013;<xref ref-type="bibr" rid="B10">10</xref>&#x0005D;: the Drug-Like Solubility-100 (DLS-100) dataset from Mitchell et al. &#x0005B;<xref ref-type="bibr" rid="B10">10</xref>&#x0005D; (external test set A, 100 compounds) and the test set collected by Cui et al. &#x0005B;<xref ref-type="bibr" rid="B7">7</xref>&#x0005D; (external test set B, 62 compounds).</p>
<p>The merged dataset was prepared using the following methodology. First, molecules containing common elements (H, C, N, O, F, P, S, Cl, Br, and I) were kept, while duplicates and large compounds (molecular weight &#x02265; 1,000) were removed. Then, molecules with a standard deviation of LogS greater than or equal to 0.5 in AqSolDB were removed. Finally, compounds highly similar to samples in the two external test sets were also filtered out so that the model could be validated more objectively. Herein, highly similar compounds were defined as those having a Tanimoto similarity, based on extended connectivity fingerprints (ECFP) 4, larger than 0.90. A total of 12,983 molecules were retained and then randomly split into the training set, validation set, and test set in proportions of 60&#x00025;, 20&#x00025;, and 20&#x00025;, respectively. Overall, five datasets were employed in this study, namely the training set with 7,789 compounds, the validation set with 2,597 compounds, the test set with 2,597 compounds, external test set A with 100 compounds, and external test set B with 62 compounds.</p>
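<p>The similarity filter and the random split described above can be sketched as follows. This is a minimal illustration and not the authors' code: fingerprints are represented as plain sets of on-bit indices instead of real ECFP4 vectors, and the function names, the fixed seed, and the rounding convention (floor of 60&#x00025; for training, the remainder halved) are assumptions made for the example.</p>
<code language="python"><![CDATA[
```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def drop_similar(candidates, external, threshold=0.90):
    """Keep candidates whose similarity to every external fingerprint is <= threshold."""
    return [fp for fp in candidates
            if all(tanimoto(fp, ext) <= threshold for ext in external)]


def split_60_20_20(items, seed=0):
    """Shuffle and split into training, validation, and test subsets.

    Assumed rounding convention: floor(60%) for training, then the
    remainder is divided equally between validation and test.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.6 * len(shuffled))
    n_val = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Toy stand-ins for the 12,983 retained compounds (indices only).
train, val, test = split_60_20_20(range(12983))
print(len(train), len(val), len(test))
```
]]></code>
<p>Under this rounding convention, 12,983 compounds yield subsets of 7,789, 2,597, and 2,597 compounds.</p>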
</sec>
<sec><title>Descriptors</title>
<p>After curating the dataset, a set of physicochemical descriptors, computed with the PyDPI software &#x0005B;<xref ref-type="bibr" rid="B11">11</xref>&#x0005D;, was used to featurize each compound. PyDPI can represent molecules by means of different types of molecular descriptors, including constitutional descriptors, topological descriptors, connectivity indices, Burden descriptors, Basak&#x02019;s information indices, electro-topological state indices, autocorrelation descriptors, charge descriptors, molecular properties, kappa shape indices, and molecular operating environment-type descriptors.</p>
<p>Originally, a total of 614 descriptors were computed for each compound. Descriptors with zero variance across the training set were removed first. For further selection, a pairwise Pearson correlation analysis was performed on the descriptors; when two descriptors were highly correlated (Pearson correlation coefficient &#x02265; 0.90), one of them was chosen at random and kept. Overall, 256 descriptors were retained for model construction and scaled to the range from 0 to 1 (<xref ref-type="sec" rid="s5">Table S1</xref>). To visualize whether these descriptors capture distinct aspects of chemical structures, principal component analysis (PCA) &#x0005B;<xref ref-type="bibr" rid="B12">12</xref>&#x0005D;, which projects high-dimensional data into a low-dimensional space, was performed on the training set, validation set, and test set. The feature space was inspected visually by plotting the first three principal components (PC).</p>
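<p>A minimal sketch of the descriptor selection steps is given below. It is illustrative only: descriptors are passed as a dictionary of value lists, the function names are hypothetical, and, unlike the random choice described above, the sketch deterministically keeps the first descriptor of each highly correlated pair so that the result is reproducible.</p>
<code language="python"><![CDATA[
```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def select_descriptors(columns, threshold=0.90):
    """Return the names of descriptors kept after both filters.

    columns: dict mapping descriptor name -> values on the training set.
    Assumption: of two correlated descriptors, the first one seen is kept.
    """
    # Step 1: remove descriptors with zero variance.
    varying = {name: vals for name, vals in columns.items()
               if max(vals) != min(vals)}
    # Step 2: keep a descriptor only if it is not highly correlated
    # (coefficient >= threshold) with any descriptor already kept.
    kept = []
    for name, vals in varying.items():
        if all(pearson(vals, varying[other]) < threshold for other in kept):
            kept.append(name)
    return kept


def min_max_scale(values):
    """Scale a non-constant descriptor to the range from 0 to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 5, 5, 5], "d": [1, 0, 1, 0]}
print(select_descriptors(toy))
```
]]></code>
<p>In the toy input, the constant descriptor is dropped by the variance filter and the second descriptor is dropped because it is perfectly correlated with the first.</p>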
</sec>
<sec><title>Model construction</title>
<p>In this study, three ML techniques were employed to build and select a good predictive model: random forest (RF), deep neural network (DNN), and message passing neural network (MPNN). In addition, four other solubility models were used as references for performance evaluation.</p>
<sec><title>RF</title>
<p>RF &#x0005B;<xref ref-type="bibr" rid="B13">13</xref>&#x02013;<xref ref-type="bibr" rid="B15">15</xref>&#x0005D; is a supervised learning algorithm that combines many decision trees into an ensemble. The general idea of RF is to train multiple decision trees on different subsets sampled from the original training set and then merge the predictions of the individual trees by averaging or voting. By combining models built on random samples and random subsets of features, this popular ensemble approach achieves more accurate and robust performance and mitigates the common overfitting problem. In our study, the hyperparameters of RF were optimized based on the root mean squared error (RMSE) in the validation set. If RMSE values were the same, the coefficient of determination (<italic>R</italic><sup>2</sup>) was used as a tie-breaker. In the final model, the number of trees in the forest (&#x0201C;n_estimators&#x0201D;) was 600, the fraction of features to consider when looking for the best split (&#x0201C;max_features&#x0201D;) was 0.2, and the out-of-bag strategy was applied (&#x0201C;oob_score &#x0003D; True&#x0201D;). This model was built with the scikit-learn Python library (version 1.0.2) &#x0005B;<xref ref-type="bibr" rid="B16">16</xref>&#x0005D;.</p>
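<p>The selection rule used for hyperparameter optimization (lowest validation RMSE, with <italic>R</italic><sup>2</sup> as a tie-breaker) can be expressed compactly. The sketch below is an assumption about how such a rule might be coded; the function name is hypothetical and the RMSE and <italic>R</italic><sup>2</sup> values are invented for illustration and are not results of this study.</p>
<code language="python"><![CDATA[
```python
def pick_best(trials):
    """Select the trial with the lowest validation RMSE.

    Ties in RMSE are broken by the highest R2, which is expressed here
    by sorting on the pair (rmse, -r2).
    """
    return min(trials, key=lambda trial: (trial["rmse"], -trial["r2"]))


# Invented validation scores, for illustration only.
trials = [
    {"params": {"n_estimators": 200, "max_features": 0.2}, "rmse": 0.80, "r2": 0.85},
    {"params": {"n_estimators": 600, "max_features": 0.2}, "rmse": 0.78, "r2": 0.87},
    {"params": {"n_estimators": 600, "max_features": 0.5}, "rmse": 0.78, "r2": 0.86},
]

best = pick_best(trials)
print(best["params"])
```
]]></code>
<p>The second and third trials share the same RMSE, so the tie-breaker selects the one with the higher <italic>R</italic><sup>2</sup>.</p>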
</sec>
<sec><title>DNN</title>
<p>DNN &#x0005B;<xref ref-type="bibr" rid="B17">17</xref>, <xref ref-type="bibr" rid="B18">18</xref>&#x0005D; is a feed-forward neural network that consists of one input layer, multiple hidden layers, and one output layer. Normally, the descriptors are fed into the input layer, non-linear transformations are applied in the hidden layers, and finally a prediction is produced by the output layer. Weights and biases in each layer are trained using the back-propagation technique. The architecture of the DNN model used in our study is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. The DNN model had five hidden layers, consisting of 1,024, 1,024, 512, 512, and 256 nodes, respectively. The rectified linear unit (ReLU) function was chosen as the activation function. An Adam weight optimization solver was used, and the learning rate was initialized to 0.001 and decayed by a factor of 0.8 every 5 epochs &#x0005B;<xref ref-type="bibr" rid="B19">19</xref>&#x0005D;. The batch gradient descent strategy was employed to train the DNN model for a maximum of 300 epochs. Model optimization was performed with an early stopping strategy based on the best results in the validation set to avoid overfitting. The patience, that is, the number of epochs to wait without progress on the validation set before stopping early, was set to 15. Three dropout layers were used to further reduce overfitting of the DNN model. The model was built with the PyTorch framework (version 1.10.2) &#x0005B;<xref ref-type="bibr" rid="B20">20</xref>, <xref ref-type="bibr" rid="B21">21</xref>&#x0005D;.</p>
<fig id="F1" position="float"><label>Figure 1.</label><caption><p>The architecture of the (A) 5-layer DNN and (B) MPNN models</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g001.tif"/></fig>
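<p>The step-wise learning rate decay and the early stopping criterion described for the DNN model can be sketched without any deep learning framework. The function names are hypothetical, and the exact stopping semantics (counting consecutive epochs without a new best validation loss) are an assumption rather than a description of the authors' implementation.</p>
<code language="python"><![CDATA[
```python
def learning_rate(epoch, initial=0.001, factor=0.8, step=5):
    """Learning rate after step-wise decay: multiplied by `factor` every `step` epochs."""
    return initial * factor ** (epoch // step)


def early_stop_epoch(val_losses, patience=15):
    """Return the index of the epoch at which training would stop.

    Assumption: training stops once `patience` consecutive epochs have
    passed without a new best validation loss; otherwise it runs to the
    last epoch provided.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1


for epoch in (0, 5, 10):
    print(epoch, learning_rate(epoch))
```
]]></code>
<p>With the settings reported for the MPNN model, the same functions would be called with a factor of 0.9, a step of 3, and a patience of 7.</p>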
</sec>
<sec><title>MPNN</title>
<p>An MPNN model &#x0005B;<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B23">23</xref>&#x0005D; treats a molecule as a graph in which each atom is a node and each bond is an edge. An MPNN model usually contains three phases: an initial phase, a message-passing phase, and a readout phase. The nodes (atoms) and edges (bonds) are first initialized with atom features <italic>x<sub>v</sub></italic> and bond features <italic>e<sub>vw</sub></italic>, respectively, which are listed in <xref ref-type="fig" rid="F1">Figure 1</xref> and <xref ref-type="sec" rid="s5">Table S2</xref>. The message passing phase consists of <italic>T</italic> steps, with <italic>T</italic> set to 3 in this work. On each step <italic>t</italic> for each node <italic>v</italic>, its hidden state 
<inline-formula><mml:math id="m1" display="inline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>
 is updated to 
<inline-formula><mml:math id="m2" display="inline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>
 by passing the message 
<inline-formula><mml:math id="m3" display="inline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>
 gathered from its neighbors (bonded atoms) and the connecting edges, using a message function <italic>M<sub>t</sub></italic> and an update function <italic>U<sub>t</sub></italic>:
<disp-formula><mml:math id="m4" display='block'><mml:mrow><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>N</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>w</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<disp-formula><mml:math id="m5" display='block'><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>U</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula>
where <italic>N(v)</italic> is the set of neighbor nodes of <italic>v</italic>. For simplicity, 
<inline-formula><mml:math id="m6" display="inline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mn>0</mml:mn></mml:msubsup></mml:mrow></mml:math></inline-formula>
 was set to <italic>x<sub>v</sub></italic> in this study. Finally, a readout function <italic>R</italic> is used to make a prediction based on the final states 
<inline-formula><mml:math id="m7" display="inline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mi>T</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>
. In our model, the readout phase was implemented by summing the concatenation of the initial and final states of all nodes in a molecule. The model then predicts solubility by feeding the resulting vector <italic>h</italic> into a 2-layer neural network <italic>f</italic>, as follows:
<disp-formula><mml:math id="m8" display='block'><mml:mrow><mml:mi>h</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>v</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mn>0</mml:mn></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow> <mml:mi>v</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<disp-formula><mml:math id="m9" display='block'><mml:mrow><mml:mover accent='true'><mml:mi>y</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>h</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula>
</p>
<p>The MPNN model was trained by mini-batch gradient descent with a batch size of 128 and optimized with an Adam optimizer. The learning rate was initialized to 0.001 and decayed by a factor of 0.9 every 3 epochs. The model was also optimized with the early stopping strategy, with the patience parameter set to 7. In this study, the MPNN model was implemented based on the framework of the MPNN-S variant &#x0005B;<xref ref-type="bibr" rid="B24">24</xref>&#x0005D; and PyTorch &#x0005B;<xref ref-type="bibr" rid="B21">21</xref>&#x0005D;.</p>
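<p>To make the message passing and readout equations concrete, the following sketch applies them to a toy graph. It is not the trained model: the learned functions <italic>M<sub>t</sub></italic> and <italic>U<sub>t</sub></italic> are replaced by fixed stand-ins (the message is the neighbor state scaled by a scalar bond feature, and the update adds the message to the current state), and all names are hypothetical.</p>
<code language="python"><![CDATA[
```python
def message_passing(features, edges, steps=3):
    """Run `steps` rounds of message passing on an undirected graph.

    features: dict node -> list of floats (atom features x_v)
    edges:    dict (v, w) -> float (a scalar stand-in for bond features e_vw)
    Stand-ins (assumptions): M_t(h_v, h_w, e_vw) = e_vw * h_w
                             U_t(h_v, m)         = h_v + m
    """
    neighbors = {v: [] for v in features}
    for (v, w), e_vw in edges.items():
        neighbors[v].append((w, e_vw))
        neighbors[w].append((v, e_vw))

    hidden = {v: list(x) for v, x in features.items()}  # h_v^0 = x_v
    for _ in range(steps):
        updated = {}
        for v, h_v in hidden.items():
            # m_v^{t+1}: sum of messages from all neighbors w in N(v)
            message = [0.0] * len(h_v)
            for w, e_vw in neighbors[v]:
                for i, value in enumerate(hidden[w]):
                    message[i] += e_vw * value
            # h_v^{t+1} = U_t(h_v^t, m_v^{t+1})
            updated[v] = [a + b for a, b in zip(h_v, message)]
        hidden = updated
    return hidden


def readout(features, hidden):
    """h = sum over all nodes of cat(h_v^0, h_v^T)."""
    size = len(next(iter(features.values())))
    total = [0.0] * (2 * size)
    for v in features:
        for i, value in enumerate(list(features[v]) + hidden[v]):
            total[i] += value
    return total


atoms = {0: [1.0], 1: [2.0]}       # two bonded atoms with one feature each
bonds = {(0, 1): 1.0}
final_states = message_passing(atoms, bonds, steps=3)
print(final_states, readout(atoms, final_states))
```
]]></code>
<p>In the actual model the vector returned by the readout would be passed to the 2-layer network to produce the predicted LogS; that step is omitted here.</p>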
</sec>
<sec><title>Baseline models</title>
<p>To benchmark our model performance, four reference models were used: two ML models built in-house and two publicly available models.</p>
<p>The first reference model was constructed with the graph convolutional (GraphConv) method &#x0005B;<xref ref-type="bibr" rid="B25">25</xref>&#x0005D;. Similar to the MPNN model, the GraphConv model treats the chemical structure as a graph and represents the graph with atom-based and bond-based properties. Convolutional and pooling layers are then used to update the information of each node by aggregating the information of its connected nodes. In this study, we built a GraphConv model by applying the default implementation from the DeepChem library (version 2.4.0) &#x0005B;<xref ref-type="bibr" rid="B26">26</xref>&#x0005D;, which contains one GraphConv layer and one dense layer. The number of training epochs was optimized based on the performance in the validation set and was finally set to 1,500.</p>
<p>In addition, two public models, namely ALOGPS 2.1 &#x0005B;<xref ref-type="bibr" rid="B27">27</xref>&#x0005D; and the ESOL equation &#x0005B;<xref ref-type="bibr" rid="B28">28</xref>&#x0005D;, were included for the performance comparison in this study. ALOGPS 2.1 employs molecular weights and electro-topological state indices as descriptors and neural network techniques for model construction. The ESOL equation is a simple linear regression model based on four descriptors: LogP, molecular weight, number of rotatable bonds, and proportion of heavy atoms in aromatic systems.</p>
<p>Furthermore, the four descriptors from the ESOL equation were combined with the RF algorithm to build a second in-house reference model &#x0005B;RF with ESOL descriptors (RF_ESOL)&#x0005D;. It was trained on our training set, and its hyperparameters were optimized on the validation set. The hyperparameters &#x0201C;n_estimators&#x0201D; and &#x0201C;min_samples_split&#x0201D; of this RF_ESOL model were set to 800 and 4, respectively. The scikit-learn Python library (version 0.23.0) was employed to build this regression model.</p>
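<p>For orientation, the linear form of the ESOL model can be written as a short function of its four descriptors. The coefficients below are those commonly quoted for the original ESOL fit; they are not taken from the present study and should be verified against the original publication before reuse. The function and argument names are hypothetical.</p>
<code language="python"><![CDATA[
```python
def esol_log_s(clogp, mol_weight, rotatable_bonds, aromatic_proportion):
    """Estimate LogS with a four-descriptor linear model of the ESOL form.

    Coefficients are the commonly quoted values of the original ESOL fit
    (an assumption here, not re-derived in this study).
    """
    return (0.16
            - 0.63 * clogp
            - 0.0062 * mol_weight
            + 0.066 * rotatable_bonds
            - 0.74 * aromatic_proportion)


# Hypothetical descriptor values, for illustration only.
print(esol_log_s(clogp=2.0, mol_weight=100.0, rotatable_bonds=1, aromatic_proportion=0.5))
```
]]></code>
<p>The RF_ESOL reference model uses the same four inputs but replaces the linear combination with an RF regressor.</p>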
</sec>
</sec>
<sec><title>Evaluation metrics</title>
<p>The predictive performance of our solubility models was assessed by four metrics: <italic>R</italic><sup>2</sup>, RMSE, &#x00025;LogS &#x000B1; 0.7, and &#x00025;LogS &#x000B1; 1.0. See Supplementary materials for the definitions of <italic>R</italic><sup>2</sup> and RMSE. The other two metrics, &#x00025;LogS &#x000B1; 0.7 and &#x00025;LogS &#x000B1; 1.0, were proposed by Boobier et al. &#x0005B;<xref ref-type="bibr" rid="B2">2</xref>&#x0005D; and are defined below.</p>
<p>The &#x00025;LogS &#x000B1; 0.7 is defined as the percentage of compounds where the predicted LogS is in the range of experimental LogS &#x000B1; 0.7. The &#x00025;LogS &#x000B1; 1.0 is defined as the percentage of compounds where the predicted LogS is in the range of experimental LogS &#x000B1; 1.0.</p>
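<p>The four metrics can be computed as in the following sketch. The percentage metrics follow the definitions given above; for RMSE and <italic>R</italic><sup>2</sup> the standard definitions are assumed, since the definitions used in this study are given in the Supplementary materials. The function names and the toy values are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error (standard definition, assumed)."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (standard definition, assumed)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pct_within(y_true, y_pred, tolerance):
    """Percentage of compounds with predicted LogS within experimental LogS +/- tolerance."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(p - t) <= tolerance)
    return 100.0 * hits / len(y_true)


# Invented LogS values, for illustration only.
experimental = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -3.8, -5.2]
print(pct_within(experimental, predicted, 0.7), pct_within(experimental, predicted, 1.0))
```
]]></code>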
<p>The rationale for these two metrics is that an experimental error of &#x000B1; 0.5&#x02013;0.7 exists for aqueous LogS values in the literature &#x0005B;<xref ref-type="bibr" rid="B29">29</xref>&#x0005D;, resulting from variations in temperature, pH, and solvent purity. This error affects the reliability of <italic>R</italic><sup>2</sup> and RMSE in evaluating model performance, as both depend on the range of LogS in the model. Considering the effect of experimental error, &#x00025;LogS &#x000B1; 0.7 helps users understand the maximum accuracy of the model, while &#x00025;LogS &#x000B1; 1.0 sets a limit on the usefulness of the model for the development process.</p>
<p>As the test sets contained only a limited number of samples and the LogS values carry unavoidable experimental errors, the evaluation results may be biased. Thus, the bootstrapping method &#x0005B;<xref ref-type="bibr" rid="B30">30</xref>&#x02013;<xref ref-type="bibr" rid="B32">32</xref>&#x0005D; was chosen for the analysis of the confidence interval, as it is a convenient and recommended strategy to estimate the properties of estimators for any distribution with limited samples. In brief, the bootstrap sampling in our study was conducted as follows. A total of 10,000 redundant copies were drawn from the test set by random sampling with replacement. Each copy had the same size as the original test set; that is, the sample sizes for the test set and external test sets A and B were 2,597, 100, and 62, respectively. Then, the developed model was re-evaluated on each redundant copy of the test set with three performance metrics. As a result, an ensemble of 10,000 bootstrap samples was obtained for each performance metric, and a confidence interval (e.g., 95&#x00025;) was derived accordingly. In this study, the percentile bootstrap method was used to compute the 95&#x00025; confidence interval.</p>
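<p>A percentile bootstrap of this kind can be sketched as follows. The function names, the fixed random seed, and the way the percentile positions are rounded to list indices are assumptions made for the example; RMSE is used as the example metric.</p>
<code language="python"><![CDATA[
```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error (standard definition, assumed)."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def percentile_bootstrap_ci(y_true, y_pred, metric, n_boot=10000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `metric`.

    Each bootstrap copy has the same size as the original set and is
    drawn with replacement. Index rounding is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lower = stats[int(alpha * n_boot)]
    upper = stats[min(n_boot - 1, int((1.0 - alpha) * n_boot))]
    return lower, upper


# Invented LogS values, for illustration only; fewer resamples keep the demo fast.
experimental = [-1.0, -2.0, -3.0, -4.0, -5.0, -6.0]
predicted = [-1.4, -2.1, -2.5, -4.9, -5.2, -6.3]
low, high = percentile_bootstrap_ci(experimental, predicted, rmse, n_boot=500)
print(low, high)
```
]]></code>
<p>The same function can be reused for the other metrics by passing a different metric function.</p>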
</sec>
<sec><title>Model interpretation methods</title>
<p>In terms of model interpretation, different methods were evaluated, namely SHapley Additive exPlanations (SHAP) and heat map coloring. The SHAP method &#x0005B;<xref ref-type="bibr" rid="B33">33</xref>&#x0005D; is a feature-based interpretation method that originated from a game theory approach &#x0005B;<xref ref-type="bibr" rid="B34">34</xref>&#x0005D;. It is a local interpretation approach that can explain the feature importance for an individual instance or a group of instances for any ML model. The computed SHAP value for a specific feature represents both the magnitude and the direction of its contribution to the prediction. A feature with a positive SHAP value contributes positively to the model prediction, while a negative value indicates a negative contribution. The work of Rodr&#x000ED;guez-P&#x000E9;rez and Bajorath &#x0005B;<xref ref-type="bibr" rid="B35">35</xref>&#x0005D; in 2020 showed a promising application of SHAP analysis in ML model interpretation. Several variants of SHAP exist; TreeSHAP &#x0005B;<xref ref-type="bibr" rid="B36">36</xref>&#x0005D; is used to interpret our RF model in this study, as it is a fast, tree-based, model-specific method for producing feature attributions.</p>
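<p>The meaning of the sign and magnitude of a Shapley value can be illustrated with an exact, brute-force computation on a small model. This is not TreeSHAP, which obtains attributions efficiently for tree ensembles; the sketch enumerates all feature subsets, and the choice of replacing absent features by a baseline value, as well as all names and the toy model, are assumptions.</p>
<code language="python"><![CDATA[
```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of `x` for `model`.

    Features outside a coalition take their `baseline` value (assumption).
    Cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(v):
    """Hypothetical model with two additive terms and one interaction."""
    return 2.0 * v[0] - 3.0 * v[1] + v[0] * v[2]


phi = shapley_values(toy_model, x=[1.0, 1.0, 2.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```
]]></code>
<p>The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved.</p>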
<p>The heat map coloring strategy is usually applied to color the atomic or fragmental contributions to a molecular property on a two-dimensional (2D) structure, providing the chemist with a direct interpretative visualization. To compute the atom-level or fragment-level importance in a given prediction, the descriptors associated with an atom or fragment are removed, and the resulting change in the prediction is attributed to the removed atom or fragment. This method is therefore also known as atom removal explanation. Similarity maps &#x0005B;<xref ref-type="bibr" rid="B37">37</xref>&#x0005D;, the universal approach &#x0005B;<xref ref-type="bibr" rid="B38">38</xref>&#x0005D;, and the atom-coloring scheme &#x0005B;<xref ref-type="bibr" rid="B39">39</xref>&#x0005D; are different implementations of this strategy. In this study, we followed the atom-coloring scheme framework to compute the atom or fragment contribution on a chemical structure. The protocol for atom-coloring and fragment-coloring used in this study is shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. Specifically, we mask each heavy atom, or the atoms of each fragment, as dummy atom(s) &#x0005B;<xref ref-type="bibr" rid="B40">40</xref>&#x0005D;, transform each bond between dummy atoms into a zero bond, and transform each bond between a non-dummy and a dummy atom into a single bond. This differs from other atom removal methods such as the atom-coloring method, in which the removed atoms are replaced by a sodium atom. Bonds are also treated differently than in the universal approach, which proposes removing the bonds between the interpreted fragment and the remaining structure. The idea behind this masking strategy is to account for nonadditive effects by letting the masked molecule inherit as much structural information as possible (such as the links between atoms) from the reference (unmasked) molecule. 
The dummy atom has &#x0201C;blank properties&#x0201D; (zero molecular weight and formal charge), which helps minimize the inherent impact of the replacement atom on the masked molecule. We then recalculate the descriptors, predict the solubility of the masked molecule, and calculate the difference in predicted LogS between the masked and unmasked molecules, assigning this difference as the contribution of the atom or fragment to the molecule. The interpretation image is drawn with the open-source software RDKit &#x0005B;<xref ref-type="bibr" rid="B41">41</xref>&#x0005D;. We provide a script for automatically fragmenting a molecule into functional groups, rings, and other fragments; chemists can also fragment it manually to suit their own study.</p>
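<p>The logic of the masking protocol can be sketched independently of any cheminformatics toolkit. In the sketch below, a molecule is reduced to a list of fragment labels, the dummy atom is a placeholder label, and the predictor is an invented additive scoring function; none of these reflect the actual descriptors or RF model. The sign convention (unmasked prediction minus masked prediction, so that a positive value means the fragment raises the predicted LogS) is also an assumption.</p>
<code language="python"><![CDATA[
```python
def fragment_contributions(fragments, predict, dummy="*"):
    """Contribution of each fragment obtained by masking it with a dummy.

    The molecule keeps its length and connectivity order; only the masked
    position changes. Contribution = predict(unmasked) - predict(masked)
    (sign convention assumed for this sketch).
    """
    reference = predict(fragments)
    contributions = []
    for i in range(len(fragments)):
        masked = fragments[:i] + [dummy] + fragments[i + 1:]
        contributions.append(reference - predict(masked))
    return contributions


def toy_predict(fragments):
    """Hypothetical LogS predictor: a sum of invented fragment scores."""
    scores = {"OH": 1.0, "phenyl": -2.0, "CH3": -0.5, "*": 0.0}
    return sum(scores[label] for label in fragments)


molecule = ["phenyl", "OH"]
print(fragment_contributions(molecule, toy_predict))
```
]]></code>
<p>In the real protocol the masked structure is passed through descriptor calculation and the trained model, so contributions need not be additive as they are in this toy predictor.</p>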
<fig id="F2" position="float"><label>Figure 2.</label><caption><p>General protocol of heat map coloring. (A) Atom-coloring scheme; (B) fragment-coloring scheme</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g002.tif"/></fig>
<p>For a clearer visualization of the interpretation results, single molecule normalization was performed for all contribution values of atoms or fragments. Herein, single molecule normalization enables us to see small differences between atoms/fragments of a compound. The normalized contribution of an atom or fragment <italic>i</italic> (
<inline-formula><mml:math id="m10" display="inline"><mml:mrow><mml:msubsup><mml:mrow><mml:mtext>&#x00394;</mml:mtext></mml:mrow> <mml:mi>i</mml:mi><mml:mo>'</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>) was computed as:
<disp-formula><mml:math id="m11" display='block'><mml:mrow><mml:msubsup><mml:mrow><mml:mtext>&#x00394;</mml:mtext></mml:mrow> <mml:mi>i</mml:mi><mml:mo>'</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mtext>&#x00394;</mml:mtext></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mrow><mml:mtext>&#x00394;</mml:mtext></mml:mrow><mml:mrow><mml:mtext>min</mml:mtext></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext>&#x00394;</mml:mtext></mml:mrow><mml:mrow><mml:mtext>max</mml:mtext></mml:mrow></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mrow><mml:mtext>&#x00394;</mml:mtext></mml:mrow><mml:mrow><mml:mtext>min</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mtext>max</mml:mtext></mml:mrow></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mtext>min</mml:mtext></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mtext>min</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>
where &#x00394;<sub><italic>i</italic></sub> is the contribution value of atom or fragment <italic>i</italic>, and &#x00394;<italic><sub>max</sub></italic> and &#x00394;<italic><sub>min</sub></italic> are the maximum and minimum contribution values found in a compound, respectively. Finally, <italic>I<sub>max</sub></italic> and <italic>I<sub>min</sub></italic> are the upper and lower bounds of the normalization range. For a given prediction:</p>
<p>
<list list-type="simple">
<list-item>
<label>(1)</label>
<p>If all the atomic or fragmental contributions are &#x02265; 0, they are normalized to &#x0005B;0, 1&#x0005D;.</p>
</list-item>
<list-item>
<label>(2)</label>
<p>If all the contributions are &#x02264; 0, they are normalized to &#x0005B;&#x02013;1, 0&#x0005D;.</p>
</list-item>
<list-item>
<label>(3)</label>
<p>Otherwise, the normalization range is set to &#x0005B;&#x02013;1, 1&#x0005D; (most cases).</p>
</list-item>
</list>
</p>
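<p>The normalization formula and the three range rules can be written compactly as follows. This is a minimal sketch with illustrative function names. Raising an error when all contributions of a molecule are equal is an assumption of the sketch, because the formula is undefined in that case and the text does not specify how it is handled.</p>
<code language="python"><![CDATA[
```python
from typing import List, Sequence, Tuple


def normalization_range(contributions: Sequence[float]) -> Tuple[float, float]:
    """Choose (I_min, I_max) from the signs of the contributions (rules 1-3)."""
    if all(c >= 0 for c in contributions):
        return (0.0, 1.0)
    if all(c <= 0 for c in contributions):
        return (-1.0, 0.0)
    return (-1.0, 1.0)


def normalize_contributions(contributions: Sequence[float]) -> List[float]:
    """Min-max scale the contributions of one molecule into the selected range."""
    if not contributions:
        raise ValueError("no contributions to normalize")
    delta_min, delta_max = min(contributions), max(contributions)
    if delta_max == delta_min:
        # Assumption of this sketch: the formula divides by zero here,
        # so the case is reported instead of being silently handled.
        raise ValueError("all contributions are equal; normalization is undefined")
    i_min, i_max = normalization_range(contributions)
    span = delta_max - delta_min
    return [(c - delta_min) / span * (i_max - i_min) + i_min for c in contributions]
```
]]></code>
<p>For example, the mixed-sign contributions &#x0005B;&#x02013;0.5, 0.25, 1.0&#x0005D; are mapped to &#x0005B;&#x02013;1, 0, 1&#x0005D;. Because the scaling is a min-max transform, the smallest contribution is always mapped to the lower bound of the range, so a raw value of zero is not necessarily mapped to zero.</p>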
</sec>
</sec>
<sec id="s3"><title>Results</title>
<sec><title>Dataset properties</title>
<p>Before building the models, a property analysis was performed on the different datasets to ensure that they were balanced and showed a reasonable degree of variability in the solubility values. The experimental solubility (S<sub>exp</sub>) distributions of these datasets are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. The training set, validation set, test set, and external test set A showed similar and diverse distributions. However, the property distribution of external test set B is biased towards less soluble molecules.</p>
<fig id="F3" position="float"><label>Figure 3.</label><caption><p>Distribution of experimental LogS and molecular property</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g003.tif"/></fig>
</sec>
<sec><title>Descriptor analysis</title>
<p>A set of physicochemical descriptors was used to represent each molecule, and a PCA of these descriptors was also performed. The PCA results are shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, where points are colored according to their experimental LogS. In this study, a compound is classified as insoluble if its LogS is less than &#x02013;2.0; otherwise it is considered soluble. <xref ref-type="fig" rid="F4">Figure 4</xref> shows that the chemical space described by the physicochemical descriptors is diverse and that the partitioning of soluble and insoluble compounds is visible. Most of the soluble molecules (blue points) are located on the inner side while most of the insoluble ones (red points) are on the right and outer side, demonstrating the ability of these descriptors to distinguish soluble from insoluble molecules. The PCA of the three datasets also shows that they share a similar distribution.</p>
<fig id="F4" position="float"><label>Figure 4.</label><caption><p>PCA of the descriptor space for the three datasets. (A) Training set; (B) validation set; (C) test set</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g004.tif"/></fig>
</sec>
<sec><title>Model performance</title>
<p>Three ML models were developed on our training set of 7,789 compounds and then evaluated on three test sets using four performance metrics. All performance results are depicted in <xref ref-type="fig" rid="F5">Figure 5</xref> and <xref ref-type="sec" rid="s5">Tables S3</xref>&#x02013;<xref ref-type="sec" rid="s5">5</xref>. Across the three test sets, our three models, namely RF with default descriptors (RF_Property), DNN (DNN_Property), and MPNN, performed comparably to one another and better than the four reference models. The RF_Property model showed consistently excellent performance across the three test sets and the different metrics. On the test set, the RMSEs of the four reference models were 1.06, 1.16, 1.22, and 1.04, respectively, while those of our three models (RF_Property, DNN_Property, and MPNN) were all 0.90, indicating that our models have better predictive ability.</p>
<fig id="F5" position="float"><label>Figure 5.</label><caption><p>Model performance. (A) Coefficient of determination results; (B) RMSE results; (C) &#x00025;LogS &#x000B1; 0.7; (D) &#x00025;LogS &#x000B1; 1.0</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g005.tif"/></fig>
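<p>For readers who wish to reproduce the four metrics reported in <xref ref-type="fig" rid="F5">Figure 5</xref>, a minimal sketch is given below. It assumes that <italic>R</italic><sup>2</sup> is the coefficient of determination computed as one minus the ratio of the residual to the total sum of squares, and that &#x00025;LogS &#x000B1; 0.7 and &#x00025;LogS &#x000B1; 1.0 are the fractions of compounds whose absolute prediction error does not exceed 0.7 and 1.0 log units, with the boundary included. The function names and the example values are illustrative and are not data from this study.</p>
<code language="python"><![CDATA[
```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between experimental and predicted LogS."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fraction_within(y_true: Sequence[float], y_pred: Sequence[float],
                    tolerance: float) -> float:
    """Fraction of compounds with |experimental - predicted| <= tolerance."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)


# Made-up LogS values, used only to exercise the functions.
example_true = [-1.0, -2.0, -3.0, -4.0]
example_pred = [-1.5, -2.0, -2.0, -5.5]
within_07 = fraction_within(example_true, example_pred, 0.7)   # 2 of 4 compounds
within_10 = fraction_within(example_true, example_pred, 1.0)   # 3 of 4 compounds
```
]]></code>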
<p>The three developed ML models differed less from one another on the test set than on the two external test sets. Because the training set, validation set, and test set were randomly split from the same curated dataset, they had similar distributions and shared the same systematic error. Thus, the external test sets were very important for assessing the solubility models objectively. On external test sets A/B, the <italic>R</italic><sup>2</sup> values of RF_Property, DNN_Property, and MPNN were 0.795/0.490, 0.780/0.507, and 0.744/0.361, and their &#x00025;LogS &#x000B1; 1.0 values were 0.850/0.887, 0.770/0.887, and 0.740/0.850, respectively. This shows that the RF_Property model performs better overall than the DNN_Property and MPNN models. This is not surprising, as tree-based models perform better on tabular-style datasets than standard deep models &#x0005B;<xref ref-type="bibr" rid="B36">36</xref>&#x0005D;, and a systematic study by Jiang et al. &#x0005B;<xref ref-type="bibr" rid="B42">42</xref>&#x0005D; also demonstrated that descriptor-based models could achieve better or comparable performance in the prediction of many molecular properties. On external test set B, which contains 62 compounds under pH 7, the simple linear ESOL equation achieved a performance (RMSE of 0.64) comparable to that of RF_Property (RMSE of 0.63), while some other ML models obtained worse results; e.g., the RMSEs of MPNN and GraphConv were 0.71 and 0.90, respectively. On the other two test sets, RF_ESOL performed better than the linear model, but worse than the physicochemical descriptors combined with the RF or DNN algorithms. These results show that both the descriptors and non-linear ML techniques are important for the quality of the final model.</p>
</sec>
<sec><title>Outlier analysis</title>
<p>For compounds with an absolute error larger than 1.0 (error bars in <xref ref-type="fig" rid="F6">Figure 6</xref>), some recurrent substructures were found (<xref ref-type="fig" rid="F7">Figure 7</xref>), such as nitrogen-containing heterocycles and aromatic systems. In external test set B, the absolute errors of the two most soluble molecules were larger than those of the other compounds. <xref ref-type="fig" rid="F3">Figure 3</xref> shows that highly soluble compounds (LogS &#x0003E; 0.00) were under-represented in the training set, which may explain these two outliers. For the outlier named KEMDOW, the Crippen LogP was &#x02013;1.735, which may lead to the experimental error.</p>
<fig id="F6" position="float"><label>Figure 6.</label><caption><p>Scatter plot of experimental and predicted LogS from RF_Property model (the error bars are computed from the difference between predicted and experimental LogS). (A) External test set A; (B) external test set B</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g006.tif"/></fig>
<fig id="F7" position="float"><label>Figure 7.</label><caption><p>Some chemical structures of outliers. The caption under each structure is the molecule name, experimental LogS (predicted LogS from RF_Property model)</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g007.tif"/></fig>
</sec>
<sec><title>Solubility model interpretation</title>
<p>After evaluating the performance of different algorithms, the second part of this study evaluates the suitability of different interpretation methods for the best-performing algorithm, the RF_Property model.</p>
<p>First, the TreeSHAP method was used to compute the feature importance based on 1,000 random compounds from the training set. The resulting feature importance is shown in <xref ref-type="fig" rid="F8">Figure 8</xref> and indicates that Crippen LogP (LogP) and its square (LogP2) are the most relevant descriptors in the solubility prediction and are strongly correlated. The higher the LogP or LogP2 descriptor values, the lower the solubility value, which is in accordance with our intuition. The hydrophilic index (Hy) and some Burden descriptors (bcute10, bcutm3, and bcutm4) also play a role in the predictive model, and their interpretation results indicate that reducing their values could help to improve the solubility. Interestingly, the six most important features according to the Gini importance of the RF model are the same as those from the TreeSHAP method. Such descriptors could provide a simple rule of thumb for a chemist to assess the solubility of a given compound.</p>
<fig id="F8" position="float"><label>Figure 8.</label><caption><p>SHAP interpretation result of RF_Property model. The y-axis lists the most important features and the x-axis shows the SHAP values computed on 1,000 training samples. A positive value indicates a positive contribution to the model prediction, while a negative value indicates a negative contribution</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g008.tif"/></fig>
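<p>To make the meaning of a SHAP value concrete, the sketch below computes exact Shapley values for a two-feature toy model by enumerating all feature orderings against a single reference point. It is not TreeSHAP, which obtains such attributions efficiently for tree ensembles, and it is not the code used in this study. The toy model, its coefficients, and the function names are hypothetical; only the descriptor names LogP and Hy are taken from the text.</p>
<code language="python"><![CDATA[
```python
from itertools import permutations
from math import factorial
from typing import Callable, List, Sequence


def toy_logs(features: Sequence[float]) -> float:
    """Hypothetical model of LogS from (LogP, Hy) with an interaction term."""
    logp, hy = features
    return 1.0 - 1.0 * logp + 1.0 * hy - 0.5 * logp * hy


def shapley_values(
    predict: Callable[[Sequence[float]], float],
    instance: Sequence[float],
    background: Sequence[float],
) -> List[float]:
    """Exact Shapley values of `instance` relative to one `background` point.

    Features are switched from their background value to their instance value
    one at a time, and the change in prediction is averaged over all orderings.
    The cost grows factorially, so this is only usable for a few features.
    """
    n = len(instance)
    totals = [0.0] * n
    for order in permutations(range(n)):
        current = list(background)
        previous = predict(current)
        for j in order:
            current[j] = instance[j]
            value = predict(current)
            totals[j] += value - previous
            previous = value
    return [total / factorial(n) for total in totals]


instance = [2.0, 1.0]       # made-up LogP and Hy of the compound to explain
background = [0.0, 0.0]     # made-up reference compound
phi = shapley_values(toy_logs, instance, background)
```
]]></code>
<p>In this toy case the attributions are &#x02013;2.5 for LogP and 0.5 for Hy, and they sum to the difference between the prediction for the compound and the prediction for the reference, which is the additivity property that SHAP plots rely on.</p>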
<p>Although the feature importance method gives interesting insights into the relevance of specific molecular properties for solubility prediction, it does not help chemists take direct action to improve the solubility of a given compound, as it provides no information on which regions of the molecule increase or reduce the solubility the most. Therefore, we also evaluated the suitability of the atom-coloring scheme and fragment-coloring scheme for model interpretation. As shown in <xref ref-type="fig" rid="F9">Figure 9</xref>, for the compounds NC61 and NC17 of external test set B, both strategies were capable of explaining the atomic or fragmental contributions to the molecular solubility. The carbonyl group was beneficial for the solubility while the ethylene carbons and aromatic rings made negative contributions to it. In the case of compounds C-499 and C1257 of the training set, the interpretation result of the fragment-coloring scheme was more robust and reasonable than that of the atom-coloring scheme. Both schemes showed that the modification from a carbon to a hydroxy group was helpful for improving molecular solubility. In the fragment-coloring results, the aromatic rings and carbon atoms decreased the solubility, while the hydroxy group and ester functional group made positive contributions in both compounds. However, in the atom-coloring results, the contribution of aromatic rings and carbons is not consistent, as the overall color is heavily influenced by the overall prediction value (highly soluble compounds tend to have all atoms painted as having a positive influence, and vice versa). This observation is consistent with the conclusion of Sheridan&#x02019;s work &#x0005B;<xref ref-type="bibr" rid="B39">39</xref>&#x0005D; that atom-level coloration is not robust enough, and indicates that, for this model, fragment-based coloring is more suitable. 
Therefore, in the following part, we focus on the fragment-coloring interpretation. It is worth noting that the heat map coloring and normalization method only considers differences within a molecule. We should focus on the relative values of the intramolecular contributions; it is not meaningful to compare atomic or fragmental contributions between molecules, because they depend on the molecule.</p>
<fig id="F9" position="float"><label>Figure 9.</label><caption><p>Example of atom- and fragment-coloring scheme. The caption of each structure is the molecule name, experimental LogS (predicted LogS from RF_Property model). S<sub>pred</sub>: predicted solubility</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g009.tif"/></fig>
<p>The second example of the fragment-coloring scheme involves six poly-ADP-ribose-polymerase (PARP) inhibitors designed by Johannes et al. &#x0005B;<xref ref-type="bibr" rid="B43">43</xref>&#x0005D;. In their work, they applied structure- and property-based strategies for drug design and identified a series of compounds with excellent efficacy against the target. They also measured the aqueous solubility at pH 7.4 for some of the active compounds, which provides good examples for interpretation in our study. The interpretation results for six compounds (none of which was included in the datasets used for model construction in this study) from the fragment-based coloring strategy are shown in <xref ref-type="fig" rid="F10">Figure 10</xref>. Their most relevant SHAP descriptors and atom-coloring results are also shown in <xref ref-type="sec" rid="s5">Table S6</xref> and <xref ref-type="sec" rid="s5">Figure S1</xref>, respectively. Our RF_Property model has good predictive ability for most of the compounds, and the interpretation results are consistently stable. These six compounds share the same scaffold, and the modified substructures range from an aromatic ring to an ethyl group. As shown in <xref ref-type="fig" rid="F10">Figure 10</xref>, the shared piperazine and imidazole fragments were proposed to make positive contributions to the S<sub>pred</sub>, consistent with the water solubility of the fragments themselves. The shared benzene was proposed to make most of the negative contribution to the S<sub>pred</sub>. For the highly insoluble (predicted LogS &#x02264; &#x02013;4) compounds P10&#x02013;P12, the modified part (a benzene, pyridine, or cyclohexene ring) was predicted to hinder or hardly affect the molecular solubility. The other modifications, in the slightly insoluble (&#x02013;4 &#x0003C; predicted LogS &#x02264; &#x02013;2) compounds P13&#x02013;P15, made positive or almost zero contributions to solubility improvement. 
It is also interesting to see that replacing a carbon with a nitrogen or oxygen within the ring system helped to improve the solubility. For example, the contribution of the modified benzene ring in P10 was similar to that of the fixed benzene, while in P11 the pyridine ring was less negative than the fixed benzene. A similar phenomenon was also observed in compounds P12, P13, and P14. In general, the interpretation results in <xref ref-type="fig" rid="F10">Figure 10</xref> were in line with chemical intuition, showing the good interpretation power of our model and the fragment-based coloring method. Non-normalized results can be found in <xref ref-type="sec" rid="s5">Figure S2</xref>; they show the same color distribution, but with intensities that differ between molecules depending on the S<sub>pred</sub> value.</p>
<fig id="F10" position="float"><label>Figure 10.</label><caption><p>Fragment-coloration results of six PARP inhibitors. The caption of each structure is the molecule name, experimental LogS (predicted LogS from RF_Property model)</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g010.tif"/></fig>
<p>The previous examples highlight standard modifications that a chemist could apply to improve solubility by adding more polar atoms. In our study, four &#x0201C;abnormal&#x0201D; compounds not present in the training or validation sets were also used to validate the interpretability of our model. Three of these molecules are immunomodulatory drugs, namely thalidomide, lenalidomide, and pomalidomide, and the fourth (EM-12) is a related derivative taken from &#x0005B;<xref ref-type="bibr" rid="B44">44</xref>&#x0005D;. For these four compounds, replacing a methylene with a carbonyl group was reported to decrease the molecular solubility, which is against our chemical intuition to some extent. The theoretical study stated that the carbonyl group could form an extended &#x003C0;-conjugation with the carbonyl groups of the right part through the nitrogen, and that such extended &#x003C0;-conjugation led to a lower solubility. Our interpretation results (<xref ref-type="fig" rid="F11">Figure 11</xref>) showed that the carbonyl groups in thalidomide and pomalidomide were proposed to make a negative contribution to the S<sub>pred</sub> while the carbonyl groups in EM-12 and lenalidomide contributed positively, in line with the perspective that the modification from one carbonyl to a methylene could make a positive contribution to the S<sub>pred</sub> by hindering the internal &#x003C0;-conjugation. On the other hand, comparing thalidomide and pomalidomide, the addition of an amine group was proposed to make a negative contribution to the S<sub>pred</sub>. The intramolecular hydrogen bond formed by the amine and the nearest carbonyl oxygen would be unfavorable, which was supported by the experimental solubility value. This added amine group was shown to make zero contribution in lenalidomide, whose experimental solubility and S<sub>pred</sub> were almost the same as those of EM-12.</p>
<fig id="F11" position="float"><label>Figure 11.</label><caption><p>Fragment-coloration results of four compounds. The caption of each structure is the molecule name, experimental LogS (predicted LogS from RF_Property model)</p></caption><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="eds-01-100826-g011.tif"/></fig>
</sec>
</sec>
<sec id="s4"><title>Discussion</title>
<p>In this study, several ML models for predicting the aqueous solubility of small molecules have been proposed and evaluated. Among all the evaluated algorithms and descriptors, the RF_Property model, which combines physicochemical descriptors with the RF technique, achieved the best performance across three different test sets assessed by different metrics.</p>
<p>From the interpretation perspective, we have shown that feature importance extraction provides valuable information on the most relevant descriptors and that the LogP and Hy descriptors play an important role in solubility prediction.</p>
<p>Feature importance, however, does not directly help the ligand optimization process, which can benefit more from heat map coloring. In this area, we have shown that fragment-based coloring offers a more robust interpretation than atom-based coloring and that normalizing the values further improves it. Such visualization can guide chemists in modifying a structure towards a desired property. This strategy has been evaluated in the domain of solubility prediction but could also be applied and validated in other research fields, such as activity prediction and ADMET property prediction, to improve the interpretability of ML models. The implementation used in this paper can be downloaded from <ext-link ext-link-type="uri" xlink:href="https://github.com/Pharmacelera/predictive-model-coloring">https://github.com/Pharmacelera/predictive-model-coloring</ext-link>.</p>
</sec>
</body>
<back>
<glossary><title>Abbreviations</title>
<def-list>
<def-item><term>DNN:</term><def><p>deep neural network</p></def></def-item>
<def-item><term>GraphConv:</term><def><p>graph convolutional</p></def></def-item>
<def-item><term>ML:</term><def><p>machine learning</p></def></def-item>
<def-item><term>MPNN:</term><def><p>message passing neural network</p></def></def-item>
<def-item><term>PCA:</term><def><p>principal component analysis</p></def></def-item>
<def-item><term>RF:</term><def><p>random forest</p></def></def-item>
<def-item><term>RMSE:</term><def><p>root mean squared error</p></def></def-item>
<def-item><term>SHAP:</term><def><p>SHapley Additive exPlanations</p></def></def-item>
<def-item><term>S<sub>pred</sub>:</term><def><p>predicted solubility</p></def></def-item>
</def-list>
</glossary>
<sec id="s5"><title>Supplementary materials</title>
<p>The supplementary material for this article is available at: <ext-link ext-link-type="uri" xlink:href="https://www.explorationpub.com/uploads/Article/file/100826_sup_1.pdf">https://www.explorationpub.com/uploads/Article/file/100826_sup_1.pdf</ext-link>.</p>
</sec>
<sec id="s6"><title>Declarations</title>
<sec><title>Author contributions</title>
<p>MS: Conceptualization, Investigation, Data curation, Writing&#x02014;original draft, Writing&#x02014;review &#x00026; editing. EH: Conceptualization, Supervision, Writing&#x02014;review &#x00026; editing.</p>
</sec>
<sec><title>Conflicts of interest</title>
<p>The authors declare that they have no conflicts of interest.</p>
</sec>
<sec><title>Ethical approval</title>
<p>Not applicable.</p>
</sec>
<sec><title>Consent to participate</title>
<p>Not applicable.</p>
</sec>
<sec><title>Consent to publication</title>
<p>Not applicable.</p>
</sec>
<sec><title>Availability of data and materials</title>
<p>Solubility data was extracted from AqSolDB (<ext-link ext-link-type="uri" xlink:href="https://dataverse.harvard.edu/dataset.xhtml?persistentId&#x0003D;doi:10.7910/DVN/OVHAW8">https://dataverse.harvard.edu/dataset.xhtml?persistentId&#x0003D;doi:10.7910/DVN/OVHAW8</ext-link>) and a dataset from Cui et al. (<ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2020.00121/full&#x00023;supplementary-material">https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2020.00121/full&#x00023;supplementary-material</ext-link>). In addition, two external test sets were used: the DLS-100 solubility dataset (<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.17630/3a3a5abc-8458-4924-8e6c-b804347605e8">http://dx.doi.org/10.17630/3a3a5abc-8458-4924-8e6c-b804347605e8</ext-link>) and the test set from Cui et al. (<ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2020.00121/full&#x00023;supplementary-material">https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2020.00121/full&#x00023;supplementary-material</ext-link>). The code for low-variance feature filtering can be found in the Supplementary materials. The implementation used in this paper can be downloaded from <ext-link ext-link-type="uri" xlink:href="https://github.com/Pharmacelera/predictive-model-coloring">https://github.com/Pharmacelera/predictive-model-coloring</ext-link>.</p>
</sec>
<sec><title>Funding</title>
<p>This study was partially funded by the European Commission under grant &#x0005B;953418&#x0005D; and by the Spanish Ministry of Science and Innovation under grant &#x0005B;PTQ2020-011237&#x0005D;. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</p>
</sec>
<sec><title>Copyright</title>
<p>&#x000A9; The Author(s) 2023.</p>
</sec>
</sec>
<ref-list>
<ref id="B1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gozalbes</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Pineda-Lucena</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>QSAR-based solubility model for drug-like compounds</article-title>
<source>Bioorg Med Chem</source>
<year iso-8601-date="2010">2010</year>
<volume>18</volume>
<fpage>7078</fpage>
<lpage>84</lpage>
<pub-id pub-id-type="doi">10.1016/j.bmc.2010.08.003</pub-id><pub-id pub-id-type="pmid">20810286</pub-id></element-citation>
</ref>
<ref id="B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boobier</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hose</surname>
<given-names>DRJ</given-names>
</name>
<name>
<surname>Blacker</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>BN</given-names>
</name>
</person-group>
<article-title>Machine learning with physicochemical relationships: solubility prediction in organic solvents and water</article-title>
<source>Nat Commun</source>
<year iso-8601-date="2020">2020</year>
<volume>11</volume>
<elocation-id>5753</elocation-id>
<pub-id pub-id-type="doi">10.1038/s41467-020-19594-z</pub-id><pub-id pub-id-type="pmid">33188226</pub-id><pub-id pub-id-type="pmcid">PMC7666209</pub-id></element-citation>
</ref>
<ref id="B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Palmer</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>O’Boyle</surname>
<given-names>NM</given-names>
</name>
<name>
<surname>Glen</surname>
<given-names>RC</given-names>
</name>
<name>
<surname>Mitchell</surname>
<given-names>JB</given-names>
</name>
</person-group>
<article-title>Random forest models to predict aqueous solubility</article-title>
<source>J Chem Inf Model</source>
<year iso-8601-date="2007">2007</year>
<volume>47</volume>
<fpage>150</fpage>
<lpage>8</lpage>
<pub-id pub-id-type="doi">10.1021/ci060164k</pub-id><pub-id pub-id-type="pmid">17238260</pub-id></element-citation>
</ref>
<ref id="B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rudin</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead</article-title>
<source>Nat Mach Intell</source>
<year iso-8601-date="2019">2019</year>
<volume>1</volume>
<fpage>206</fpage>
<lpage>15</lpage>
<pub-id pub-id-type="doi">10.1038/s42256-019-0048-x</pub-id><pub-id pub-id-type="pmid">35603010</pub-id><pub-id pub-id-type="pmcid">PMC9122117</pub-id></element-citation>
</ref>
<ref id="B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodríguez-Pérez</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bajorath</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Explainable machine learning for property predictions in compound optimization</article-title>
<source>J Med Chem</source>
<year iso-8601-date="2021">2021</year>
<volume>64</volume>
<fpage>17744</fpage>
<lpage>52</lpage>
<pub-id pub-id-type="doi">10.1021/acs.jmedchem.1c01789</pub-id><pub-id pub-id-type="pmid">34902252</pub-id></element-citation>
</ref>
<ref id="B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sorkun</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Khetan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Er</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds</article-title>
<source>Sci Data</source>
<year iso-8601-date="2019">2019</year>
<volume>6</volume>
<elocation-id>143</elocation-id>
<pub-id pub-id-type="doi">10.1038/s41597-019-0151-1</pub-id><pub-id pub-id-type="pmid">31395888</pub-id><pub-id pub-id-type="pmcid">PMC6687799</pub-id></element-citation>
</ref>
<ref id="B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ni</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>YD</given-names>
</name>
<etal>et al.</etal>
</person-group>
<article-title>Improved prediction of aqueous solubility of novel compounds by going deeper with deep learning</article-title>
<source>Front Oncol</source>
<year iso-8601-date="2020">2020</year>
<volume>10</volume>
<elocation-id>121</elocation-id>
<pub-id pub-id-type="doi">10.3389/fonc.2020.00121</pub-id><pub-id pub-id-type="pmid">32117768</pub-id><pub-id pub-id-type="pmcid">PMC7026387</pub-id></element-citation>
</ref>
<ref id="B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McDonagh</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Nath</surname>
<given-names>N</given-names>
</name>
<name>
<surname>De Ferrari</surname>
<given-names>L</given-names>
</name>
<name>
<surname>van Mourik</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Mitchell</surname>
<given-names>JB</given-names>
</name>
</person-group>
<article-title>Uniting cheminformatics and chemical theory to predict the intrinsic aqueous solubility of crystalline druglike molecules</article-title>
<source>J Chem Inf Model</source>
<year iso-8601-date="2014">2014</year>
<volume>54</volume>
<fpage>844</fpage>
<lpage>56</lpage>
<pub-id pub-id-type="doi">10.1021/ci4005805</pub-id><pub-id pub-id-type="pmid">24564264</pub-id><pub-id pub-id-type="pmcid">PMC3965570</pub-id></element-citation>
</ref>
<ref id="B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boobier</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Osbourn</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mitchell</surname>
<given-names>JBO</given-names>
</name>
</person-group>
<article-title>Can human experts predict solubility better than computers?</article-title>
<source>J Cheminform</source>
<year iso-8601-date="2017">2017</year>
<volume>9</volume>
<elocation-id>63</elocation-id>
<pub-id pub-id-type="doi">10.1186/s13321-017-0250-y</pub-id><pub-id pub-id-type="pmid">29238891</pub-id><pub-id pub-id-type="pmcid">PMC5729181</pub-id></element-citation>
</ref>
<ref id="B10">
<label>10</label>
<element-citation publication-type="web">
<person-group person-group-type="author">
<name>
<surname>Mitchell</surname>
<given-names>JBO</given-names>
</name>
<name>
<surname>McDonagh</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Boobier</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>DLS-100 solubility dataset [Internet]</article-title>
<comment>University of St Andrews; [cited 2017 Oct 27]. Available from: <uri xlink:href="http://dx.doi.org/10.17630/3a3a5abc-8458-4924-8e6c-b804347605e8">http://dx.doi.org/10.17630/3a3a5abc-8458-4924-8e6c-b804347605e8</uri></comment>
</element-citation>
</ref>
<ref id="B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cao</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>YZ</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>GS</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>QS</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>PyDPI: freely available Python package for chemoinformatics, bioinformatics, and chemogenomics studies</article-title>
<source>J Chem Inf Model</source>
<year iso-8601-date="2013">2013</year>
<volume>53</volume>
<fpage>3086</fpage>
<lpage>96</lpage>
<pub-id pub-id-type="doi">10.1021/ci400127q</pub-id><pub-id pub-id-type="pmid">24047419</pub-id></element-citation>
</ref>
<ref id="B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hotelling</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Analysis of a complex of statistical variables into principal components</article-title>
<source>J Educ Psychol</source>
<year iso-8601-date="1933">1933</year>
<volume>24</volume>
<fpage>498</fpage>
<lpage>520</lpage>
<pub-id pub-id-type="doi">10.1037/h0071325</pub-id></element-citation>
</ref>
<ref id="B13">
<label>13</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ho</surname>
<given-names>TK</given-names>
</name>
</person-group>
<comment>Random decision forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition; 1995 Aug 14; Montreal, Canada. ICDAR; 1995. pp. 278–82.</comment>
<pub-id pub-id-type="doi">10.1109/ICDAR.1995.598994</pub-id></element-citation>
</ref>
<ref id="B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ho</surname>
<given-names>TK</given-names>
</name>
</person-group>
<article-title>The random subspace method for constructing decision forests</article-title>
<source>IEEE Trans Pattern Anal Mach Intell</source>
<year iso-8601-date="1998">1998</year>
<volume>20</volume>
<fpage>832</fpage>
<lpage>44</lpage>
<pub-id pub-id-type="doi">10.1109/34.709601</pub-id></element-citation>
</ref>
<ref id="B15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breiman</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Random forests</article-title>
<source>Mach Learn</source>
<year iso-8601-date="2001">2001</year>
<volume>45</volume>
<fpage>5</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id></element-citation>
</ref>
<ref id="B16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pedregosa</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Varoquaux</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gramfort</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Michel</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Thirion</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Grisel</surname>
<given-names>O</given-names>
</name>
<etal>et al.</etal>
</person-group>
<article-title>Scikit-learn: machine learning in Python</article-title>
<source>J Mach Learn Res</source>
<year iso-8601-date="2011">2011</year>
<volume>12</volume>
<fpage>2825</fpage>
<lpage>30</lpage>
</element-citation>
</ref>
<ref id="B17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rosenblatt</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Principles of neurodynamics. Perceptrons and the theory of brain mechanisms</article-title>
<source>Am J Psychol</source>
<year iso-8601-date="1963">1963</year>
<volume>76</volume>
<fpage>705</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.2307/1419730</pub-id></element-citation>
</ref>
<ref id="B18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rumelhart</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>GE</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>RJ</given-names>
</name>
</person-group>
<article-title>Learning representations by back-propagating errors</article-title>
<source>Nature</source>
<year iso-8601-date="1986">1986</year>
<volume>323</volume>
<fpage>533</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.1038/323533a0</pub-id></element-citation>
</ref>
<ref id="B19">
<label>19</label>
<element-citation publication-type="web">
<person-group person-group-type="author">
<name>
<surname>Kingma</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Ba</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Adam: a method for stochastic optimization</article-title>
<comment>arXiv:1412.6980 [Preprint]. 2015 [cited 2023 Feb 14]. Available from: <uri xlink:href="https://doi.org/10.48550/arXiv.1412.6980">https://doi.org/10.48550/arXiv.1412.6980</uri></comment>
</element-citation>
</ref>
<ref id="B20">
<label>20</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Paszke</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gross</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chintala</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chanan</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>E</given-names>
</name>
<name>
<surname>DeVito</surname>
<given-names>Z</given-names>
</name>
<etal>et al.</etal>
</person-group>
<comment>Automatic differentiation in PyTorch. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017); 2017 Dec 4–9; Long Beach, CA, USA. NIPS; 2017.</comment>
</element-citation>
</ref>
<ref id="B21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Paszke</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gross</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Massa</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Lerer</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bradbury</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chanan</surname>
<given-names>G</given-names>
</name>
<etal>et al.</etal>
</person-group>
<article-title>PyTorch: an imperative style, high-performance deep learning library</article-title>
<source>Adv Neural Inf Process Syst</source>
<year iso-8601-date="2019">2019</year>
<volume>32</volume>
<fpage>8026</fpage>
<lpage>37</lpage>
</element-citation>
</ref>
<ref id="B22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gilmer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schoenholz</surname>
<given-names>SS</given-names>
</name>
<name>
<surname>Riley</surname>
<given-names>PF</given-names>
</name>
<name>
<surname>Vinyals</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Dahl</surname>
<given-names>GE</given-names>
</name>
</person-group>
<article-title>Neural message passing for quantum chemistry</article-title>
<source>PMLR</source>
<year iso-8601-date="2017">2017</year>
<volume>70</volume>
<fpage>1263</fpage>
<lpage>72</lpage>
</element-citation>
</ref>
<ref id="B23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Swanson</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Jin</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Coley</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Eiden</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>H</given-names>
</name>
<etal>et al.</etal>
</person-group>
<article-title>Analyzing learned molecular representations for property prediction</article-title>
<source>J Chem Inf Model</source>
<year iso-8601-date="2019">2019</year>
<volume>59</volume>
<fpage>3370</fpage>
<lpage>88</lpage>
<comment>Erratum in: J Chem Inf Model. 2019;59:5304–5.</comment>
<pub-id pub-id-type="doi">10.1021/acs.jcim.9b00237</pub-id><pub-id pub-id-type="pmid">31361484</pub-id><pub-id pub-id-type="pmcid">PMC6727618</pub-id></element-citation>
</ref>
<ref id="B24">
<label>24</label>
<element-citation publication-type="web">
<article-title>Message passing neural networks [Internet]</article-title>
<comment>DeepChem; c2022 [cited 2023 Feb 14]. Available from: <uri xlink:href="https://github.com/deepchem/deepchem/tree/master/contrib/mpnn">https://github.com/deepchem/deepchem/tree/master/contrib/mpnn</uri></comment>
</element-citation>
</ref>
<ref id="B25">
<label>25</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Duvenaud</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Maclaurin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Aguilera-Iparraguirre</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gómez-Bombarelli</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Hirzel</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Aspuru-Guzik</surname>
<given-names>A</given-names>
</name>
<etal>et al.</etal>
</person-group>
<source>Convolutional networks on graphs for learning molecular fingerprints</source>
<publisher-loc>NIPS 2015</publisher-loc>
<publisher-name>Proceedings of Advances in Neural Information Processing Systems</publisher-name>
<comment>2015 Dec 7–12; Montreal, Canada. NIPS; 2015. pp. 2215–23.</comment>
</element-citation>
</ref>
<ref id="B26">
<label>26</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ramsundar</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Eastman</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Walters</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pande</surname>
<given-names>V</given-names>
</name>
</person-group>
<source>Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more</source>
<publisher-loc>Sebastopol, CA</publisher-loc>
<publisher-name>O’Reilly Media</publisher-name>
<year iso-8601-date="2019">2019</year>
</element-citation>
</ref>
<ref id="B27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name>
<surname>Tanchuk</surname>
<given-names>VY</given-names>
</name>
</person-group>
<article-title>Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program</article-title>
<source>J Chem Inf Comput Sci</source>
<year iso-8601-date="2002">2002</year>
<volume>42</volume>
<fpage>1136</fpage>
<lpage>45</lpage>
<pub-id pub-id-type="doi">10.1021/ci025515j</pub-id><pub-id pub-id-type="pmid">12377001</pub-id></element-citation>
</ref>
<ref id="B28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Delaney</surname>
<given-names>JS</given-names>
</name>
</person-group>
<article-title>ESOL: estimating aqueous solubility directly from molecular structure</article-title>
<source>J Chem Inf Comput Sci</source>
<year iso-8601-date="2004">2004</year>
<volume>44</volume>
<fpage>1000</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1021/ci034243x</pub-id><pub-id pub-id-type="pmid">15154768</pub-id></element-citation>
</ref>
<ref id="B29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Palmer</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Mitchell</surname>
<given-names>JB</given-names>
</name>
</person-group>
<article-title>Is experimental data quality the limiting factor in predicting the aqueous solubility of druglike molecules?</article-title>
<source>Mol Pharm</source>
<year iso-8601-date="2014">2014</year>
<volume>11</volume>
<fpage>2962</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="doi">10.1021/mp500103r</pub-id><pub-id pub-id-type="pmid">24919008</pub-id></element-citation>
</ref>
<ref id="B30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Efron</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Bootstrap methods: another look at the jackknife</article-title>
<source>Ann Stat</source>
<year iso-8601-date="1979">1979</year>
<volume>7</volume>
<fpage>1</fpage>
<lpage>26</lpage>
<pub-id pub-id-type="doi">10.1214/aos/1176344552</pub-id></element-citation>
</ref>
<ref id="B31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wortmann</surname>
<given-names>JH</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>CL</given-names>
</name>
<name>
<surname>Edmondson</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Trauma and PTSD symptoms: does spiritual struggle mediate the link?</article-title>
<source>Psychol Trauma</source>
<year iso-8601-date="2011">2011</year>
<volume>3</volume>
<fpage>442</fpage>
<lpage>52</lpage>
<pub-id pub-id-type="doi">10.1037/a0021413</pub-id><pub-id pub-id-type="pmid">22308201</pub-id><pub-id pub-id-type="pmcid">PMC3269830</pub-id></element-citation>
</ref>
<ref id="B32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Adèr</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Mellenbergh</surname>
<given-names>GJ</given-names>
</name>
<name>
<surname>Hand</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Advising on research methods: a consultant’s companion</article-title>
<source>Jvank</source>
<year iso-8601-date="2008">2008</year>
<volume>574</volume>
<elocation-id>2991</elocation-id>
<pub-id pub-id-type="doi">10.1080/02664763.2011.559375</pub-id></element-citation>
</ref>
<ref id="B33">
<label>33</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lundberg</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>SI</given-names>
</name>
</person-group>
<comment>A unified approach to interpreting model predictions. NIPS 2017: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017 Dec 4–9; California, USA. NY, United States: Curran Associates Inc.; 2017. pp. 4768–77.</comment>
</element-citation>
</ref>
<ref id="B34">
<label>34</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Shapley</surname>
<given-names>LS</given-names>
</name>
</person-group>
<article-title>A value for n-person games</article-title>
<person-group person-group-type="editor">
<name>
<surname>Kuhn</surname>
<given-names>HW</given-names>
</name>
<name>
<surname>Tucker</surname>
<given-names>AW</given-names>
</name>
</person-group>
<source>Contributions to the theory of games</source>
<publisher-loc>Princeton</publisher-loc>
<publisher-name>Princeton University Press</publisher-name>
<year iso-8601-date="1953">1953</year>
<comment>pp. 307–18.</comment>
<pub-id pub-id-type="doi">10.1515/9781400881970-018</pub-id></element-citation>
</ref>
<ref id="B35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodríguez-Pérez</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bajorath</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Interpretation of machine learning models using Shapley values: application to compound potency and multi-target activity predictions</article-title>
<source>J Comput Aided Mol Des</source>
<year iso-8601-date="2020">2020</year>
<volume>34</volume>
<fpage>1013</fpage>
<lpage>26</lpage>
<pub-id pub-id-type="doi">10.1007/s10822-020-00314-0</pub-id><pub-id pub-id-type="pmid">32361862</pub-id><pub-id pub-id-type="pmcid">PMC7449951</pub-id></element-citation>
</ref>
<ref id="B36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lundberg</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Erion</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>H</given-names>
</name>
<name>
<surname>DeGrave</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Prutkin</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Nair</surname>
<given-names>B</given-names>
</name>
<etal>et al.</etal>
</person-group>
<article-title>From local explanations to global understanding with explainable AI for trees</article-title>
<source>Nat Mach Intell</source>
<year iso-8601-date="2020">2020</year>
<volume>2</volume>
<fpage>56</fpage>
<lpage>67</lpage>
<pub-id pub-id-type="doi">10.1038/s42256-019-0138-9</pub-id><pub-id pub-id-type="pmid">32607472</pub-id><pub-id pub-id-type="pmcid">PMC7326367</pub-id></element-citation>
</ref>
<ref id="B37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Riniker</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Landrum</surname>
<given-names>GA</given-names>
</name>
</person-group>
<article-title>Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods</article-title>
<source>J Cheminform</source>
<year iso-8601-date="2013">2013</year>
<volume>5</volume>
<elocation-id>43</elocation-id>
<pub-id pub-id-type="doi">10.1186/1758-2946-5-43</pub-id><pub-id pub-id-type="pmid">24063533</pub-id><pub-id pub-id-type="pmcid">PMC3852750</pub-id></element-citation>
</ref>
<ref id="B38">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Polishchuk</surname>
<given-names>PG</given-names>
</name>
<name>
<surname>Kuz’min</surname>
<given-names>VE</given-names>
</name>
<name>
<surname>Artemenko</surname>
<given-names>AG</given-names>
</name>
<name>
<surname>Muratov</surname>
<given-names>EN</given-names>
</name>
</person-group>
<article-title>Universal approach for structural interpretation of QSAR/QSPR</article-title>
<source>Mol Inf</source>
<year iso-8601-date="2013">2013</year>
<volume>32</volume>
<fpage>843</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.1002/minf.201300029</pub-id><pub-id pub-id-type="pmid">27480236</pub-id></element-citation>
</ref>
<ref id="B39">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sheridan</surname>
<given-names>RP</given-names>
</name>
</person-group>
<article-title>Interpretation of QSAR models by coloring atoms according to changes in predicted activity: How robust is it?</article-title>
<source>J Chem Inf Model</source>
<year iso-8601-date="2019">2019</year>
<volume>59</volume>
<fpage>1324</fpage>
<lpage>37</lpage>
<pub-id pub-id-type="doi">10.1021/acs.jcim.8b00825</pub-id><pub-id pub-id-type="pmid">30779563</pub-id></element-citation>
</ref>
<ref id="B40">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Harren</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Matter</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Hessler</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Rarey</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Grebner</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Interpretation of structure–activity relationships in real-world drug design data sets using explainable artificial intelligence</article-title>
<source>J Chem Inf Model</source>
<year iso-8601-date="2022">2022</year>
<volume>62</volume>
<fpage>447</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="doi">10.1021/acs.jcim.1c01263</pub-id><pub-id pub-id-type="pmid">35080887</pub-id></element-citation>
</ref>
<ref id="B41">
<label>41</label>
<element-citation publication-type="web">
<article-title>RDKit: open-source cheminformatics software [Internet]</article-title>
<comment>GitHub; [cited 2023 Feb 14]. Available from: <uri xlink:href="https://www.rdkit.org">https://www.rdkit.org</uri></comment>
</element-citation>
</ref>
<ref id="B42">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Hsieh</surname>
<given-names>CY</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Liao</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z</given-names>
</name>
<etal>et al.</etal>
</person-group>
<article-title>Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models</article-title>
<source>J Cheminform</source>
<year iso-8601-date="2021">2021</year>
<volume>13</volume>
<elocation-id>12</elocation-id>
<pub-id pub-id-type="doi">10.1186/s13321-020-00479-8</pub-id><pub-id pub-id-type="pmid">33597034</pub-id><pub-id pub-id-type="pmcid">PMC7888189</pub-id></element-citation>
</ref>
<ref id="B43">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Johannes</surname>
<given-names>JW</given-names>
</name>
<name>
<surname>Balazs</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Barratt</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bista</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Chuba</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Cosulich</surname>
<given-names>S</given-names>
</name>
<etal>et al.</etal>
</person-group>
<article-title>Discovery of 5-{4-[(7-Ethyl-6-oxo-5,6-dihydro-1,5-naphthyridin-3-yl)methyl]piperazin-1-yl}-<italic>N</italic>-methylpyridine-2-carboxamide (AZD5305): a PARP1–DNA trapper with high selectivity for PARP1 over PARP2 and other PARPs</article-title>
<source>J Med Chem</source>
<year iso-8601-date="2021">2021</year>
<volume>64</volume>
<fpage>14498</fpage>
<lpage>512</lpage>
<pub-id pub-id-type="doi">10.1021/acs.jmedchem.1c01012</pub-id><pub-id pub-id-type="pmid">34570508</pub-id><pub-id pub-id-type="pmcid">PMC8591734</pub-id></element-citation>
</ref>
<ref id="B44">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kong</surname>
<given-names>NR</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Che</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>LH</given-names>
</name>
</person-group>
<article-title>Physicochemistry of cereblon modulating drugs determines pharmacokinetics and disposition</article-title>
<source>ACS Med Chem Lett</source>
<year iso-8601-date="2021">2021</year>
<volume>12</volume>
<fpage>1861</fpage>
<lpage>5</lpage>
<pub-id pub-id-type="doi">10.1021/acsmedchemlett.1c00475</pub-id><pub-id pub-id-type="pmid">34795877</pub-id><pub-id pub-id-type="pmcid">PMC8591734</pub-id></element-citation>
</ref>
</ref-list>
</back>
</article>