Je, Kim, and Kim: Investigating Chemical Interpretability in Graph Neural Networks via Atom-wise Shapley Additive Explanations

INTRODUCTION▼

The development of machine learning (ML) predictive models has been significantly accelerated nowadays by the rapid advancements in computational power and the exponential growth of data across various disciplines.^1-3 Chemistry has also benefited from this trend, with the growing accessibility of molecular physicochemical data allowing researchers to apply various ML models to predict molecular characteristics.^4-7 Among these models, deep learning has demonstrated remarkable accuracy and efficiency, making it a preferred choice for predicting molecular properties or activities.^8-10 Over the past few years, extensive efforts have been made to develop predictive models, leading to highly accurate deep learning models for applications such as drug discovery¹¹ and new material design.^12-14

However, these deep learning models often rely on complex algorithms that lack interpretability. Such limitations make it difficult for chemists to extract scientific insights from their predictions and verify whether the model is appropriately trained by maintaining consistency with established chemical knowledge and theoretical principles.^15,16 This challenge arises due to the black-box nature of deep learning models, which obscures the decision-making process for designing new chemicals based on their predictions. In other words, it is hard to backtrace model’s reasoning process that leads to specific predicted chemical property values. To address this issue, various strategies for models’ interpretability have been proposed. The importance of interpretable ML has also been highlighted in advancing renewable energy¹⁷ and other applications.¹⁸ These strategies can be broadly categorized into model-specific approaches and model-agnostic methods.¹⁹ Among these approaches, additive explanations have gained significant attention due to their intuitive nature and strong alignment with traditional chemical heuristics. Traditionally, the concept of additivity has long been utilized in chemistry, as seen in Benson’s group additivity method²⁰ and Joback’s thermodynamic property estimation approach.²¹ Likewise, researchers can enhance the interpretability of data-driven predictions by integrating additive explanation techniques into deep learning models, making them more accessible and scientifically meaningful.

In this regard, SHAP(SHapley Additive exPlanations)²² has been widely utilized as one of the additive explanation methods. SHAP provides a model-independent approach to interpretability, making it applicable to various ML models, including random forest (RF), support vector machine (SVM), and neural networks.¹¹ Specifically, SHAP-based interpretation techniques, such as DeepExplainer and TreeExplainer, have been developed to analyze feature contributions in different models.^23-25 While SHAP is well-suited for models predicting outputs based on independent or less interdependent features, its application to graph neural networks (GNNs) remains challenging.

Recent studies have proposed modified SHAP techniques tailored for GNN prediction models, but they entail complex Monte Carlo sampling procedure and their applications in chemistry are rare.^26-30 Meanwhile, various methods besides SHAP have also been developed and applied to GNNs for their explainability, such as GNNExplainer, class activation map (CAM), and gradient CAM (GCAM).^31-34 Such methods provide valuable insights into molecular structures and their contributions to specific properties. Nevertheless, further validation is required in terms of chemical feasibility of these interpretations and their applicability to broader chemical systems. Specifically, if one can obtain explanations in terms of SHAP-based atom-wise contributions to molecules’ properties, their chemical validity can be readily assessed by associating the contribution values with chemical knowledge.

Here, we developed the modified SHAP strategies that yield atom-wise contribution values explaining GNN prediction results (Scheme 1). Then, we applied this method to the previously reported GNNs for predicting fuels’ reactivity (cetane numbers; CNs)³⁵ and Gibbs free energy of solvation (ΔG_solv).³⁶ Finally, we found appropriate chemical explanations from the atom-wise contribution values and their consistency with known chemical knowledge was evaluated. Chemical feasibility of contribution values to CN were assessed in terms of the cleavage of the weakest bond and radical stability which are relevant to fuel’s reactivity. Meanwhile, the GNNs for ΔG_solv were applied to explaining varying reaction rates in different organic solvents. The prediction results were explained by associating positive/negative contribution values with reactants’ destabilization/stabilization in solvents, and by evaluating the correlation between atom-wise SHAP values and reaction rates. Our results demonstrate that SHAP-derived descriptors can be obtained without computationally expensive quantum-mechanical calculations, and they effectively capture fuel’s reactivity and other atomic properties related to reactivity in solvents.

Scheme1.

Workflow for enhancing interpretability in graph neural networks via atom-wise Shapley Additive Explanations.

METHODS▼

Detailed explanation about the GNN architecture is available in our previous studies.^35,36 Here, we briefly describe the overall GNN architecture in the context of applying the modified SHAP analysis to the models. Fig. 1 illustrates the structures of hidden layers of our GNNs used for predicting cetane number (CN) and Gibbs free energy of solvation (ΔG_solv). The input molecule is given as a two-dimensional structure described as Simplified Molecular Input Line Entry System (SMILES) strings. Atom, bond, and global features of a molecule are then generated using the RDKit Python cheminformatics package.³⁷ Each feature is converted into N_dim-dimensional input embedding vectors and subjected to message-passing layers. The embedded vectors pass through the hidden message-passing layers and undergo mathematical operations: addition, element-wise multiplication, rectified linear unit (ReLU) activation, concatenation etc. At each message-passing layer, the atom, bond, and global feature vectors are updated based on the structural information gained from trilateral relationships among them. Through these updates, the GNN model learns the inferences regarding the molecular structural influences on the target molecular properties for prediction. The weight matrices inside the message-passing layers are tuned during the training so that the model accurately predicts the target molecular property.

Figure1.

Architecture of the graph neural networks for predicting cetane numbers and ΔG_solv.

The optimal values of N_dim are 64 and 128 for the GNNs predicting CN, ΔG_solv, respectively. The optimal number of hidden layers is five for both models. Detailed procedure regarding the hyperparameter tuning is available in our previous publications.^35,36 As depicted in Fig. 1, the GNNs entail complex arithmetic calculations among the vectors representing molecular structures. Therefore, it is not straightforward to directly apply SHAP or other ML interpretability analysis to the entire model. Instead, we built simplified surrogate models in which the modified SHAP analysis were adopted for chemical interpretation of the prediction results. We named the analysis as ‘modified’ SHAP, since it is based on SHAP DeepExplainer developed for generic neural networks, and it was modified and applied to the message-passing GNNs. Specifically, we developed the model that inputs atom and global features cumulatively updated from the 4^th layer of the model (Fig. 2(a)). The 4^th-layer atom and global vectors and trained weights are adopted from the original model and adopted in the surrogate model for applying modified SHAP. In other words, no new model training is performed; the developed GNNs are truncated and duplicated so that they reproduce the same prediction values with the original model.

These vectors from the 4^th layer contain detailed inferences about molecular structures delivered from the previous four layers. Therefore, they are appropriate to be utilized as inputs for the modified SHAP analysis. In the simplified model for SHAP, the vectors from the 4^th layer undergo updates that are supposed to be done in the last (5^th) layer (Fig. 2(a)). The 4^th-layer bond state vectors were omitted from the SHAP models as they do not effectively change atom and global state vectors in the 5^th layer. However, it should be noted that the bond state information also affects the SHAP values, because molecular information is received and transferred by bond state vectors during the updates in the 1^st-4^th hidden layers. After the 5^th layer, the updated N_dim-dimensional global state vector is subjected to the final dense layer for readout, resulting in the predicted CN. The SHAP DeepExplainer method is applied to this simplified surrogate model, resulting in SHAP values corresponding to the 4^th-layer atom and global vectors, respectively. These SHAP vectors consist of values implying additive relationships between model inputs (atom and global states) and prediction values (CNs). Atom-wise contribution values are calculated from the SHAP values (vide infra for details), and model-predicted values are chemically explained based on these contribution values.

Figure2.

Architectures of the surrogate GNN models to implement the modified SHAP analysis. Atom-wise contribution values were obtained for two predictive models for (a) cetane numbers and (b) ΔG_solv.

Meanwhile, we also built the surrogate model for ΔG_solv to carry out the modified SHAP analysis (Fig. 2(b)). Overall, the structure is similar to the surrogate model for CN (Fig. 2(a)), but it takes four input vectors: the 4^th-layer atom and global vectors for a solute and solvent molecule. Two input vectors for solute and solvent undergo the 5^th update, separately, and the two final global state vectors pass through readout layers, outputting ΔG_solv. The SHAP values are then obtained for the four input vectors, and they are transformed into atom-wise contribution values for atoms in a solute. One can distribute SHAP values to atoms in either solute or solvent, but here solute was only chosen to scrutinize solvation chemistry of one solute stemming from different solvents.

Fig. 3 illustrates the details about how SHAP values and atom-wise contribution values are evaluated from the surrogate models. From the CN prediction model (Fig. 3(a)), we obtained two SHAP vectors: SHAP_atom and SHAP_glob for an averaged atom vector and global vector, respectively. The total SHAP value for one molecule is obtained by element-wise summation of these two vectors:

Figure3.

Detailed illustration of the surrogate model architectures and resulting SHAP vectors for the predictive model of (a) cetane number and (b) ΔG_solv. (c) Proportional distribution of SHAP values to calculate atom-wise contribution values.

(1)

S H A P total = S H A P atom + S H A P glob .

Meanwhile, the surrogate model also outputs the predicted CN value (CN_surr). Fig. 3(b) depicts the surrogate model and SHAP vectors for the ΔG_solv model. The total SHAP vector for a molecule is obtained by adding four SHAP vectors corresponding to atom and global states of a solvent and solute:

(2)

S H A P total = S H A P atom,solvent + S H A P glob,solvent + S H A P atom,solute + S H A P glob,solute .

The surrogate model also yields the predicted ΔG_solv (ΔG_solv,surr).

Of note, the quality of SHAP-based explanations largely depends on how accurately the surrogate model captures the behavior of the original GNN, particularly in cases involving high-dimensional inputs or strongly nonlinear relationships. When the numerical complexity of the original model prevents the surrogate from reproducing its performance, the applicability of our modified SHAP approach becomes limited. Although the method is generally applicable, careful consideration is required during both implementation and interpretation, especially in cases where constructing a reliable surrogate model is challenging. For reliable interpretation, the predicted CN and ΔG_solv from the surrogate models (CN_surr and ΔG_solv,surr) should be identical to those from the original GNN model (CN_GNN and ΔG_solv,GNN). We verified the identicalness of those two values and confirmed that our surrogate models are valid. Errors below 0.001 were obtained in terms of the mean absolute error (MAE) between CN_surr and CN_GNN for 630 molecules, and for the MAE between ΔG_solv,surr and ΔG_solv,GNN for 270 molecules. Therefore, our surrogate models virtually perform the same predictions with the original GNN models, within the numerical error. This validity check demonstrates that these surrogate models are appropriate for the chemical explanation through modified SHAP analysis discussed in Results and Discussion.

The values in SHAP_total need to be appropriately distributed to each atom to obtain atom-wise contribution values, as described in Fig. 3(c). Let the vector elements of SHAP_total be:

(3)

S H A P total = s 1 , s 2 , … , s Ndim ,

and those of the 4^th-layer atom state vector of Atom 1, Atom N are:

(4)

A Atom_1 = a 11 , a 12 , … , a 1 Ndim , A Atom _ N = a N 1 , a N 2 , … , a NNdim .

The SHAP values are proportionally distributed according to the ratio of the values of an atom state vector with respect to all atoms, as follows:

(5)

A ' Atom _ 1 = s 1 a 11 ∑ i N a i 1 , s 2 a 12 ∑ i N a i 2 , … , s N d i m a 1 N d i m ∑ i N a i N d i m ,

where i denotes the atom index. It should be noted that the denominator is summation across different atoms. Then, the atom-wise contribution value for atom i (C_{Atom_i}) is obtained by the summation of all elements in A’:

(6)

C Atom _ 1 = ∑ j N d i m s j a i j ∑ i N a i j ,

where j denotes the element index in a vector. Sum of all C_{Atom_i}’s in a molecule is the predicted CN or ΔG_solv obtained from atom-wise contribution values of our modified SHAP approach:

(7)

CN mod _ SHAP or Δ G solv,mod − SHAP = ∑ i C Atom _ i .

If the method depicted in Fig. 3(c) is valid, there should exist a linear relationship between predicted values from the surrogate model and those from (7):

(8)

CN surr = a ⋅ C N mod-SHAP + b and Δ G solv,surr = a ⋅ Δ G solv,mod − SHAP + b ,

with a close to 1.0. In other words, if the modified SHAP method is reliable, no scaling is needed between predicted values from the surrogate model and those from the modified SHAP. We found that a’s are identical to 1.0 within the numerical errors of 0.0001, for 630 CNs and 270 ΔG_solv values. Such sanity check is another justification of our modified SHAP method.

The intercept b is used to correct C_{Atom_i}’s:

(9)

C' Atom_i = C Atom _ i + b / N

Where N is the number of atoms in a molecule. The sum of C’_{Atom_i}’s then becomes CN_surr or ΔG_solv,surr. It should be noted that adding the same constant value (b/N) to all atoms does not affect the relative difference of atom-wise contribution values among atoms.

All the above methodologies were implemented using the following Python packages for GNNs: Tensorflow 2.4,³⁸ Keras 2.9,³⁹ and neural fingerprint (NFP) 0.3.0,⁴⁰ and the Python package for SHAP.²² Pickle (.pkl) format was used to save hidden layer vectors and trained weights of the original models for CN and ΔG_solv and use them in the surrogate models for the modified SHAP analysis. ALFABET, the ML model for predicting bond dissociation enthalpies, was used to find the correlation between the weakest bond dissociation enthalpies (BDEs) and atom-wise contribution values to CN.^41,42

RESULTS AND DISCUSSION▼

Application 1: Fuel reactivity vs. atom-wise contribution to cetane numbers

We adopted the database consisting of cetane numbers for 630 molecules from the previous study³⁵ to scrutinize the relationship between atom-wise SHAP contribution values and fuel reactivity (CN). Such information offers design guidelines in the atomistic scale for new alternative fuels. Our previous study applied the modified SHAP analysis to only four ether molecules without a detailed explanation of methodologies. The feasibility of our approach was also not verified for the whole 630 molecules in the database. Here, atom-wise contribution values were obtained for these 630 molecules using SHAP analysis as explained in the Methods section. Then, we performed a matching analysis between the atoms with the highest or lowest SHAP values and those involved in the weakest bond dissociation energies (BDEs). This analysis is to validate whether the SHAP-derived atom-wise contributions reflect chemically meaningful features in combustion. Typically, the weakest bond is the most dominant active site for the initiation step of combustion reactions.^43,44 Therefore, the atom pairs involved in the weakest bond should be highly relevant with fuel’s reactivity (i.e., CN), and corresponding atom-wise contributions should be the highest or lowest ones among atoms.

The rationale for considering both maximum and minimum SHAP atoms lies in the dual nature of radical formation upon bond cleavage. Breaking the weakest bond forms either stabilized or destabilized radical species. The former tends to decrease reactivity, resulting in minimum SHAP contribution values and thus a negative influence on CN. In contrast, the generation of destabilized, highly reactive radical species leads to increased fuel reactivity, which corresponds to maximum SHAP values and a positive effect on CN. Therefore, there should be a correlation among the weakest BDEs, minimum/maximum atomwise contribution values, and fuel reactivity.

To validate this hypothesis, the matching analysis was carried out, and Fig. 4 shows the results for 10 representative molecules. Our CN prediction model accurately predicted the CNs of these 10 molecules, supporting that the reliability of our SHAP analysis is not compromised by the prediction accuracy of the GNN model. In Fig. 4(a), coincidences were observed between the weakest BDEs and atoms with the highest or lowest SHAP values. While BDE is a chemical property and SHAP values are derived from statistical explainability methods, this alignment indicates that SHAP effectively captures the underlying chemistry of initial bond dissociation in combustion mechanisms. Although a few mismatched cases were observed (Fig. 4(b)), these cases typically occur when SHAP values were similarly distributed across atoms, making it difficult to localize specific reactive sites. Additionally, there could be uncertainties in BDE values and the possibility that multiple bonds may be involved in the initiation step of fuel ignition. In this regard, we conducted the matching analysis by allowing a certain level of tolerance in BDEs.

Figure4.

Molecular structures with SHAP-Based atom contributions and weakest BDE positions, and cetane number (CN) predictions. Visualization of randomly selected molecules, with (a) five matched molecules and (b) five mismatched molecules. SHAP values are mapped onto atoms, where blue/red represents positive/negative contributions to CN, respectively. The weakest bond dissociation energy (BDE) positions are indicated by purple arrows. The experimental and predicted cetane numbers (CN) for each molecule, represented as Exp./Predicted, are provided below their respective molecular structures.

As shown in Table 1, when the ranking window for the weakest bonds was expanded from the Top 1 to Top 5, the matching percentage increased from 46.2% to 81.9%. No significant differences were observed in the MAEs between experimental and predicted CNs for the matching and mismatching group molecules: 0.3–0.6 CN units. This indicates that the concurrence between the weakest bond and atom-wise contribution does not rely on model’s accuracy but underlying chemistry of ignition. Similarly, when bonds within +2 – +10 kcal/mol above the weakest BDE were included, the matching percentage reached up to 76.0% (Table 2). This result suggests that the mismatching cases like those in Fig. 4(b) can be regarded as matching ones if one considers uncertainties in BDEs. These significantly high percentages highlight the potential applicability of SHAP analysis for providing chemical explanations of fuel reactivity.

Table1.

Results from the matching analysis of the nth-weakest bonds and atoms with maximum/minimum SHAP contribution values^a

Weakest BDE Rank	Matching Molecules	Matching %	Matching Group MAE	Mis-matching Group MAE
1st	46.2	46.2	2.249	2.610
1st ~ 2nd	378	60.0	2.256	2.723
1st ~ 3rd	444	70.5	2.271	2.867
1st ~ 4th	490	77.8	2.367	2.737
1st ~ 5th	516	81.9	2.345	2.915

^aThe number of molecules (out of a total of 630) where the atoms forming the weakest bond coincide with those having the highest or lowest SHAP values. The percentage is calculated based on the total dataset. Mean absolute error (MAE) values correspond to those between predicted and experimental cetane numbers.

Table2.

Results from the matching analysis of bonds within the bond dissociation energy (BDE) thresholds and atoms with maximum/minimum SHAP contribution values^a

BDE Threshold (kcal/mol)	Matching Molecules	Matching %
BDEmin + 0	291	46.2
BDEmin + 2	369	58.6
BDEmin + 4	444	70.5
BDEmin + 6	456	72.4
BDEmin + 8	470	74.6
BDEmin + 10	479	76.0

^aThe number of molecules (out of a total of 630) where the atoms forming the weakest bond coincide with those having the highest or lowest SHAP values. The weakest bond in each molecule, defined as the bond with the lowest bond dissociation energy (BDE_min), was used as the reference. Additional thresholds include all bonds with BDE values within 2, 4, 6, 8, and 10 kcal/mol from BDE_min. The percentage is calculated based on the total dataset.

Moreover, five cases in Fig. 4(a) demonstrate the consistency between the stability of radicals formed from breaking the weakest bond and the sign of atom-wise contribution values (Fig. 5(a)). The above two molecules in Fig. 5(a) are aromatic, forming resonance-stabilized radicals from cleaving the weakest bond. This stabilization leads to lower fuel reactivity, which is consistent with the negative contribution values of the atom pair corresponding to the weakest bond (Red circles of the weakest bond atoms in Fig. 4(a)). In contrast, the below three molecules in Fig. 5(a) are not aromatic, yielding reactive radicals with no resonance stabilization from the weakest bond dissociation. Such gains in reactivity are well-reflected in the positive contribution values (Blue circles of the weakest bond atoms in Fig. 4(a)).

Further validation was performed for the whole database consisting of 630 molecules. We investigated the correlation between the sum of SHAP contribution values of two atoms involved in the weakest bond and CN (Fig. 5(b)). CNs were normalized by the number of atoms to avoid inconsistencies in the magnitude of contribution values stemming from different sizes of molecules. Negative contribution values were flattened to zero (max(0,C’_A1) + max(0,C’_A2) in Fig. 5(b)), since CNs are defined as positive, except for only two molecules in the database. It should be noted that this correction is only applied in this correlation analysis, and it does not indicate the incorrectness of negative SHAP values. As seen in Fig. 5(a), they can contain information about radical stability. As a result, we observed a strong positive correlation between the contribution values and CNs with a Pearson correlation coefficient of 0.82. Six outliers in Fig. 5(b) show deviations from the positive correlation because they are symmetric molecules or have a long chain that can lead to multiple reactive centers. This analysis demonstrates that the atom-wise contribution values can capture the statistical and chemical significance of the weakest bond in determining CNs.

Figure5.

(a) Five examples illustrating the consistency between the stability of radicals from breaking the weakest bond and the sign of atom-wise contribution values. (b) The scatter plot depicting the correlation between the sum of SHAP contribution values of two atoms involved in the weakest bond and the cetane number normalized by the number of atoms.

In summary, the SHAP values matched well with BDEs and our chemical knowledge in most cases, indicating that GNNs can provide not only accurate CN predictions but also chemically meaningful atom-wise SHAP contributions. This enables researchers to save time and resources otherwise required for computing BDEs and investigating ignition reaction mechanisms via computationally expensive quantum-mechanical methods. Furthermore, one can rapidly and reliably identify the active atoms primarily involved in combustion reactions.

Application 2: Linear solvation free energy relationships based on SHAP-derived descriptors

Linear free energy relationship (LFER)⁴⁵ and linear solvation energy relationship (LSER)⁴⁶ are used as one of the fundamental tools for understanding organic reactions and catalysis. As an extension of LSER, a previous work summarized in Fig. 6(a) explored the correlation between GNN-predicted ΔG_solv and the reaction rate constant(k).³⁶ These correlations are based on the Hammond postulate; if the solvation destabilizes the reactant and stabilizes the product, the corresponding transition state could resemble the reactant more than that without solvents. Such solvents can decrease activation energy, and thus increase k. This hypothesis was verified for 10 organic reactions, with a strong correlation between ΔG_solv and ln k (Absolute value of Pearson correlation coefficient higher than 0.8). However, for some reactions, ΔG_solv and k exhibit a low Pearson correlation coefficient, indicating that ΔG_solv may not serve as the optimal descriptor for reactivity prediction. To overcome this challenge, we propose utilizing atom-wise SHAP contribution values as an alternative descriptor for reaction rate prediction. By leveraging SHAP-based atom contributions, we provide a chemically meaningful interpretation of molecular reactivity while avoiding the limitations of ΔG_solv.

Fig. 6(b) illustrates the SHAP-derived atom-wise contribution descriptors devised for further identification of LSERs. Each atom in a solute molecule has its atom-wise contributions, and maximum/minimum contribution values were obtained among atoms. We assumed that these values can be relevant to electrophilicity and nucleophilicity, or positive/negative atomic charges. The standard deviation of contribution values was also considered, since it can be related to polarizability of a solute molecule in a given solvent.

Figure6.

Summary of (a) a previous study regarding the relationship between ΔG_solv and reaction rates in different solvents, and (b) this work utilizing atom-wise SHAP values as descriptors to elucidate linear solvation energy relationships.

Fig. 7 shows the results of applying ΔG_solv and SHAP-derived descriptors to elucidating LSERs for three reaction examples: Diels-Alder, epoxidation, and [2+2] cycloaddition reaction. Their experimental reaction rates in different solvents were brought from the literature.⁴⁷ These reactions exhibited bad correlations between ΔG_solv and reaction rates in different solvents, with Pearson correlation coefficients (ρ) of 0.36, 0.11, and -0.18, respectively, necessitating alternative descriptors for LSER. We examined three types of SHAP-derived descriptors depicted in Fig. 6(b), and chose one that shows the best correlation. For Diels-Alder and [2+2] cycloaddition reaction (Reactions 1 and 3), standard deviation of reactant atom’s contributions normalized by reactant’s ΔG_solv gives rise to a strong correlation with reaction rates, with Pearson ρ of 0.75 and 0.90, respectively. The maximum of reactant’s contribution values leads to the Pearson ρ of -0.92 with the reaction rates of epoxidation reaction (Reaction 2).

Figure7.

Three reactions where experimental reaction rates in different solvents show weak correlations (i.e., low Pearson ρ) with ΔG_solv but strong correlations (i.e., high Pearson ρ) with SHAP-derived descriptors.

These results demonstrate that ΔG_solv is not always the best descriptor for LSERs and SHAP-derived descriptors can be instrumental in some reactions. Further interpretability was gained compared to our previous study³⁶ where atom-wise contributions to ΔG_solv were analyzed for only one drug-like molecule in three solvents, and SHAP-derived descriptors were never utilized for elucidating LSERs.

The chemical feasibility of SHAP-derived descriptors should be further examined by appropriately associating the atom-wise contribution values with chemical knowledge. In this regard, atom-wise contribution values were analyzed for the reactants of Reactions 1–3 in two selected solvents that lead to low and high reaction rates (Fig. 8). Methanol and water solvent show relatively low and high reaction rates (ln k_rel of 2.54 and 6.61, respectively) in the Diels-Alder reaction between cyclopentadiene and methyl vinyl ketone (Reaction 1). Both reactants in water have more positive atom-wise contribution values (i.e., blue colors in Fig. 8) than those in methanol, indicating larger reactant destabilization in water. Such results imply reactants’ hydrophobic interactions with water that result in acceleration of Diels-Alder reactions, which is consistent with the previous study discussing the influence of hydrophobicity on Diels-Alder reactions in water.⁴⁸ Such trends are also well reflected in the positive correlation between the standard deviation of atom-wise SHAP values and reaction rates (Fig. 7); a higher standard deviation is relevant to higher polarizability, and thus faster reaction.

Figure8.

Visualization of atom-wise contribution values of reactant molecules of three reactions in the two selected solvents where low and high reaction rates were observed.

Meanwhile, one can pursue increased rates of Reaction 2 (epoxidation) by stabilizing the active peroxycarboxylic acid (PCBA) through solvation. PCBA in dibutyl ether is relatively more destabilized than that in dichloromethane. Moreover, our results indicate that dibutyl ether induces unnecessary destabilization in the phenyl ring, while stabilizing the terminal -OH group that participates in the epoxidation reaction. The dichloromethane solvent results in stabilizing atom-wise contributions evenly for all atoms, whereas the -OH group shows the least negative contribution, leading to the activation of PCBA. This explanation is in line with a higher reaction rate in dichloromethane solvent (ln k_rel=4.06) compared to dibutyl ether (ln k_rel=0.0), and the negative linear relationship between maximum SHAP values and reaction rates (Fig. 7). As for the [2+2] cycloaddition (Reaction 3), inducing higher polarization to the alkene reactant may lead to faster reactions, which are consistent with the atom-wise SHAP values shown in Fig. 8. Acetonitrile exhibits alternating destabilization and stabilization contributions for two alkene carbons, whereas cyclohexane only stabilizes these carbons. These results can be correlated with the higher reaction rate in acetonitrile than in cyclohexane.

All the above results indicate that atom-wise SHAP values can be one of the good alternatives of atomic charges or electrophilicity/nucleophilicity which are typically obtained from expensive quantum-mechanical calculations. One can obtain electrophilicity, nucleophilicity information and its relationship with solubility and thus reactivity prior to computationally demanding QM calculations. The chemical interpretation of SHAP values is also consistent with literature and chemical knowledge, suggesting broad applicability of our modified SHAP approaches to chemical explanation.

CONCLUSION▼

In this study, we developed a modified SHAP strategy tailored for GNNs to enable atom-wise interpretation of molecular property predictions. By applying this method to predicting CNs and ΔGₛₒₗᵥ, we demonstrated that the derived atom-wise contribution values offer chemically meaningful explanations. For CN predictions, the SHAP-derived descriptors were consistent with key aspects of ignition chemistry, such as the cleavage of the weakest bond and radical stability. For ΔGₛₒₗᵥ predictions, the atom-wise contributions provided insights into solvent effects on reactivity, capturing subtle interactions not readily identified by conventional linear solvation free energy relationships.

Overall, our approach amplifies the interpretability of GNN-based predictions without requiring expensive quantum-mechanical calculations, offering a practical and insightful framework for data-driven molecular design and chemical understanding. This approach has the potential for broader applications to other GNN-based molecular property predictors, providing chemically meaningful explanations that can enhance insights in ML-driven chemical research. However, it is important to acknowledge its limitations in capturing properties that are inherently bond-centered. Some properties arise from interactions between pairs or groups of atoms and cannot be fully decomposed into atom-wise contributions. Such ‘edge-centric’ properties include bond dissociation energies, torsional barriers, ring strain, and charge transfer behavior in π-conjugated systems. The current framework can potentially be extended or combined with other explainable GNN methodologies to incorporate edgecentric or relational representations, enabling more chemically faithful explanations for such interaction-dependent phenomena.

Notes▼

[1] Supported by Supporting Information. All results in this manuscript are reproducible by using the Python source codes, Jupyter notebooks, and datasets in comma-separated variable (CSV) format available via GitHub repository: https://github.com/Yeonjoon-Kim/GNN_atomshap.

Acknowledgements▼

This work was supported by a Research Grant of Pukyong National University (2023).

References▼

M. I. Jordan T. M Mitchell Science2015349255 [CrossRef]

N. Ghaffar Nia E. Kaplanoglu A Nasab Discov. Artif. Intell.202335 [CrossRef]

S.-S. M. Ajibade F. V. Bekun F. F. Adedoyin B. A. Gyamfi A. O Adediran Clean Technol.20235497 [CrossRef]

S. Kuhn B. Egert S. Neumann C. Steinbeck BMC Bioinform.20089400 [CrossRef]

L. Chen H. Tran R. Batra C. Kim R. Ramprasad Comput. Mater. Sci.2019170109155 [CrossRef]

D. Seddon E. A. Müller J. T. Cabral J. Colloid Interface Sci.2022625328 [CrossRef]

T. Toyao K. Suzuki S. Kikuchi S. Takakusagi K.-i. Shimizu I. Takigawa J. Phys. Chem. C20181228315

Y. Chung F. H. Vermeire H. Wu P. J. Walker M. H. Abraham W. H. Green J. Chem. Inf. Model.202262433 [CrossRef]

G. Idakwo L. Joseph C. Minjun H. Huixiao Z. Zhaoxian G. Ping C. Zhang J. Environ. Sci. Health C201836169 [CrossRef]

10.

D. A. Saldana L. Starck P. Mougin B. Rousseau N. Ferrando B. Creton Energy Fuels2012262416 [CrossRef]

11.

R. Rodríguez-Pérez J. Bajorath J. Med. Chem.2020638761 [CrossRef]

12.

G. Chen Z. Shen A. Iyer U. F. Ghumman S. Tang J. Bi W. Chen Y. Li Polymers202012163 [CrossRef]

13.

R. Batra L. Song R. Ramprasad Nat. Rev. Mater.20216655

14.

T. Toyao Z. Maeno S. Takakusagi T. Kamachi I. Takigawa K.-I. Shimizu ACS Catal.2020102260 [CrossRef]

15.

T.-S. Vu M.-Q. Ha D.-N. Nguyen V.-C. Nguyen Y. Abe T. Tran H. Tran H. Kino T. Miyake K. Tsuda H-C. Dam Npj Comput. Mater.20239215 [CrossRef]

16.

K. V. Chuang M. Keiser ACS Chem. Biol.2018132819 [CrossRef]

17.

V. N. Nguyen W. Tarełko P. Sharma A. S. El-Shafay W.-H. Chen P. Q. P. Nguyen X. P. Nguyen A. T. Hoang Energy Fuels2024381692 [CrossRef]

18.

R. Rodríguez-Pérez J. Bajorath J. Med. Chem.20216417744 [CrossRef]

19.

R. Guidotti A. Monreale S. Ruggieri F. Turini F. Giannotti D. Pedreschi ACM Comput. Surv.201951Article 93

20.

S. W. Benson J. H. Buss J. Chem. Phys.195829546 [CrossRef]

21.

K. G. Joback R. C. Reid Chem. Eng. Commun.198757233 [CrossRef]

22.

S. M. Lundberg S.-I. Lee Advances in Neural Information Processing Systems201730

23.

H. Chen S. Lundberg S.-I. Lee A. Shaban-Nejad M. Michalowski D. L. Buckeridge Explainable AI in Healthcare and Medicine: Building a Culture of Transparency and AccountabilitySpringer International Publishing2021261

24.

S. M. Lundberg G. Erion H. Chen A. DeGrave J. M. Prutkin B. Nair R. Katz J. Himmelfarb N. Bansal S.-I. Lee 2019arXiv preprint. arXiv:1905.04610

25.

S. M. Lundberg G. Erion H. Chen A. DeGrave J. M. Prutkin B. Nair R. Katz J. Himmelfarb N. Bansal S.-I. Lee Nat. Mach. Intell.2020256

26.

A. Mastropietro G. Pasculli C. Feldmann R. Rodríguez-Pérez J. Bajorath iScience202225105043 [CrossRef]

27.

H. Chereda A. Leha T. Beißbarth Artif. Intell. Med.2024151102840 [CrossRef]

28.

Y. Wang M. Huang H. Deng W. Li Z. Wu Y. Tang G. Liu Brief. Bioinform.202324bbac577

29.

A. Duval F. D. Malliaros N. Oliver F. Pérez-Cruz S. Kramer J. Read J. A. Lozano Machine Learning and Knowledge Discovery in Databases. Research Track, Cham, 2021Springer International Publishing2021302

30.

H. Yuan H. Yu J. Wang K. Li S. Ji On Explainability of Graph Neural Networks via Subgraph ExplorationsProceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research2021

31.

A. Kotobi K. Singh D. Höche S. Bari R. H. Meißner A. Bande J. Am. Chem. Soc.202314522584 [CrossRef]

32.

G. P. Wellawatte H. A. Gandhi A. Seshadri A. D. White J. Chem. Theory Comput.2023192149 [CrossRef]

33.

Z. Ying D. Bourgeois J. You M. Zitnik J. Leskovec Advances in neural information processing systems201932

34.

J. Rao S. Zheng Y. Lu Y. Yang Patterns20223100628 [CrossRef]

35.

Y. Kim J. Cho N. Naser S. Kumar K. Jeong R. L. McCormick P. C. St. John S. Kim Proc. Combust. Inst.2023394969 [CrossRef]

36.

Y. Kim H. Jung S. Kumar R. S. Paton S. Kim Chem. Sci.20241592310.1039/D3SC03468B. [CrossRef]

37.

S. Riniker G. A. Landrum J. Chem. Inf. Model.2015552562 [CrossRef]

38.

M. Abadi P. Barham J. Chen Z. Chen A. Davis J. Dean M. Devin S. Ghemawat G. Irving M. Isard 12th USENIX symposium on operating systems design and implementation (OSDI 16)2016265

39.

A. Gulli S. Pal Deep learning with KerasPackt Publishing Ltd2017

40.

P. St John NFP (Neural Fingerprint) 0.3.0, National Renewable Energy Lab (NREL)Golden, CO, United States2019https://github.com/NREL/nfp

41.

P. C. St. John Y. Guan Y. Kim S. Kim R. S. Paton Nat. Commun.2020112328 [CrossRef]

42.

S. V. S. Sowndarya Y. Kim S. Kim P. C. St. John R. S. Paton 202321900 [CrossRef]

43.

Y. Kim B. D. Etz G. M. Fioroni C. K. Hays P. C. St. John R. A. Messerly S. Vyas B. P. Beekley F. Guo C. S. McEnally et al. Proc. Combust. Inst.2021381143 [CrossRef]

44.

J. Cho Y. Kim B. D. Etz G. M. Fioroni N. Naser J. Zhu Z. Xiang C. Hays J. V. Alegre-Requena P. C. St. John et al. Sustain. Energy Fuels202263975 [CrossRef]

45.

P. R. Wells Chem. Rev.196363171 [CrossRef]

46.

R. W. Taft J.-L. M. Abboud M. J. Kamlet M. H. Abraham J. Solution Chem.198514153 [CrossRef]

47.

C. Reichardt T. Welton Solvents and Solvent Effects in Organic ChemistryJohn Wiley & Sons2010

48.

A. Meijer S. Otto J. B. F. N. Engberts J. Org. Chem.1998638989 [CrossRef]

Journal Information

Article Information

Investigating Chemical Interpretability in Graph Neural Networks via Atom-wise Shapley Additive Explanations

Abstract