Chemical Compound Classification

Neural classification of chemical compounds according to their mutagenic effect on bacteria using graph2vec.

Introduction

This article goes over our findings for classifying chemical compounds based on their mutagenic effect on a particular bacterium.

This effort was motivated by a desire to verify the validity of the results from graph2vec. Rather than directly reproducing the results by using an SVM classifier, we chose to use a feedforward neural network.

We use the MUTAG dataset. It can be found here.

We were able to achieve similar results; an explanation is given in the results summary later in this article.

The source code for our experiments can be found here.

Experiment Overview

The MUTAG dataset contains 188 chemical compounds that are labelled according to their mutagenic effect on a bacterium. The dataset represents each chemical compound as a graph where nodes are atoms and edges are chemical bonds.

Here's an example compound from the dataset:

The graph2vec paper split the data to use 90% for training and 10% for testing. We similarly used 10% for testing, 10% for validation, and 80% for training.
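For reference, here is a minimal sketch of such an 80/10/10 split with scikit-learn (the stratification and random seed are assumptions, not necessarily what we did):

```python
from sklearn.model_selection import train_test_split

# embeddings: one graph2vec vector per compound; labels: 0/1 mutagenicity.
# Hold out 10% for testing, then 1/9 of the remaining 90% (i.e. 10% overall) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    embeddings, labels, test_size=0.1, stratify=labels, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1 / 9, stratify=y_rest, random_state=0)
```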

Here's a depiction of our model.

Dropout was applied after every dense layer.

The sigmoid output was rounded to either zero or one to determine the final prediction, meaning we used a classification threshold of 0.5. We did not optimize this threshold.

We used binary cross entropy as our loss function and the Adam optimizer. No rounding was applied during training; the loss was computed directly on the sigmoid output.
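Putting the above together, here is a minimal Keras sketch of this kind of classifier (the layer sizes, dropout rate, ReLU activations, batch size, and epoch count are placeholders; our actual values came out of the hyperparameter search described below):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(input_dim, num_dense_layers=2, units=64, dropout_rate=0.3):
    """Feedforward classifier over fixed graph2vec embedding vectors."""
    model = keras.Sequential([keras.Input(shape=(input_dim,))])
    for _ in range(num_dense_layers):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout_rate))       # dropout after every dense layer
    model.add(layers.Dense(1, activation="sigmoid"))  # probability of mutagenicity
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_classifier(input_dim=X_train.shape[1])
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=200, batch_size=16)

# The final label is the rounded sigmoid output, i.e. a fixed 0.5 threshold.
y_pred = (model.predict(X_test) >= 0.5).astype(int).ravel()
```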

The hyperparameters for our model were:

We first trained our graph2vec embeddings and then trained the model itself independently (with the embedding vectors frozen).
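As an illustration of the embedding step, here is a sketch using the karateclub implementation of graph2vec (the library choice and settings are assumptions; any graph2vec implementation that yields one vector per compound fits the same pipeline):

```python
from karateclub import Graph2Vec

# graphs: a list of 188 networkx graphs (atoms as nodes, bonds as edges), with nodes
# relabelled to consecutive integers as karateclub expects.
g2v = Graph2Vec(dimensions=128, wl_iterations=2, epochs=100)  # placeholder settings
g2v.fit(graphs)
embeddings = g2v.get_embedding()  # one fixed vector per compound, reused by the classifier
```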

We used early stopping when training the neural classifier but a fixed number of epochs for training our graph2vec embeddings.
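In Keras terms, the early stopping looks roughly like this (the monitored metric and patience value are assumptions), reusing the model and data from the earlier sketch:

```python
from tensorflow import keras

# Stop once validation loss stops improving and keep the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True)

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=1000, batch_size=16, callbacks=[early_stop])
```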

Included in the dense layer count selection was the option to use zero dense layers, which turns our model into a logistic regression model.
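In the classifier sketch above, that case corresponds to:

```python
# Zero dense layers leaves only the final Dense(1, activation="sigmoid") on top of the
# embedding, i.e. logistic regression: p(mutagenic | x) = sigmoid(w . x + b).
logreg = build_classifier(input_dim=X_train.shape[1], num_dense_layers=0)
```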

We tuned our hyperparameters using two methods: random search and a tree-structured Parzen estimator with successive halving.
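One way to run both searches is with Optuna, which provides random and tree-structured Parzen estimator samplers along with a successive-halving pruner. The sketch below uses an illustrative search space rather than our exact setup, and reuses build_classifier and the early-stopping callback from the earlier sketches:

```python
import optuna

def objective(trial):
    # Illustrative search space only.
    params = {
        "num_dense_layers": trial.suggest_int("num_dense_layers", 0, 3),
        "units": trial.suggest_categorical("units", [32, 64, 128]),
        "dropout_rate": trial.suggest_float("dropout_rate", 0.0, 0.5),
    }
    model = build_classifier(input_dim=X_train.shape[1], **params)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=200, batch_size=16, verbose=0, callbacks=[early_stop])
    return model.evaluate(X_val, y_val, verbose=0)[0]  # validation loss

# TPE sampler with successive halving; swap in optuna.samplers.RandomSampler()
# for the random-search runs.
study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.SuccessiveHalvingPruner())
study.optimize(objective, n_trials=100)
```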

Experiment Results

We used the same percentage of the dataset for testing as the graph2vec paper. Whether this small test set size was problematic (e.g. whether it might lead to overfitting, especially given the small size of the dataset as a whole) was not explored, as our goal was merely to produce results similar to the paper's.

We ran trials of hyperparameter search.

Here are our best overall results by validation accuracy:

Here are our best overall results by validation loss:

Note that some of the models above have zero dense layers, i.e. they are logistic regression models.

The graph2vec paper reported 83.15% accuracy (with a standard deviation of 9.25%) on the MUTAG dataset. Given that we were able to achieve 89.47% test accuracy, it seems that the results from the graph2vec paper on the MUTAG dataset are reproducible with feed-forward neural network and logistic regression classifiers in addition to the SVM classifier used in the paper.

One flaw in our experiment was that our validation set was the same small size as the test set. Some models achieved high validation accuracies yet scored considerably lower on the test set. Thus, validation accuracy wasn't a great indicator of whether a model was sufficiently accurate.

Validation loss served as a better metric, as the models with the best validation losses tended to have better test losses. However, there was still a large amount of variance in the test and validation accuracies of the models with the best validation losses, so validation loss wasn't an ideal metric either.

If we had split the data so that the validation set was larger, there might have been less variance in test loss among the models with the best validation accuracies and among those with the best validation losses, since a more representative validation set would help prevent overfitting and likely yield better test results. This is something we'd keep in mind if we were to conduct a similar experiment again.

Conclusion

It seems that we were able to reproduce the results from graph2vec with both a feed-forward neural network architecture and logistic regression models (without tuning the classification threshold, which we hard-coded at 0.5).

Though the paper's results on the MUTAG dataset were reproduced here, it's noteworthy that, as stated in the paper, graph2vec was not aimed at such small datasets. Methods designed specifically for small datasets might more robustly handle the problems that come with them, e.g. picking a sufficiently large and representative validation or test set.

Interested in my work?

See my projects on GitHub.

Want to learn more about me?

Visit my website.