Social Network Link Prediction via node2vec
In this post, we'll go over how we used node2vec to predict which users would be friends in a social network.
We used the Facebook dataset from SNAP.
The source code for our findings can be found here.
The Facebook dataset from SNAP provides an anonymized subgraph of Facebook's social network.
The subgraph has 4,039 users and 88,234 friendships between the users.
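If you want to follow along, the dataset is a plain edge list that loads directly with networkx. A minimal sketch, assuming the file name as distributed by SNAP:

```python
import networkx as nx

# Load the SNAP Facebook edge list (undirected friendship graph).
# facebook_combined.txt is the combined ego-network file from
# https://snap.stanford.edu/data/ego-Facebook.html
G = nx.read_edgelist("facebook_combined.txt", nodetype=int)

print(G.number_of_nodes())  # 4039
print(G.number_of_edges())  # 88234
```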
We use node2vec to generate vector representations of the users.
To predict whether or not two users are friends, we take the Hadamard product of their node2vec representations and put it through a logistic regression classifier.
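In code, the pairwise feature is just an elementwise product. Here's a minimal sketch, assuming `emb` maps each user id to its node2vec vector (the function name is ours for illustration):

```python
import numpy as np

def edge_features(emb, u, v):
    """Hadamard (elementwise) product of two node embeddings."""
    return emb[u] * emb[v]

# Toy example with 4-dimensional embeddings; the classifier then models
# P(u and v are friends) = sigmoid(w . edge_features(emb, u, v) + b).
emb = {0: np.array([0.1, -0.2, 0.3, 0.0]), 1: np.array([0.4, 0.1, -0.3, 0.2])}
print(edge_features(emb, 0, 1))  # [ 0.04 -0.02 -0.09  0.  ]
```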
The training process has two distinct parts: embedding the users via node2vec, and training our logistic regression classifier.
To generate the data for both parts of training, we use a process similar to what's described in the node2vec paper. We first take the entire graph from the dataset and sample several pairs of users who are not friends; these will be our negative edges. The number of negative edges we sample is equal to half the number of edges in the original graph. We then remove half of the edges from the original graph (while ensuring that the graph remains connected); these removed edges will be our positive edges. Our graph now has half as many edges as it originally did.
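Here's a rough sketch of this sampling step using networkx. The function name and structure are ours for illustration; the real implementation lives in the linked source:

```python
import random
import networkx as nx

def split_edges(G, seed=0):
    rng = random.Random(seed)
    n_pos = G.number_of_edges() // 2

    # Negative edges: node pairs that are not friends in the original graph.
    nodes = list(G.nodes)
    negatives = set()
    while len(negatives) < n_pos:
        u, v = rng.sample(nodes, 2)
        if not G.has_edge(u, v):
            negatives.add((min(u, v), max(u, v)))

    # Positive edges: remove half of the real edges, skipping any removal
    # that would disconnect the remaining graph.
    H = G.copy()
    positives = []
    edges = list(H.edges)
    rng.shuffle(edges)
    for u, v in edges:
        if len(positives) == n_pos:
            break
        H.remove_edge(u, v)
        if nx.is_connected(H):
            positives.append((u, v))
        else:
            H.add_edge(u, v)  # removing this edge would disconnect the graph

    return H, positives, list(negatives)
```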
Training the node2vec embeddings is an unsupervised process and is done first, on the remaining graph; the sampled negative and positive edges do not come into play here. We used standard stochastic gradient descent with Hogwild!.
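As a sketch of what this stage can look like, here's the community `node2vec` package, which generates the random walks and trains the embeddings through gensim's Word2Vec (gensim's multi-worker training is the lock-free, Hogwild!-style asynchronous SGD mentioned above). The parameter values below are placeholders, not our tuned settings:

```python
from node2vec import Node2Vec

# H is the remaining graph (half of the original edges removed).
n2v = Node2Vec(
    H,
    dimensions=128,  # embedding size (placeholder, tuned later)
    walk_length=80,  # nodes visited per random walk
    num_walks=10,    # walks started from each node
    p=1.0,           # return parameter from the node2vec paper
    q=1.0,           # in-out parameter from the node2vec paper
    workers=4,       # parallel workers; gensim trains with Hogwild!-style SGD
)
model = n2v.fit(window=10, min_count=1)  # gensim Word2Vec under the hood
emb = {int(node): model.wv[str(node)] for node in H.nodes}
```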
We then train our logistic regression classifier on the sampled negative and positive edges, using 30% of them for training, 10% for validation, and 60% for testing. The node2vec embeddings are frozen during this stage. We used the Adam optimizer with binary cross-entropy loss and early stopping on the validation loss.
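Sketched in PyTorch, the classifier stage looks roughly like the following. Variable names such as `train_pairs` and `val_labels` stand in for the split described above, and `emb` is the frozen embedding lookup from the previous stage:

```python
import torch
import torch.nn as nn

d = 128  # must match the node2vec embedding dimension
clf = nn.Linear(d, 1)  # logistic regression: a linear layer, sigmoid in the loss
opt = torch.optim.Adam(clf.parameters())
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy on the logits

def hadamard(pairs):
    # Build features from the frozen embeddings; no gradient reaches `emb`.
    return torch.stack([torch.as_tensor(emb[u] * emb[v]) for u, v in pairs])

X_train = hadamard(train_pairs)  # 30% of the sampled edges
y_train = torch.tensor(train_labels, dtype=torch.float32)
X_val = hadamard(val_pairs)      # 10% of the sampled edges
y_val = torch.tensor(val_labels, dtype=torch.float32)

best_val = float("inf")
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(clf(X_train).squeeze(1), y_train)
    loss.backward()
    opt.step()
    with torch.no_grad():
        val = loss_fn(clf(X_val).squeeze(1), y_val).item()
    if val > best_val:  # early stopping (patience of one epoch, for brevity)
        break
    best_val = val
```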
For hyperparameter tuning, we use a tree-structured Parzen estimator with successive halving. Each trial selects hyperparameters for both the node2vec embedding process and the logistic regression training process, and we ran several such trials.
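A search loop of this kind can be sketched with Optuna, which provides both a TPE sampler and a successive-halving pruner. We're not claiming this exact library or search space; `train_and_evaluate` is a hypothetical helper standing in for the two-stage pipeline above:

```python
import optuna

def objective(trial):
    # One trial draws hyperparameters for both training stages at once.
    p = trial.suggest_float("p", 0.25, 4.0, log=True)     # node2vec return parameter
    q = trial.suggest_float("q", 0.25, 4.0, log=True)     # node2vec in-out parameter
    dim = trial.suggest_categorical("dimensions", [64, 128, 256])
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)  # Adam learning rate

    # Hypothetical helper: embeds with node2vec, trains the classifier,
    # and returns the validation loss.
    return train_and_evaluate(p=p, q=q, dim=dim, lr=lr)

study = optuna.create_study(
    direction="minimize",  # rank trials by validation loss
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.SuccessiveHalvingPruner(),
)
study.optimize(objective, n_trials=100)
print(study.best_params)
```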
Here are the best results from our hyperparameter search ranked by validation loss.
Note that the total training time for each trial was measured while other trials ran concurrently and asynchronously, so it may not reflect how long a trial would take in isolation.
We were able to achieve test accuracies of over 92% while using only 40% of the sampled edges for training and validation combined. Preliminary experiments have shown that we can achieve higher test accuracies with a larger training or validation set.
The method we explored here seems rather effective, given how small a portion of the data we used for training and how accurate the resulting predictions are.
An intuitive explanation for why this method works so well is that a friend of a friend is likely a friend; in other words, social circles often overlap. The random walks node2vec performs across the social network thus frequently stumble upon a user's friends of friends, so users with overlapping circles end up with similar embeddings, and pairs of such users score highly under the classifier.
Hopefully, this was a useful read. Please feel free to reach out if you have any questions or suggestions.