17.7 Using the Pg2vec Algorithm

Pg2vec learns representations of graphlets (partitions inside a graph) by using edges, rather than vertices, as the principal learning units. Each learning unit therefore packs in more information for the representation learning task than a single vertex would.

It consists of three main steps:

  1. Random walks for each vertex (with pre-defined length per walk and pre-defined number of walks per vertex) are generated.
  2. Each edge in a random walk is mapped to a property.edge-word in the created document (with the graph-id as the document label), where the property.edge-word is defined as the concatenation of the properties of the source and destination vertices.
  3. The generated documents (with their attached document labels) are fed to a Doc2Vec algorithm which generates the vector representation for each document, which is a graph in this case.
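
The first two steps can be illustrated with a minimal, library-free Python sketch. It is not the PGX implementation: the function name build_documents, its parameters, the "_" separator used in each edge-word, the undirected treatment of edges, and the toy graph are all assumptions made for illustration. The resulting documents are what step 3 would pass to a Doc2Vec trainer.

```python
import random
from collections import defaultdict


def build_documents(edges, vertex_props, graphlet_of,
                    walk_length=3, walks_per_vertex=2, seed=0):
    """Return {graphlet_id: [edge-word, ...]} built from random walks.

    All names here are hypothetical; this only mirrors steps 1 and 2.
    """
    rng = random.Random(seed)

    # Assumption: edges are treated as undirected for the walks.
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
        adjacency[dst].append(src)

    documents = defaultdict(list)
    for start in sorted(vertex_props):
        for _ in range(walks_per_vertex):          # step 1: walks per vertex
            current = start
            for _ in range(walk_length):           # step 1: edges per walk
                neighbors = adjacency[current]
                if not neighbors:
                    break                          # isolated vertex: stop the walk
                nxt = rng.choice(neighbors)
                # Step 2: edge-word = source property + destination property.
                word = f"{vertex_props[current]}_{vertex_props[nxt]}"
                documents[graphlet_of[start]].append(word)
                current = nxt
    return dict(documents)


# Toy input: two graphlets, "g1" (a 3-vertex path) and "g2" (a single edge).
edges = [(0, 1), (1, 2), (3, 4)]
vertex_props = {0: "C", 1: "N", 2: "O", 3: "C", 4: "C"}
graphlet_of = {0: "g1", 1: "g1", 2: "g1", 3: "g2", 4: "g2"}

docs = build_documents(edges, vertex_props, graphlet_of)
for graphlet_id, words in sorted(docs.items()):
    print(graphlet_id, len(words), words[:4])
```

Each graphlet ends up with one document labeled by its graph-id, so the embedding learned for a document label is the embedding of the corresponding graphlet.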

Pg2vec creates graphlet embeddings for a specific set of graphlets and cannot be updated to incorporate modifications to these graphlets. Instead, a new Pg2vec model should be trained on the modified graphlets.

The memory consumption of the Pg2vec model is:
O(2(n+m)*d)
where:
  • n is the number of vertices in the graph
  • m is the number of graphlets in the graph
  • d is the embedding length
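
As a rough worked example, the expression inside the O(...) bound can be evaluated directly. The sketch below is only an order-of-magnitude estimate: the function names, the sample values for n, m, and d, and the 4 bytes per stored value are all assumptions, and the actual footprint of a trained model depends on the implementation.

```python
def pg2vec_memory_values(num_vertices, num_graphlets, embedding_length):
    """Evaluate 2 * (n + m) * d, the term inside the O(...) bound."""
    return 2 * (num_vertices + num_graphlets) * embedding_length


def pg2vec_memory_bytes(num_vertices, num_graphlets, embedding_length,
                        bytes_per_value=4):
    """Approximate size in bytes; bytes_per_value=4 is an assumption."""
    values = pg2vec_memory_values(num_vertices, num_graphlets, embedding_length)
    return values * bytes_per_value


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
n, m, d = 1000, 10, 200
print(pg2vec_memory_values(n, m, d))  # 2 * (1000 + 10) * 200 = 404000
print(pg2vec_memory_bytes(n, m, d))   # 404000 * 4 = 1616000
```

The estimate grows linearly in each of n, m, and d, so doubling the embedding length doubles it.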

The following describes a few use cases where the Pg2vec algorithm can be applied:

  • Chemical Compound Analysis: To represent chemical compounds as graphs and use Pg2vec to find similarities between compounds. This helps in drug discovery and chemical research.
  • Document Classification: To represent documents as graphs of words or phrases (for example, using dependency parsing in NLP) and classify them into topics or genres based on their embeddings.
  • Network Comparison: To compare different social or biological networks by generating embeddings for entire graphs. This can be used to study the evolution of networks over time or to compare different species’ protein interaction networks in biology.

The following describes how to use the main functionalities of the Pg2vec implementation in PGX, using the NCI109 dataset (which contains 4127 graphs) as an example: