Large-Scale Knowledge Graph Completion on Graphcore Intelligence Processing Units (IPUs)
The Open Graph Benchmark (OGB) Large-Scale Challenge (LSC) held at NeurIPS 2022 featured a Knowledge Graph (KG) track designed to push the boundaries of graph machine learning on large-scale, realistic datasets. The goal of this track was knowledge graph completion: predicting missing links (relations) between entities in large KGs.
About the OGB-LSC Knowledge Graph Competition
The competition challenged participants to develop models that could accurately predict missing relations in a large, real-world KG, reflecting practical challenges such as scale, complexity, and diversity of relation types. The WikiKG90Mv2 dataset used for the competition contains over 90 million entities, making scalability and efficiency critical. Evaluation uses Mean Reciprocal Rank (MRR) on a held-out test set (with Hits@K as a common auxiliary metric), measuring a model's ability to rank the correct tail entity for a given (head, relation) query.
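As a concrete illustration of these metrics, here is a minimal sketch of how MRR and Hits@K are computed from the ranks a model assigns to the correct entities. The function name and toy ranks are illustrative, not taken from the competition code:

```python
def mrr_and_hits(ranks, k=10):
    """Compute Mean Reciprocal Rank and Hits@K from 1-based ranks
    of the correct entity among all candidates, one rank per query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

# Example: the correct tails are ranked 1st, 4th and 20th in three queries,
# so MRR = (1 + 1/4 + 1/20) / 3 and Hits@10 = 2/3.
print(mrr_and_hits([1, 4, 20]))
```

A rank of 1 contributes a full point to MRR, so the metric rewards models that place the correct entity at the very top, while Hits@K only asks whether it appears anywhere in the top K.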
Key Components of the Winning Submission at NeurIPS 2022
The winning submission at NeurIPS 2022 introduced several key innovations and techniques that contributed to state-of-the-art performance:
- Advanced Knowledge Graph Embedding Techniques: The winning team employed knowledge graph embedding (KGE) methods capable of capturing complex relational patterns, combining multiple embedding models with different scoring functions to better represent entity-relation interactions.
- Scaling and Efficiency Strategies: Given the dataset’s large size, the solution focused on efficient training algorithms capable of handling millions of entities and relations without prohibitive computational costs. Techniques like model and data parallelism, mixed precision training, and memory optimization were leveraged.
- Negative Sampling and Training Strategies: Negative sampling techniques that generate challenging negative examples during training improved the model's discriminative capability, while carefully designed training routines such as curriculum learning or adaptive sampling improved convergence.
- Ensemble Methods: The winning approach combined multiple models via ensembling, improving robustness and generalization; different architectures and hyperparameter settings were ensembled to leverage their complementary strengths.
- Post-processing and Calibration: The team applied various post-processing steps such as re-ranking, calibration of prediction scores, or leveraging relation-level priors to boost final prediction accuracy.
- Feature Engineering and Auxiliary Data: Additional features or auxiliary information (when allowed) were carefully incorporated to empower the model beyond pure structural information.
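To make the embedding component above concrete, two widely used KGE scoring functions, TransE and ComplEx, can be sketched as follows. This is a generic illustration of what such models compute, not the submission's exact implementation, and the toy 4-dimensional embeddings are placeholders:

```python
import numpy as np

def transe_score(h, r, t):
    """TransE: a triple is plausible when h + r ≈ t,
    scored as the negative L2 distance."""
    return -np.linalg.norm(h + r - t, axis=-1)

def complex_score(h, r, t):
    """ComplEx: Re(<h, r, conj(t)>) over complex-valued embeddings,
    which lets the model score asymmetric relations."""
    return np.real(np.sum(h * r * np.conj(t), axis=-1))

# Toy 4-dimensional embeddings for one (head, relation, tail) triple.
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4))
print(transe_score(h, r, t))  # closer to 0 means more plausible under TransE
```

Different scoring functions capture different relational patterns (e.g. translation vs. asymmetry), which is one reason ensembles of heterogeneous KGE models tend to outperform any single architecture.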
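The negative sampling strategy mentioned above can be illustrated with the simplest variant, uniform tail corruption: replacing the tail of each observed triple with random entities to create (likely false) negatives. The function name and toy sizes are hypothetical; the actual submission used more sophisticated sampling:

```python
import numpy as np

def corrupt_tails(triples, num_entities, negatives_per_positive, rng):
    """Uniform negative sampling: for each positive (h, r, t) triple,
    build negatives by swapping the tail for random entity ids."""
    h, r, t = triples.T
    neg_t = rng.integers(0, num_entities,
                         size=(len(triples), negatives_per_positive))
    # Broadcast each head/relation pair against its sampled negative tails.
    neg = np.stack([
        np.repeat(h[:, None], negatives_per_positive, axis=1),
        np.repeat(r[:, None], negatives_per_positive, axis=1),
        neg_t,
    ], axis=-1)
    return neg  # shape: (num_positives, negatives_per_positive, 3)

rng = np.random.default_rng(42)
positives = np.array([[0, 0, 1], [2, 1, 3]])
negs = corrupt_tails(positives, num_entities=100,
                     negatives_per_positive=5, rng=rng)
print(negs.shape)  # (2, 5, 3)
```

Training then pushes positive scores above negative scores; harder negatives (e.g. entities the model currently ranks highly) make this contrast more informative than purely uniform sampling.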
The winning submission in the OGB-LSC KG competition at NeurIPS 2022 was a combination of sophisticated embedding techniques, training and sampling strategies, engineering optimizations, and ensemble methods. The team's approach demonstrated impressive results, with an ensemble of 85 KGE models achieving a validation MRR of 0.2922 and an MRR of 0.2562 on the test-challenge dataset.
Worthy of note is the use of the BESS (Balanced Entity Sampling and Sharing) approach, a distributed processing scheme for training KGE models, which guarantees that only tail embeddings have to be exchanged across workers. This helps balance communication and compute, making the solution more scalable and efficient.
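The partitioning idea behind BESS can be sketched in a single process as follows. This is a simplified illustration under the assumption that entities are randomly sharded across workers and triples are grouped by the shard holding their head; the function names are hypothetical, and the real scheme distributes the shards across IPUs with balanced all-to-all exchanges:

```python
import numpy as np

def shard_entities(num_entities, num_workers, rng):
    """Randomly assign each entity id to one worker's embedding shard."""
    return rng.integers(0, num_workers, size=num_entities)

def assign_triples(triples, shard_of):
    """Group triples by the shard that owns their head entity, so head
    embeddings are always local and only tail embeddings cross workers."""
    buckets = {}
    for h, r, t in triples:
        buckets.setdefault(shard_of[h], []).append((h, r, t))
    return buckets

rng = np.random.default_rng(0)
shard_of = shard_entities(num_entities=10, num_workers=2, rng=rng)
triples = [(0, 0, 5), (3, 1, 7), (9, 0, 2)]
buckets = assign_triples(triples, shard_of)
# Worker w trains on buckets[w]: it already holds every head embedding it
# needs, and fetches only the tail embeddings from their owning shards.
```

Because each worker only ever requests tail embeddings, the communication volume per step is predictable and can be overlapped with compute, which is the balance the scheme's name refers to.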
The paper, "BESS: Balanced Entity Sampling and Sharing for Large-Scale Knowledge Graph Completion," was published as a preprint on arXiv in 2022. The authors hope that their insights help the community in creating fast and accurate Knowledge Graph Embedding models and accelerate their adoption in real-world applications.
Graphcore submitted the winning entry to the Knowledge Graph track of OGB-LSC@NeurIPS 2022, showcasing the potential of advanced KG completion techniques in real-world applications. Each model in the ensemble was trained to a validation MRR of at least 0.2, and the final predictions were selected by ranking entities using a score derived from the ensemble.
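A minimal sketch of this selection-and-ranking step might look as follows. The 0.2 validation-MRR threshold comes from the description above; the function name, the simple score averaging, and the toy numbers are illustrative assumptions, not the submission's exact re-ranking pipeline:

```python
import numpy as np

def ensemble_rank(model_scores, model_val_mrrs, threshold=0.2, top_k=10):
    """Keep models whose validation MRR clears the threshold, average their
    per-entity scores for a query, and return the top-k entity ids."""
    kept = [s for s, mrr in zip(model_scores, model_val_mrrs) if mrr >= threshold]
    mean_score = np.mean(kept, axis=0)
    return np.argsort(-mean_score)[:top_k]

# Three toy models scoring six candidate entities for one query.
scores = np.array([
    [0.1, 0.9, 0.3, 0.2, 0.8, 0.0],
    [0.2, 0.7, 0.1, 0.4, 0.9, 0.1],
    [0.0, 0.8, 0.2, 0.3, 0.6, 0.2],
])
val_mrrs = [0.25, 0.21, 0.15]  # the third model misses the cut and is dropped
print(ensemble_rank(scores, val_mrrs, top_k=3))  # -> [4 1 3]
```

Filtering on held-out MRR before averaging keeps weak models from diluting the ensemble, at the cost of discarding some diversity.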
It is important to note that the WikiKG90Mv2 dataset used in the competition has been created with a sampling method that leads to a severe generalization gap if the sampling from the training dataset is not adjusted accordingly. This underscores the importance of careful data preprocessing and sampling strategies in achieving accurate knowledge graph completion.
In conclusion, the OGB-LSC Knowledge Graph competition at NeurIPS 2022 highlighted the need for scalable, efficient, and accurate modeling approaches to tackle large-scale knowledge graph completion. The winning submission demonstrated the power of a well-designed ensemble of KGE models, using the BESS approach, combined with advanced embedding techniques, training strategies, and engineering optimizations.