MolCloze: A Unified Cloze-style Self-supervised Molecular Structure Learning Model for Chemical Property Prediction | IEEE Conference Publication | IEEE Xplore

Abstract:

Machine learning approaches in drug discovery, computational biology, and cheminformatics are required to predict accurately on test samples that are distributionally different from the training samples. However, (i) labeled task-specific molecule data are often scarce, and (ii) models generalize poorly to test molecules that are structurally different from those seen during training. To alleviate these problems, we propose a cloze-style self-supervised learning model (MolCloze) to obtain universal, informative representations for molecular property prediction tasks. With carefully designed self-supervised tasks that unify the generative and discriminative paradigms, MolCloze can learn rich structural and semantic information about molecules from large amounts of unlabeled molecular data. To capture such complex information, we design two novel strategies: Structural Fingerprint Tokenization (SFT) for better tokenization of molecule graphs, and Normalized Graph Raw Shortcut-connection (NGRS) for better latent representations through training a deeper model. We pre-train the MolCloze model via three tasks: Unordered Masked Language Modeling (UMLM), Replaced Masked Token Detection (RMTD), and Contrastive Energy-based Unmasked Token Clozing (CE-UTC). We then transfer the pre-trained model to a broad range of downstream molecular property prediction tasks via minor architecture modifications. Extensive experiments demonstrate the generalizability of MolCloze in predicting a broad range of chemical properties related to drug discovery. We also observe a significant performance boost on different downstream molecular property prediction datasets, with MolCloze outperforming state-of-the-art baseline approaches and previous pre-training techniques developed for molecule data.
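
The general idea behind cloze-style pre-training with masked and replaced tokens can be illustrated with a minimal sketch. This is not the MolCloze implementation: the toy token list, the `[MASK]` symbol, the 15% masking rate, and the function names `cloze_mask`, `replace_masked`, and `replaced_labels` are all assumptions made for illustration, and a uniform random sampler stands in for whatever generator a real model would use. A generative objective would predict the original token at each masked position, while a discriminative objective would classify each position as original or replaced.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def cloze_mask(tokens, mask_prob=0.15, seed=0):
    """Mask each token independently with probability mask_prob.

    Returns the masked sequence and a dict mapping each masked
    position to the original token (the generative targets).
    """
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets


def replace_masked(masked, vocab, seed=0):
    """Fill masked positions with random vocabulary tokens.

    A uniform sampler is an assumed stand-in for a learned generator.
    """
    rng = random.Random(seed)
    return [rng.choice(vocab) if t == MASK else t for t in masked]


def replaced_labels(original, corrupted):
    """Discriminative targets: 1 where the token differs from the original."""
    return [int(a != b) for a, b in zip(original, corrupted)]


# Toy substructure tokens (hypothetical; not the paper's SFT vocabulary).
tokens = ["C", "C", "O", "N", "c1ccccc1", "C(=O)", "O", "C"]
vocab = sorted(set(tokens))

masked, targets = cloze_mask(tokens, mask_prob=0.5, seed=1)
corrupted = replace_masked(masked, vocab, seed=2)
labels = replaced_labels(tokens, corrupted)
```

Note that a sampled replacement can coincide with the original token, in which case that position is labeled 0 (original) in this sketch.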
Date of Conference: 09-12 December 2021
Date Added to IEEE Xplore: 14 January 2022
Conference Location: Houston, TX, USA
