1. Introduction
The spread of misinformation, whether in the form of a full-fledged news article or just a short tweet, has raised significant concern across various domains, e.g., politics, finance, and society [1], [2]. According to Weibo's 2020 annual report [42], 76,107 news items shared on the Weibo social media platform were identified as false by the authority over the course of the year. As an emerging field of research, evaluating misinformation has attracted the attention of researchers across multiple disciplines (Social Science, Communication, Journalism, Computer Science). To maximize impact on their audience, the creators of such misleading news articles frequently use multi-modal information, e.g., text and images, to describe topics. One specific type of malicious multi-modal manipulation, deep fakes [27], [39], [6], [12], has received significant attention from researchers, who attempt to develop automated methods for detecting such distortions. Nevertheless, a phenomenon common in recent years, popularly known as Out-of-Context images [15], [36], is a far more prevalent means of spreading misinformation. It leverages existing, unaltered images as-is, but pairs them with new text that conveys an irrelevant and misleading claim.