I. Introduction
In many machine learning tasks, different modalities can be modeled as different domains, each providing a different view of the same underlying reality. For example, in the context of an autonomous vehicle, the RGB camera, the depth map, and the segmentation map can be considered as three views of the same scene. Often one domain is treated as the input, e.g., a CT (computed tomography) scan of a patient, and another as the output, e.g., the organ segmentation of that scan. Here we do not fix an input or an output domain; instead, we aim to learn to translate from any domain to any other. This setting generalizes the classical input-output formulation, since every domain can serve as either input or output.

While the growing amount of available data has led to strong results on fully supervised domain translation tasks, in some fields supervision is scarce and costly to obtain. In the medical field, for instance, a domain expert is required to produce hand-made segmentations [1], [2], and this low-supervision setting often degrades the performance of classical deep learning models.