1 Introduction
Encouraged by the growing availability of low-cost sensors, multimodal fusion that takes advantage of multiple data sources for classification or regression becomes one of the central problems in machine learning [1]. Joining the success of deep learning, multimodal fusion is recently specified as deep multimodal fusion by introducing end-to-end neural integration of multiple modalities [2], and it has exhibited remarkable benefits against the unimodal paradigm in semantic segmentation [3], [4], action recognition [5], [6], [7], visual question answering [8], [9], and many others [10], [11], [12]. Multitask learning [13] is another crucial topic in machine learning. It aims to seek models to solve multiple tasks simultaneously, which enjoys the benefit of model generation and data efficiency against the methods that learn each task independently. Similar to multimodal fusion, multitask learning has also been developed from previously shallow methods [14] to deep variants [15], [16], [17], [18], [19] by taking advantage of deep learning. The successful applications of multitask learning include navigation [20], robot manipulation [21], etc.