I. Introduction
Task-oriented dialogue systems (ToDs), which help users complete goals such as hotel bookings and restaurant reservations, have attracted increasing attention. Traditional ToDs consist of modularly connected components for natural language understanding (NLU) [1], dialogue state tracking (DST) [2], dialogue policy (DP) [3], and natural language generation (NLG) [4]. In recent years, end-to-end task-oriented dialogue systems (EToDs) have emerged in the literature; they use a unified sequence-to-sequence model to generate a response given a dialogue history and a knowledge base (KB) [5]. For example, given the dialogue history “Send me to the nearest gas station, I want to fuel my car.” in the first turn and the corresponding knowledge base in Fig. 1, an EToD can directly produce the system response “The nearest gas station is valero at 200 Alester Ave, 7 miles away setting directions now” without requiring any intermediate supervision.