I. Introduction
Vehicular networks, a key enabler of intelligent transportation, have received growing attention from both industry and academia in recent years [1], [2]. An unprecedented amount of data is expected to be shared through real-time communication between vehicles and infrastructure to support new services such as advanced vehicle control and safety. Because vehicular networks are characterized by high-mobility nodes, dynamic channel conditions, and frequent topology changes, traffic routing in them requires solving a challenging online optimization problem [2], [3], [4], [5], [6], [7], [8], [9], [10]. To this end, learning techniques, especially reinforcement learning (RL), have been employed for online decision making in vehicular network routing problems and have shown great promise [11], [12], [13], [14].