Self-Supervised Audio-Visual Speech Representations Learning by Multimodal Self-Distillation | IEEE Conference Publication