Self-supervised Object Detection Network From Sound Cues Based on Knowledge Distillation with Multimodal Cross Level Feature Alignment | IEEE Conference Publication | IEEE Xplore