QUERYD: A Video Dataset with High-Quality Text and Audio Narrations | IEEE Conference Publication | IEEE Xplore