An Ensemble of Vision-Language Transformer-Based Captioning Model With Rotatory Positional Embeddings | IEEE Journals & Magazine | IEEE Xplore