Multimodal AI For Image And Video Inference: Generate Text Output From Visual Inputs | IEEE Conference Publication | IEEE Xplore