Vision Transformer based Audio Classification using Patch-level Feature Fusion | IEEE Conference Publication | IEEE Xplore