Image Captioning with positional and geometrical semantics

The last five to six years have seen tremendous progress in automatic image captioning using deep learning. While initial research focused on attribute-to-attribute comparison of image features and text to describe an image as a sentence, current research addresses issues related to semantics and correlations. However, current state-of-the-art research lacks sufficient concepts when it comes to positional and geometrical attributes. The majority of research relying on CNNs (Convolutional Neural Networks) for object feature extraction has no notion of equivariance and rotational invariance, which leads to an orientation-less understanding of objects for captioning, along with longer training times and larger dataset requirements. Furthermore, CNN-based image captioning encoders also fail to understand the geometrical alignment of object attributes within the image and hence mislabel distorted objects as correct. To address these issues, we propose ICPS (Image Captioning with Positional and geometrical Semantics), a capsule network-based image captioning technique with a transformer neural network as the decoder. The proposed ICPS architecture handles various geometrical properties of image objects with the help of parallelized capsules, while object-to-text decoding is performed by a Transformer Neural Network. The inclusion of cluster capsules provides better object understanding in terms of position, equivariance, and geometrical orientation, with more augmented object understanding over a small dataset in comparatively less time. The extracted image features provide a better understanding of image objects and help the decoding stage narrate effectively with positional and geometrical details. We trained and tested ICPS on the Flickr8k dataset and found it better at describing positional and geometrical transitions than other current state-of-the-art research.


I. INTRODUCTION
The field of computer vision is experiencing new challenges and ventures on a day-to-day basis, both in academia and in industry. These challenges target various segments of human life at scale. Among them, image captioning is a prominent and active topic. Image captioning models the human behavior of seeing and describing. A robust, human-like image captioning model would ensure that the machine's understanding of a scene matches ours, which would be a significant step towards mature and effective artificial intelligence. Researchers around the world are trying to create meaningful and comprehensive image-to-text generation with the help of deep learning. However, this requires global content understanding within images as well as efficient state-of-the-art natural language modeling techniques, both of which are challenging in their own domains.
The general approach adopted by researchers is an encoder-decoder architecture: encoding is done with convolutional neural networks, while decoding is performed by recurrent neural networks or their variants such as LSTMs/GRUs [1,2,3]. The inherent capabilities of convolutional neural networks have been promising for image captioning encoders and have given satisfactory results, while the natural language capability of recurrent neural networks has been effective for decoding. However, the encoding module lacks equivariance and geometrical understanding, while the decoding part suffers from having to learn visual representation mapping and a language model simultaneously. To the best of our knowledge, these remain the major challenges in building a more human-like image captioning model.
Much research is available on making robust and human-like image captioning with variants of convolutional neural networks and recurrent neural networks. These works have been successful and somewhat significant [4,5]. Considerable progress in the field of image captioning has been made due to the inclusion of semantic conceptualization of images and image objects. Semantic understanding partially reflects human cognitive behavior and makes captioning more human-like [5,6]. The semantic understanding of objects has enabled the encoding stage to provide better multi-label classification support to the decoding stage and helps the decoder mine better language concepts from the training data. The detected concepts allow the decoder to generate superior captions, especially on test data drawn from the same training dataset. However, we believe that existing state-of-the-art works providing semantic conceptualization lack positional and orientational details. This renders the captioning meaningless in terms of understanding the activity of objects in the image. An example of our argument is given in Figure 1, an image captioned by various state-of-the-art works in the domain of image captioning. A few captions are "Dog swims in the water" and "Dog standing in water". These captions are good in terms of feature-to-text translation; however, they lack positional and orientational details. In this paper we propose a technique termed ICPS; our ICPS algorithm captions the same image as "Dog standing near the water", which indicates the relative position of the dog with respect to the water in the image. The positional and orientational semantics are lost because of the captions used to train these models.
Generally, training captions are acquired through unconstrained captioning by human annotators, who describe the objects and their activity within the image rather than taking the trouble to describe positional and orientational details. Annotation data lacking positional and orientational details is then used to train the model, which leads to an insufficient understanding of orientations and positions. To address the issue, we can either regenerate the annotations with the required information by investing time and money, or leverage the orientational and geometrical capabilities of capsule network-based feature extraction in the encoder stage. A capsule network is composed of capsules that store information as vectors instead of scalars. The vector representation stores a feature along with its orientation or angle. The standard capsule network uses routing by agreement to pass information from one layer to the next, as a replacement for the pooling layer. The combined magnitude and orientation information passed to the next layer helps the network understand affine transformations and the geometrical behavior of objects within the image. Despite the effectiveness of capsule networks and their out-of-the-box performance on the MNIST dataset as presented in [32], their major drawback is training time when complex data is fed to the model. The routing-by-agreement algorithm takes a significant amount of time as network parameters increase, compared to available convolutional neural network variants. Running a capsule network over the Flickr8k dataset poses the same challenge of exponentially growing training time and large parameter counts, which are not tractable for research. To address this issue, we use a fusing technique over parametric information coming from parallelized capsules.
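The vector-valued capsule outputs described above are typically produced by a squashing non-linearity, which shrinks short vectors toward zero and long vectors toward unit length while preserving direction, so that a vector's length can be read as the probability of a feature's existence. A minimal numpy sketch (illustrative only, not the exact ICPS implementation):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule squashing non-linearity: output length lies in [0, 1)
    and the vector's orientation is preserved."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm) / np.sqrt(sq_norm + eps)
    return scale * s

# A capsule output keeps its direction; its length becomes a
# probability-like value strictly below 1.
v = squash(np.array([3.0, 4.0]))   # input has length 5
length = np.linalg.norm(v)         # 25/26, roughly 0.96
```

Because only the magnitude is rescaled, the orientation (here, the 3:4 ratio between components) survives, which is exactly the property that lets a capsule carry angle information alongside feature strength.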
The fusing behavior reduces the parameters for tractable processing, while the parallelization enables the network to learn at greater depth over the dataset. The ICPS encoder architecture is composed of stacked convolutional neural networks [7,8,9] with skip connections to provide better convolution over the input images; in addition, the parallelized capsules handle various details, for example object magnitude, orientation, and spatial information. The parallelization significantly increases the performance and depth of understanding, along with a reduction in parameters at the fusing point. We tested ICPS on the Flickr8k [10] dataset and found it comparable to current state-of-the-art research in image captioning. The key contributions of our research work are:
• An innovative encoder network capable of learning spatial, geometrical, and orientational details through a parallelized capsule network, with improved performance for image captioning.
• An improved baseline capsule network that can be trained on more complex datasets in a tractable amount of time for image captioning.
The rest of the paper is organized as follows. Section 2 discusses related work and existing issues; Section 3 provides details of our methodology; Section 4 describes our experimentation, testing, and evaluation; Section 5 provides a conclusion and future research directions.

II. RELATED WORK
Most of the latest research in the domain of image captioning uses the fundamental encoder-decoder architecture or framework. In this framework, image-to-feature extraction is done by the encoder module, which can be constructed with various deep learning networks, e.g., convolutional neural networks, auto-encoders, GANs, transformers, capsule networks, etc. The job of the decoder module is to map the features to the provided annotations during training, learning to convert features into human language. Once trained, the network can annotate a randomly provided image with various details in a human-like manner.
The basic encoder-decoder architecture is composed of a CNN (as encoder) and an RNN (Recurrent Neural Network) (as decoder). The image is fed to the CNN for feature extraction, while the features are fed to the RNN for mapping against the annotation words [12,13,14,15]. To make the network more innovative and efficient, various additions have been made to the model, for example visual attention mechanisms [16,17], regions of interest, and attention behaviors [18,19]. A significant group of researchers believes that attention and visual attention help in better understanding objects and their behaviors during image captioning. However, visual attention operates on higher-order convolutional features, which reduces the spatial and localization information and weakens the semantic impact on the output. Similarly, applying regions of interest to images during the encoding phase is also prevalent in the field of image captioning. The idea is to use multiple R-CNN-based object detectors and extract features from those regions for captioning. This helps in generating more verbose captions for each region separately, but at the same time loses the semantics and spatial relations among objects lying in the inter-region spaces of an image. Despite the incorporation of effective techniques, a serious gap persists between an image and its generated caption for general-purpose use in daily life. This motivates the requirement of handling semantic concepts of images and of objects inside the image, and making use of them while performing the captioning. Semantic understanding requires more than just mapping representations during training and producing an output at test time. The latest research focuses on semantically-oriented image captioning, which considers object behavior, posture, and attributes during training and uses them during evaluation [20,21,22].
Among the current state-of-the-art works in image captioning that resemble our idea, though only to a certain extent, is the use of graph convolutional neural networks to understand the global and regional context of an image and its objects [23,24,25]. Graph convolutional neural networks are used to understand semantic and spatial relationships, which helps the captioning model generate spatial tokens, e.g., towards, inside, near, etc. Another approach, used in [26], utilizes scene graph understanding along with objects to find possible correlations between the background/scene and the objects.
Current trends in the research community are focused on using visual mappings and correlations among the objects of an image and using these visual representations to generate captions. The usage of visual relations enables the captioning network to work on the semantics that help in predicting an object and its behavior based on the subject [27,41]. The recent shift towards learning directions and the geometrical understanding of objects in an image is critical for more human-like caption generation and is being pursued globally in the research community [28,29,41].
Despite extensive research and the trend of using various techniques to understand the underlying semantics of objects in an image and translate them into human-like text, we find a large gap in the research pertaining to handling the geometrical and orientational details of objects and mapping them into the generated caption.
Our research aims to provide a more thorough and in-depth approach to finding the connections between the spatial and geometrical semantics of image objects and converting them into usable features for training the decoder stage.

III. APPROACH
This section discusses the approach used in ICPS for handling spatial and geometrical features of images for image captioning. Our proposed architecture is an encoder-decoder framework. The encoder is composed of parallelized capsule networks, while the decoder is based on a simple transformer neural network. The use of capsule networks is motivated by their ability to understand the geometrical and spatial details of objects, while the transformer neural network is chosen because it is the current state of the art in NLP [30].

A. ENCODER NETWORK
The baseline of our approach is the inclusion of a parallelized capsule network as the encoder stage for image captioning. The parallelized capsule network architecture allows feature-specific learning of the spatial and geometrical content of image objects. Our proposed architecture is given in Fig. 2.
The fundamental architecture of a capsule network is composed of two convolution layers on the input, each with 256 channels, a 9x9 kernel, and a stride of 1. The activation is ReLU. The following layer is a convolutional capsule layer with a 6x6 primary capsule grid and 32 channels. This layer receives scalar input from the previous convolution layer and produces an 8-D vector output. The squashing function handles non-linearity and outputs a 16-D vector for each of the 10 classes of MNIST. The next layer performs probability calculation on the 16-D vectors fed by the squashing unit. Finally, reconstruction of the image is done using fully connected layers. This is the fundamental architecture given in [32], termed capsule networks with dynamic routing. The architecture works as follows: features are extracted using convolution layers and then fed to the primary capsule layer. Each capsule in the primary capsule layer has an associated activity vector that encodes spatial and geometrical information. The output of the primary capsule layer is then fed to the DigitCaps layer, which performs dynamic routing over the activation values along with coupling coefficients and yields class probabilities, which are passed through fully connected layers to reconstruct the output image. Generally, there are three fully connected layers.
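The dynamic routing step described above can be sketched in a few lines of numpy. The sketch below makes simplifying assumptions (random prediction vectors, a softmax over the routing logits, three iterations) and is meant only to show the agreement mechanism, not the paper's exact implementation:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule non-linearity: output length in [0, 1), direction preserved."""
    sq = np.sum(s * s, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """Routing by agreement.
    u_hat: prediction vectors from lower capsules, shape (num_lower, num_upper, dim).
    Returns the upper-capsule output vectors, shape (num_upper, dim)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))              # routing logits
    for _ in range(iterations):
        # Coupling coefficients: each lower capsule distributes its
        # output across upper capsules (softmax over the upper axis).
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[:, :, None] * u_hat).sum(axis=0)       # weighted sum per upper capsule
        v = squash(s)                                 # upper-capsule outputs
        b = b + np.einsum('ijk,jk->ij', u_hat, v)     # agreement update (dot products)
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 2, 8))   # 6 lower capsules, 2 upper capsules, 8-D vectors
v = dynamic_routing(u_hat)           # shape (2, 8), each vector length < 1
```

Lower capsules whose predictions agree with an upper capsule's current output receive larger coupling coefficients on the next iteration, which is what replaces pooling in this architecture.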
Every capsule in the capsule network architecture is responsible for focusing on a small area of interest and provides object details as a vector to capsules in the next layer [31,32,33]. Capsules are groups of neurons having activation vectors as their instantiation parameters, while the length of each vector defines the probability of the existence of a feature. Fig. 2 provides a simple description of the fundamental building blocks of the capsule network. The likelihood-based routing works well for simple image datasets, i.e. MNIST and Fashion-MNIST, where there is only one channel and a single object in the image. However, the usefulness of capsule networks decreases on complex datasets, i.e. Flickr, MSCOCO, CIFAR, and SVHN. The presence of multiple channels and complex images with multiple objects not only increases training time exponentially but also yields below state-of-the-art results [42]. To handle multiple channels and objects simultaneously, our proposed encoder architecture addresses the complexity issue by parallelizing the input. In our architecture, we use 16 capsules as a parallelized encoder input stage, each responsible for at least one object at a time, forming a primary capsule cluster. The input convolution layer takes a 3-dimensional input of channels, kernel height, and kernel width. After features are extracted by the convolution layer and passed through the activation layer, the input data is converted into batches of 16, termed D_pc, one for each capsule in the primary capsule layer, resulting in a 4-dimensional block of channels, kernel width, kernel height, and block size.
Each capsule, from the set of 16 parallel capsules in the primary capsule layer, is fed the input and is expected to calculate object magnitude along with geometrical and positional information with the help of angles. The calculation is improved because there are 16 possible views for position calculation in our proposed cluster capsule network architecture, compared to the simple capsule network architecture. The resulting matrices from all capsules in the primary capsule cluster are aggregated and flattened to be fed to the feature capsule cluster, which then yields the feature probability matrix based on dynamic routing and the squashing function. The feature capsule generates a feature matrix of size N. Generating features instead of classes is much faster and provides better traction in training the model, since features provide more fine-grained information and eliminate the need for flattened and sigmoid layers to calculate class probabilities. The primary capsule cluster layer is fed by a convolutional neural network with 512 feature maps. This combination provides a 512 + 36 feature engine, leading to a better understanding of the input image features in terms of relative position and geometry. Furthermore, the architecture uses skip connections in the convolution layers to ensure that discriminating features are fed to the primary capsule cluster layer to generate a better-performing matrix. Fig. 3 provides a detailed view of our proposed encoder architecture using capsule network clusters. Figure 4 provides a description of a single capsule, with details of its dimensions, in our proposed primary capsule cluster for ICPS. The actual input dimension is 3 (kernel width, kernel height, and channels); however, the batch size serves as a dimension as well, to generate the cluster as per the input feed. The input parameters are calculated using the equation D_pc * C_n * k_w * k_h,
where:
• D_pc is the dimension of the capsule network,
• k_w and k_h are the width and height of the kernels, respectively,
• C_n is the number of input channels.
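The input-parameter calculation is simple enough to verify directly. The concrete values below are illustrative assumptions (16 parallel capsules, RGB input, the 9x9 kernel of the baseline capsule network), not necessarily the paper's final configuration:

```python
# Input-parameter count for the primary capsule cluster,
# following the equation D_pc * C_n * k_w * k_h from the text.
# Values below are illustrative, not the paper's exact configuration.
D_pc = 16   # number of parallel capsules in the primary capsule cluster
C_n  = 3    # input channels (RGB image)
k_w  = 9    # kernel width
k_h  = 9    # kernel height

input_params = D_pc * C_n * k_w * k_h
print(input_params)  # 3888
```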

B. DECODER
The decoding stage of our architecture is based on transformer neural networks. The Transformer is selected for its ability to handle large datasets, its parallelized behavior for caption generation, and its self-attention mechanism [34,35]. Features are injected into the Transformer Neural Network during training. Figure 5 shows a simple architecture of our transformer neural network. The image features, containing object detail, shape, geometry, and location, are processed through the input embedding layer to reduce the dimension from D_n to D_m. This reduction allows a smaller computation time during the training stage at the cost of a small loss of input information. One major difference in our approach is that the decoder stage is dynamic with respect to the output features of the capsule network-based encoder stage. This enables the Transformer Neural Network to learn with the dynamic behavior of the input and also provides a sufficient boost in learning behavior.
The encoder stage of the Transformer, for a single layer, works as follows: each feature F_i is fed to the input embedding layer, which converts the feature vectors into embedding vectors.
This feeding is not recurrent as in RNNs/LSTMs; instead, the multi-headed attention layer expects concurrent injection of the feature vectors. This concurrent behavior provides a massive boost in performance and also helps in capturing long-term dependencies in the text. However, the simultaneous flow of features causes the transformer neural network to lose the position or order of feature words/vectors within the text. The order or position of words/features is critical for grammar, linguistics, and the sense of the sentence. The workaround is to attach a position-dependent signal to each feature, termed positional encoding in transformer neural networks:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))     (1)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))     (2)

In equations 1 and 2, pos refers to the position of the feature in the sequence of features, starting from index 0; d_model defines the depth of the feature embedding, i.e. the number of dimensions of each embedding vector; and i refers to the index of each dimension of the embedding. The term PE is short for positional encoding, while (pos, 2i+1) and (pos, 2i) are the odd and even dimension representations, respectively.
The use of sine and cosine functions helps linear models learn positional encodings easily. Sine and cosine can be viewed as continuous counterparts of binary positions in the domain of floats, like alternating bits over a wave. However, the same alternating pattern would reappear for various features/words, so the inclusion of i varies the frequency of the wave and changes the positional argument for every feature/word. The same applies to the cosine function. Fig. 6 illustrates this behavior. Each dimension corresponds to a sinusoid, and the wavelengths form a geometric progression from 2π to 10000·2π. The value 10000 is a scaling factor in the positional encoding, which creates a fairly large circle to accommodate sequences of approximately 1k or fewer features/words.
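The sine/cosine scheme above can be computed in a few lines of numpy. This is a generic sketch of the standard sinusoidal encoding, with illustrative sizes (50 positions, 64 embedding dimensions) rather than the paper's actual dimensions:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sine, odd
    dimensions use cosine, with wavelengths forming a geometric
    progression from 2*pi to 10000*2*pi (d_model assumed even)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=64)
# Every entry lies in [-1, 1]; position 0 encodes as (0, 1, 0, 1, ...),
# since sin(0) = 0 and cos(0) = 1.
```

Because each position maps to a unique pattern of phases across dimensions, the encoding can simply be added to the feature embeddings before the attention layers.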
The positionally encoded input is passed through the multi-headed attention layer. The number of heads in multi-headed attention depends on the number of parameters we would like to learn. The entire architecture of our proposed research can be viewed in Figure 7. The results acquired over the Flickr8k dataset are significantly better than the current state-of-the-art papers in terms of positional and geometrical semantics. Along with the positional information, our ICPS produces better captions by including adverbial information. A few examples of ICPS output are given in Figure 8. The model provides traction in that, despite running for only a few epochs, it gives comparable and, in some cases, significantly better results. The BLEU-4 scores are higher than those of [37] and [38] in Fig. 8, and the generated text is far superior, capturing position and geometry while consuming very limited training resources. Similarly, the METEOR score in Fig. 8 is slightly higher than that of [38], while the geometrical and positional understanding of image features is superior compared to the state-of-the-art works. BLEU and METEOR scores are benchmarks in image captioning research; however, they do not account for positional and geometrical information within the text. BLEU performs text-to-text comparison, while METEOR computes the recall and precision of unigrams within the text. Both are textual evaluation metrics; the scores are used to provide a comparison between ICPS and the mentioned state-of-the-art works.
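The multi-headed attention layer mentioned above is built on scaled dot-product attention; each head computes softmax(Q·K^T / sqrt(d_k))·V over its own projections. A minimal single-head numpy sketch, with illustrative sizes rather than the paper's configuration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (n_q, n_k)
    # Numerically stable softmax over each row of scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))   # 4 query features, 8-D
K = rng.normal(size=(6, 8))   # 6 key features
V = rng.normal(size=(6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# out has shape (4, 8): each query is a convex combination of the values.
```

Multi-headed attention simply runs several such attention computations in parallel over learned linear projections of Q, K, and V, then concatenates the results, which is what lets the head count scale with the number of parameters to learn.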

IV. EXPERIMENTATION
The capability of our ICPS model, with capsule clusters for geometrical and positional inference, is demonstrated on the Flickr8k and Flickr30k datasets. The Flickr datasets were selected on the basis of simplicity, size, and density of the image matrix. The capsule network architecture requires a significant amount of processing resources when used on complex datasets, i.e. MSCOCO. In our case, we use a cluster of capsule networks, which is even more resource-intensive during training and computation. To obtain tractable results with limited resources, we evaluated our methodology on the Flickr8k and Flickr30k datasets only. Resource constraints limited us to training our model for up to 100 epochs on an NVIDIA Quadro P4000 GPU, with 64 GB of RAM and a Xeon processor. The batch size was 16, the learning rate was 0.001, and Adam was used as the optimizer. The images were scaled down during the initial runs as a proof of concept, while in the final training the scaling factor was removed. The training, validation, and test splits were 80%, 10%, and 10%, respectively, for each dataset; so for Flickr8k, 4800 images were used for training, 600 for validation, and 600 for testing. A comparison of generated captions over the same images is given in Table 1. It is worth noting that all the test images used in [38], [39], and [40] are from the MSCOCO dataset, and our ICPS, trained only on the Flickr datasets, provides more comprehensive position and orientation captions against them. Table 2 provides a comparison of our ICPS results for BLEU-4 and METEOR with the mentioned state-of-the-art research. The comparison is done in a standard way to reflect the outcome of our ICPS with a minimal amount of training and resource consumption. The novelty of our ICPS lies in being trainable with a limited amount of data, in a very short training time, for a comparatively better result.
We used the publicly available code [36] for the baseline.

B. FLICKR30K
As the name suggests, Flickr30k consists of 31k images collected from Flickr groups, each with 5 associated text captions. The captioning was done via manual human annotation. There are 158k captions in total, with 244k coreference chains. Flickr30k also comes with bounding boxes, 276k in count. Our initial round of experiments compares against the paper [36] and follows the same protocol to better understand the capsule network baseline; however, we never moved to final classification and reconstruction, limiting ourselves to feature extraction only. We then updated the code to perform parallelized feature extraction and compared the results with non-parallel feature extraction using the original code on the Flickr8k dataset. The graphs in Figures 11 and 12 compare feature extraction accuracy and speed between the simple capsule network [36] and our parallelized capsule networks. A higher accuracy value is better, while lower time consumption indicates a better-performing network. Tables 3 and 4 contain a few values in bold, which are the best ICPS results in terms of accuracy and efficiency for the Flickr8k and Flickr30k datasets, demonstrating a 4-5% improvement over the existing results. The results are promising, with a 98% recall rate over 10 runs of execution under the same conditions. Both tables provide a comparative analysis of the accuracy and efficiency of our capsule network cluster designed for the image captioning job. The results reflect that our ICPS has significantly better efficiency in performing feature extraction from images compared to the original capsule network. The delta in seconds is huge, providing an over 100% boost in processing a single epoch for a batch of 128 images.

C. REAL WORLD EXPERIMENT
In addition to testing and validation on the Flickr8k and Flickr30k datasets, we tested our model on random images taken from Google Images for real-world validation; the results are given in Figure 13. Although the results provide better captioning in terms of positional and geometrical information, in some cases the model's captioning and understanding of the image can be wrong. For example, the caption in 10-B is wrong compared to the original image: the image is of a mountain range with curves, which our model understands as a person.

V. CONCLUSION
Our research has introduced an innovative approach to image captioning using capsule networks. We made the model tractable in time and resource consumption, which is currently a major bottleneck for capsule network-based computer vision solutions. The idea we put forth is the parallelization of capsules and the extraction of features instead of classes, which reduced training time by removing the final stage of the capsule network. The inclusion of transformer neural networks with dynamic heads also performed well and provided human-level captioning. The feature extraction from capsule clusters helped in finding the positional and geometrical semantics of the image, leading to captions that are more verbose and human-like. The achieved BLEU and METEOR scores indicate that the captioning has improved.