Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

Image caption generation requires expressing image content in accurate natural language. In the existing encoder-decoder structure, the decoder generates words one by one in front-to-back order and cannot analyze the full context. This paper employs a Bi-LSTM (Bi-directional Long Short-Term Memory) structure, which draws on past information and also captures subsequent information, so that image content is predicted from context clues. The visual information is fed into the F-LSTM decoder (forward LSTM decoder) and the B-LSTM decoder (backward LSTM decoder) to extract semantic information and produce complementary semantic output. Specifically, the subsidiary attention mechanism S-Att acts between F-LSTM and B-LSTM and extracts their semantic information through attention. The semantic interaction is extracted according to similarity while the hidden states are aligned, and the fused semantic information is output. The resulting Bi-LSTM-s model extracts contextual information and realizes finer-grained image captioning effectively. Our model improves by 9.7% over the original LSTM. In addition, it effectively resolves the inconsistency between forward and backward semantic information at the same time step, and achieves a BLEU-4 score of 37.5. The superiority of this approach is demonstrated experimentally on the MSCOCO dataset.


I. INTRODUCTION
Image captioning [1], [2], [3], [4] is a complex multimodal scene understanding task involving two fields of study: computer vision [5], [6] and natural language processing [7], [8]. Its purpose is to automatically generate appropriate natural language captions for the salient visual content of input images. This task requires the model to complete the following actions. First, the model must comprehend the visual content by identifying salient elements in the image and their mutual correspondence. Second, based on this visual understanding, the model must accurately describe the structured visual information word by word in natural language. Dynamic multimodal analysis and reasoning are performed on the visual content, as well as on the generated words, in the course of caption word generation. At present, image captioning models are primarily based on the encoder-decoder [9], [10] approach, which examines only the image's global region while generating the caption. The encoder represents the image as the average of global area features, ignoring the image's local saliency. As a consequence, the attention mechanism [11] is applied to image captioning: the extracted visual features are normalized into a set of weight values, and the external visual features of the encoder are matched with the internal semantic features, further improving the model's interpretability. In recent years, visual attention [12], [13], [14] and semantic attention [15] have proved their superiority in this domain.
The associate editor coordinating the review of this manuscript and approving it for publication was Kumaradevan Punithakumar.
The common difficulty with most approaches is that the LSTM-based [16] deep neural network considers only unidirectional data input and ignores the impact of the orientation of the sequence on prediction. Yet the prediction of a sentence should be determined by its context, so it is imperative to consider information from both prior and subsequent moments. To address this issue, this paper employs the Bi-LSTM [17], [18] structure, which comprises two LSTM neural networks, one forward and one backward. In contrast to the traditional unidirectional LSTM network [19], [20], [21], the Bi-LSTM structure considers the inherent patterns of forward and backward data simultaneously, predicting from both the past and the future. It employs two independent hidden layers to process the forward and backward semantic information, respectively. The forward and backward outputs are then combined by summation, so that content is extracted from both the forward and the backward LSTM. As illustrated in Fig. 1, the forward and backward LSTMs extract semantic features such as "riding" and "wave", and the attention mechanisms extract salient regions.
When Bi-LSTM is employed as the decoder, the image captions generated by the forward and backward approaches for the same image are prone to vary widely, and the semantic contents at the same time step barely match. When the current word is generated in forward order, the backward generation approach fails to offer effective context information synchronously; similarly, when it is generated in backward order, the forward generation approach fails to provide valid context information synchronously. Therefore, to fully utilize context information while addressing the lack of synchronization between the forward and backward directions, this paper proposes S-Att, which employs a subsidiary attention mechanism between F-LSTM and B-LSTM to extract the correlation intensity of their semantic information. As a result, the semantic information is aligned and output in a complementary fashion. This method addresses the limitation that the forward and backward semantics at the same time step are inconsistent and cannot be produced jointly, contributing to more precise sentence predictions.
Consequently, our final model employs the CNN-Bi-LSTM-s encoder-decoder, as indicated in Fig. 2. A CNN is employed to extract features, and attention mechanisms extract salient regions. Bi-LSTM is employed to extract contextual information, with S-Att introduced to fuse semantics and align complementary outputs.
In summary, our main contributions are as follows:
• We adopt Bi-LSTM as a decoder to extract features in different directions and obtain more fine-grained contextual information.
• We adopt the subsidiary attention mechanism to fix the semantic information, and align the forward and backward hidden states through the similarity module to improve the output accuracy.
• We fuse the features extracted by visual attention and subsidiary attention to obtain complementary and progressively finer-grained sentences.

II. RELATED WORKS
According to the generation approach, the current major image caption generation algorithms are split into three types [19]: module-based matching algorithms [22], [23], [24], migration-based algorithms [25], [26], and neural network-based algorithms [1], [2], [11]. The module-based matching algorithm first identifies the objects, attributes, actions, and other information present in the image using multiple classifiers, and then puts the detected information into a manually designed sentence module to generate image captions. Although this algorithm is straightforward and intuitive, it has difficulty recognizing more sophisticated image information and cannot generate sentences with more complicated structures, given the constraints of the classifiers and sentence modules [23].
The migration-based algorithm retrieves similar images in the existing database and then regards the caption of the similar image as the caption of the image to be queried. Since the sentences in the database are entirely human-generated, the migration-based algorithm produces grammatically correct sentences. However, considering that the searched image and the image to be queried are similar instead of being definitely identical, the sentences directly generated in this case may not accurately describe the content of the image to be queried.
In recent years, deep neural networks have been applied to image retrieval [27] and machine translation [28] with success. Inspired by this trend, a variety of image caption generation algorithms based on deep neural networks were proposed, followed by great breakthroughs. This type of algorithm extracts image features using a CNN and decodes them into fluent sentences using an RNN [29]. Unlike module-based matching algorithms or migration-based algorithms, neural network-based algorithms not only eliminate the limitation of sentence modules but also generate novel sentences not available in current databases, owing to the representation capabilities of CNNs combined with the efficient modeling capabilities of RNNs for variable-length sequences. A novel parallel-fusion LSTM structure [30] adopts hidden states based on two parallel LSTMs to make attribute and visual image information complementary and enhanced at each time step. An innovative structure [31] eliminates the redundancy that exists in the training set and uses adaptive weights to improve the generalization of captions. A more sophisticated attention mechanism [32] is employed to extract salient region features; it combines sentence-level attention models with word-level attention models to generate more accurate captions. A method exploring region relationships [33] implicitly models the relationship between related semantics and dynamically searches the related visual relationships between multiple regions, making image captions more accurate. The attribute-driven image captioning model [34] selects a specific area of the image and then decides which attribute to focus on, which improves the coverage of visual attributes. The excellent performance of Bi-LSTM in machine translation has led many tasks to adopt bidirectional LSTM.
In the automatic language identification task [35], Bi-LSTM effectively extracts "future" speech sequences, and the effect is remarkable. In the sentiment analysis task [36], Bi-LSTM effectively extracts context information and obtains more accurate prediction results. Many studies have demonstrated that features can be extracted efficiently using Bi-LSTM. In the implicit discourse relation recognition task [37], the discourse arguments are encoded by Bi-LSTM to preserve contextual information, and the final result is better than that of LSTM. In the event detection task [38], the algorithm adopts the Bi-LSTM model to capture contextual information, and the final result is again better than that of LSTM.

III. PROPOSED METHOD
A bidirectional LSTM is introduced as the decoder for image captioning, which efficiently extracts contextual information; meanwhile, the F-LSTM is aligned with the B-LSTM via subsidiary attention, followed by semantically complementary output. The following elaborates on our model, as presented in Fig. 3. The hidden state extracted by the fixed forward LSTM and the hidden state of the backward LSTM are semantically aligned by the similarity module.

A. ENCODER-DECODER
When an image $I$ is given, image captioning aims to generate a sentence $Y = \{y_1, y_2, \ldots, y_T\}$ describing the image. Thus, its purpose is to find the parameters $\theta$ that maximize the probability $p(Y \mid I; \theta)$ of the sentence given the image, where $\theta$ represents the image caption parameters. The chain rule is typically applied to model the joint probability:
$$\log p(Y \mid I) = \sum_{t=1}^{T} \log p(y_t \mid I, y_{1:t-1}).$$
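The chain-rule factorization above can be illustrated with a minimal sketch. The function name `caption_log_prob` and the toy probabilities are hypothetical; in a real model each per-step probability would come from the decoder's softmax output conditioned on the image and the previous words.

```python
import math

def caption_log_prob(step_probs):
    """Sum of per-step log-probabilities log p(y_t | I, y_{1:t-1}).

    step_probs: probabilities the decoder assigned to each ground-truth
    word, one per time step (toy values here, assumed to be in (0, 1]).
    """
    return sum(math.log(p) for p in step_probs)

# Toy example: a three-word caption with assumed per-step probabilities.
log_p = caption_log_prob([0.5, 0.25, 0.8])
# The joint probability is the product of the per-step probabilities.
joint_p = math.exp(log_p)
```

Working in log space turns the product of per-step probabilities into a sum, which is what the cross-entropy training objective accumulates over time steps.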
We employ a unified encoder-decoder framework to generate captions:
Encoder-CNN: The pixel values of the input image $I$ are fixed, and the input image is encoded as a set of spatial vectors using a CNN, where $V = \mathrm{CNN}(I)$ applies the CNN to obtain the spatial features $V = \{v_1, v_2, \ldots, v_z\}$. Here $z$ represents the number of image regions, and $v_i \in \mathbb{R}^{D}$ denotes the feature of the $i$-th image region.
Decoder-LSTM: The conventional recurrent neural network (RNN) suffers from vanishing gradients when processing time-series tasks; thus, we adopt the long short-term memory (LSTM) network in place of the conventional RNN as the decoder. Compared with the RNN, the LSTM adds three gate units (input gate, forget gate, output gate) to control the flow of data. The forget gate concatenates the hidden state $h_{t-1}$ of the previous moment with the input $x_t$ of the current moment as the total input of the sigmoid activation function to generate the forget mask $f_t$. Multiplying $f_t$ by the memory information $c_{t-1}$ of the previous moment removes the previous moment's worthless information. The input gate computes the input mask $i_t$ in the same way and uses $i_t$ to filter the candidate memory information $\tilde{c}_t$ at the current moment. Once $c_{t-1}$ and $\tilde{c}_t$ are filtered, they are summed to obtain the comprehensive memory information $c_t$. The output gate computes the output mask $o_t$ in the same way as the first two gates, and $c_t$, after passing through the tanh activation function, is multiplied by $o_t$ to obtain the hidden state $h_t$ at the current moment. The computational procedure is as follows:
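The gate computations described above correspond to the standard LSTM update, which can be sketched as follows. The weight matrices $W_f, W_i, W_c, W_o$ and biases $b_f, b_i, b_c, b_o$ are assumed notation that does not appear in the text; $\sigma$ is the sigmoid function, $[h_{t-1}; x_t]$ is the concatenation of the previous hidden state and the current input, and $\odot$ is the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}; x_t] + b_f\right) \\
i_t &= \sigma\left(W_i [h_{t-1}; x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}; x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o [h_{t-1}; x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The fourth line is the step in which the filtered previous memory and the filtered candidate memory are summed, and the last line produces the hidden state passed to the next time step.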

B. VISUAL ATTENTION GUIDE
In order to make the most of semantic and visual information, we incorporate the two using soft attention [11] in the LSTM. First, semantic and visual information must be properly integrated. Second, attention is paid to different time steps for both kinds of information. As a result, the visual input shifts from the same global image features to changing local image features as each word is generated. Attention dynamically extracts the attended regions from images in response to changes in the visual context. It is defined as follows: $W_z \in \mathbb{R}^{1 \times k_1}$, $W_v \in \mathbb{R}^{k_1 \times k_2}$, and $W_h \in \mathbb{R}^{k_1 \times k_3}$ represent the trainable parameters (transition matrices); $W_v$ maps the visual feature $v_i$ into a visual feature map, and $W_h$ maps the semantic feature $h_t$ into a semantic feature map.
The result is normalized via softmax, thereby generating the attention weight distribution.
$V_t$ represents the generated visual attention feature.
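The soft attention step described above can be sketched in plain Python. This is a minimal illustration under assumptions: the scoring form $W_z \tanh(W_v v_i + W_h h_t)$ is the common soft-attention formulation and is inferred from the stated matrix shapes rather than quoted from the paper, and the helper names `matvec`, `softmax`, and `soft_attention` are hypothetical.

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(V, h_t, W_v, W_h, W_z):
    """Attend over z region features given the decoder hidden state.

    V:   z region vectors, each of length k2
    h_t: hidden state of length k3
    W_v: k1 x k2, W_h: k1 x k3, W_z: 1 x k1 (assumed scoring form)
    Returns (attention weights of length z, attended feature of length k2).
    """
    h_proj = matvec(W_h, h_t)                      # length k1
    scores = []
    for v in V:
        v_proj = matvec(W_v, v)                    # length k1
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, h_proj)]
        scores.append(matvec(W_z, hidden)[0])      # scalar score per region
    alpha = softmax(scores)
    dim = len(V[0])
    V_t = [sum(a * v[d] for a, v in zip(alpha, V)) for d in range(dim)]
    return alpha, V_t
```

Because the weights are recomputed from $h_t$ at every step, the attended feature changes from word to word even though the region features themselves stay fixed.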

C. BI-LSTM
The conventional LSTM predicts the output of the next moment based only on the temporal information of the present moment. However, the output of the current moment is relevant to the states of both the previous moment and the next moment. Predicting the exact word in a sentence, for instance, should be judged not only on the prior text but also on the following content, thereby realizing proper judgments based on context.
$a^{s}_{t}$ represents the similarity weight between the forward and backward directions at time $t$.
The summation of h
The loss function of the bidirectional LSTM includes: $L^{f}_{XE}$ and $L^{b}_{XE}$ stand for the loss functions of the F-LSTM and B-LSTM, respectively. The conventional cross-entropy training strategy is employed, with $L$ as the final loss function. Conventional image captioning models behave differently in training and testing, since testing relies on words previously generated by the model. When the results of a preceding step are incorrect, the errors accumulate and succeeding words cannot be generated correctly. To address these issues, we treat image caption generation as a reinforcement learning problem, directly optimizing sentence generation with respect to the model's evaluation metrics, with the ultimate goal of minimizing the following negative expected reward: $r(Y^{s})$ is the reward computed with CIDEr [39], BLEU [40], or other metrics once the prediction is complete; in addition, the LSTM updates its internal hidden state, attention weights, and other states.
The gradient can be approximated by the following formula: $y^{*}$ represents the baseline score obtained at test time with beam search decoding.
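The reward-with-baseline idea described above can be sketched as a surrogate loss for a single sampled caption. This is a minimal sketch assuming the usual policy-gradient form with a baseline, in which the gradient is approximated by $-(r(Y^{s}) - r(y^{*})) \nabla_{\theta} \log p_{\theta}(Y^{s})$; the function name `self_critical_loss` and the numeric rewards are hypothetical, and in practice the rewards would be metric scores such as CIDEr.

```python
def self_critical_loss(sampled_logprobs, sampled_reward, baseline_reward):
    """Surrogate loss whose gradient matches the baseline-corrected
    policy gradient for one sampled caption.

    sampled_logprobs: log p(y_t | ...) for each word of the sampled caption
    sampled_reward:   reward r(Y^s) of the sampled caption
    baseline_reward:  reward of the baseline caption (assumed given)
    """
    advantage = sampled_reward - baseline_reward
    # Minimizing this raises the probability of captions that beat the
    # baseline and lowers the probability of captions that fall below it.
    return -advantage * sum(sampled_logprobs)

# Toy numbers: the sampled caption scores higher than the baseline.
loss = self_critical_loss([-1.0, -2.0], sampled_reward=1.2, baseline_reward=1.0)
```

Subtracting the baseline leaves the expected gradient unchanged while reducing its variance, which is the usual motivation for this form.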

IV. EXPERIMENTS
In order to demonstrate the effectiveness of the proposed bidirectional LSTM model, we perform extensive experiments to test the model and compare it with advanced models. The experimental details follow, covering the dataset, assessment metrics, implementation details, and testing approach.

A. DATASET
We evaluate our model on the widely used MSCOCO [41] dataset, a large-scale dataset for object recognition, segmentation, and captioning; each image is collected from daily life, making it the primary experimental dataset for image captioning. Each image contains multiple entities and has five manually written captions. The dataset includes 91 object categories, 328,000 images, and 2.5 million labels. As the largest dataset with semantic segmentation, it provides 80 categories, over 330,000 images (200,000 of which are annotated), and over 1.5 million object instances in the entire dataset. We adopt 110,000 images for training, 5,000 images for validation, and 5,000 images for testing [1].

C. DATA PRE-PROCESSING
In this paper, we implement a bidirectional LSTM with a subsidiary attention mechanism. Our parameter settings and experimental details are as follows.
First, we convert all words in the dataset to lowercase and truncate each caption to a length of 16; words with a frequency of less than or equal to 5 are deleted, which finally yields a vocabulary of 9,500 words.
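The caption pre-processing steps above can be sketched as follows. This is an illustration under assumptions: whitespace tokenization, counting word frequencies after truncation, and dropping rare words from the captions are one reading of the text, and the function name `preprocess` and its parameters are hypothetical.

```python
from collections import Counter

def preprocess(captions, max_len=16, max_rare_count=5):
    """Lowercase, truncate, and drop rare words from caption strings.

    Returns the filtered token lists and the sorted vocabulary.
    """
    # Lowercase, split on whitespace (assumed tokenizer), keep at most
    # max_len words per caption.
    tokenized = [c.lower().split()[:max_len] for c in captions]
    # Count word frequencies over the truncated captions.
    counts = Counter(w for tokens in tokenized for w in tokens)
    # Keep only words seen more than max_rare_count times.
    vocab = {w for w, n in counts.items() if n > max_rare_count}
    filtered = [[w for w in tokens if w in vocab] for tokens in tokenized]
    return filtered, sorted(vocab)

# Toy corpus: "cat" and "sleeps" appear once and are removed.
captions = ["A dog runs"] * 6 + ["A cat sleeps"]
filtered, vocab = preprocess(captions)
```

A common alternative is to map rare words to a single unknown-word token instead of dropping them; the text does not say which variant was used.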
Second, in the encoding phase we adopt the pre-trained ResNet-101 [44] to encode the image into a visual feature map of size 14 × 14 with 2048 dimensions. The visual feature map mainly represents the fine-grained information of the image.

D. DECODING PHASE
We employ a Bi-LSTM-s structure to decode visual feature maps into image captions, with a word embedding dimension of 512. The forward LSTM, backward LSTM, and attention dimensions are all set to 512.
Finally, during the training phase, we train our model with the Adam [45] optimizer under the cross-entropy loss. We fine-tune the last convolutional layer of ResNet-101 to adjust the training parameters appropriately. The learning rate is $1 \times 10^{-5}$ and decays by a factor of 0.5 every six epochs. The batch size is set to 64, and the model is trained for 30 epochs. Subsequently, building on the trained model, we employ reinforcement learning-based methods to optimize the CIDEr metric. In this phase, the learning rate is set to $5 \times 10^{-5}$, the batch size is set to 64, and training runs for 30 epochs. During training, we evaluate our model on the validation set at the end of each epoch and save the model with the best result so far. The next phase of training then continues from the best-performing model of the previous phase. For testing, we select the model with the highest CIDEr score on the validation set and use beam search with a beam size of 5 to produce sentences.
If the performance fails to improve after 6 training epochs, training is terminated.
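The learning-rate decay and stopping rule described above can be sketched as two small helpers. Several details are assumptions: epochs are indexed from zero, the decay multiplies the rate by 0.5, and the stopping rule compares the best validation score of the last six epochs with the best earlier score. The names `step_decay_lr` and `should_stop` are hypothetical.

```python
def step_decay_lr(epoch, base_lr=1e-5, factor=0.5, step=6):
    """Learning rate for a zero-based epoch index under step decay."""
    return base_lr * factor ** (epoch // step)

def should_stop(val_scores, patience=6):
    """True if the validation score has not improved in the last
    `patience` epochs compared with the best earlier score."""
    if len(val_scores) <= patience:
        return False
    best_before = max(val_scores[:-patience])
    return max(val_scores[-patience:]) <= best_before

# Cross-entropy phase: rate for each of the 30 epochs.
schedule = [step_decay_lr(e) for e in range(30)]
```

With these settings the rate is halved at epochs 6, 12, 18, and 24, so the final six epochs run at one sixteenth of the initial rate.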

E. EXPERIMENTAL RESULTS AND ANALYSIS
We design ablation experiments to evaluate the effectiveness of our proposed model in image captioning; all metric scores are computed on the MSCOCO Karpathy test split.
First, as shown in Table 1, the improvement brought by CIDEr optimization is validated on our model. XE represents training under the cross-entropy loss function, and RL indicates the result of optimizing the scoring metric starting from the best XE-trained model. Second, two models, Bi-LSTM and Bi-LSTM-s, are used to verify the effectiveness of our auxiliary attention mechanism. Bi-LSTM employs only the visual attention mechanism, while Bi-LSTM-s incorporates S-Att into Bi-LSTM. All of the following results are obtained with consistent parameters in order to maintain fairness.
Tables 1-5 each represent a different experiment. In Table 1, we demonstrate that reinforcement learning brings a significant improvement in image captioning. In Tables 2 and 3, we demonstrate the superiority of our model. In Table 4, we verify the effect of averaging versus taking the maximum when semantic fusion occurs. Table 5 shows the influence of different hyperparameters on the experimental results.
The evaluation criteria B@1, B@2, B@3, B@4, M, R, S, and C denote BLEU-1 to BLEU-4, METEOR, ROUGE-L, SPICE, and CIDEr, respectively. BLEU calculates the n-gram similarity and penalizes sentences of insufficient length. METEOR focuses on the number of co-occurring words and applies a penalty based on word-order changes. ROUGE-L measures similarity by the longest common subsequence between the predicted sentence and the reference sentence. SPICE encodes the captions as scene graphs of objects, attributes, and relationships and scores the prediction by comparing them. CIDEr calculates the similarity based on the frequency of the words.
First, as shown in Table 1, training under the cross-entropy loss is compared with CIDEr optimization. Taking the CIDEr score as an example, the score of our Bi-LSTM model rises by 5.4, from 112.5 to 117.9, and that of our Bi-LSTM-s model rises by 2.7, from 118.6 to 121.3, indicating that optimizing CIDEr on top of cross-entropy training brings a significant enhancement. Second, Bi-LSTM-s improves on Bi-LSTM by 6.1 (from 112.5 to 118.6) under the cross-entropy loss and by 3.4 (from 117.9 to 121.3) under CIDEr optimization. This reflects that our subsidiary attention can efficiently extract and align the semantic relations of the forward and backward LSTMs and produce finer captions.
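The BLEU description given with the evaluation criteria (n-gram overlap combined with a penalty for short sentences) can be illustrated with a simplified single-reference sketch. It shows only the clipped n-gram precision and the brevity penalty, not the full multi-reference, corpus-level metric used to produce the reported scores; the function names and example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram's count clipped to its count in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def brevity_penalty(cand_len, ref_len):
    """Penalty below 1 when the candidate is shorter than the reference."""
    if cand_len == 0:
        return 0.0
    if cand_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)

candidate = "a man riding a wave".split()
reference = "a man is riding a wave".split()
p1 = clipped_precision(candidate, reference, n=1)
bp = brevity_penalty(len(candidate), len(reference))
```

Here every candidate word appears in the reference, so the unigram precision is perfect and only the brevity penalty lowers the score, which is the behavior the description refers to.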
Second, as shown in Table 2, our ablation experiments validate our model's superiority under constant experimental training parameters. Specifically, p-LSTM denotes two stacked LSTM layers processed together, in which the hidden state $h^{1}_{t}$ of the first layer is learned and passed to the second layer; the input gate, forget gate, and output gate of the second layer all take $h^{1}_{t}$ as input. The final hidden state at time $t$ is derived from the first layer's hidden state $h^{1}_{t}$ and the second layer's hidden state at the previous moment, $h^{2}_{t-1}$. According to the optimized CIDEr score, the Bi-LSTM model improves by 10.3 points. In our model, the hidden state $h_t$ is related not only to the current input but also to $h^{f}_{t-1}$ and $h^{b}_{t+1}$. Bi-LSTM considers previous and future information simultaneously; thus, it truly achieves context-based output.
To demonstrate the superiority of our model, we evaluate it on eight measures against six prominent methods, as shown in Table 3. First, the foundational models are established. The most typical model, NIC, does not include an attention mechanism. Soft-Attention introduces a soft attention mechanism into the task. SCA-CNN extends the attention mechanism from the spatial dimension to the channel dimension. SCST applies reinforcement learning to the optimization of sentence-level rewards. Second, pLSTM-A-2, DAIC, and our model build on the above models. pLSTM-A-2 encodes images using two separate encoders (MIML and CNN) and simultaneously merges the semantic information of the two decoders, resulting in more accurate and richer captions.
DAIC applies sentence-level and word-level attention to the encoder's image input, and the final output combines sentence-level and word-level information to generate more accurate captions. Our model employs a bidirectional LSTM as the decoder, accepts both past and future information simultaneously, and truly achieves prediction based on contextual information. It also employs two attention mechanisms: one dynamically extracts visual information and integrates it with semantic information, and the other, an auxiliary attention, aligns the semantic information of the bidirectional LSTM, contributing to more diversified semantic information. Our model displays significant advantages in scoring.
As shown in Table 4, we consider two ways of fusing the forward and backward hidden states: Max takes the maximum of the forward hidden state and the backward hidden state $h^{b}_{t+1}$, while Mean takes their average. The data reveal that taking the average is slightly better. When fusing forward and backward semantics using Max, considering only the forward or the backward state to obtain a single result causes insufficient semantics and the loss of partial semantics. On the other hand, Mean considers the shared scope of forward and backward while retaining the original semantic information, thereby achieving fused semantics.
As shown in Table 5, under the combined effect of dual attention, an overly large λ results in the extraction of unaligned forward and backward semantics, worsening the caption result. Conversely, a small λ leads to excessive reliance on similarity; the preceding semantics over-absorb the following semantics, degrading the caption result.
Fig. 4 depicts the visualization results, which illustrate our proposed approach, including the ground truth and the outputs of F-LSTM, B-LSTM, and Bi-LSTM-s fused with semantic features based on contextual output. It also displays the interaction between Visual-Att, which focuses on key image regions and text dependencies, and the extraction of keywords using the auxiliary attention mechanism. Moreover, a visualization is presented at the same time. All of the images are taken from the MSCOCO dataset. The results reveal that our model can effectively extract fine-grained information, such as "polka dot", "wooden benches", and "red chairs". F-LSTM extracts "polka dot", and the semantic features are fused into Bi-LSTM. F-LSTM extracts "wooden benches" and B-LSTM extracts "red chairs", and each complements the other in the output. S-Att extracts "girl" and "women", presenting a dependency of 0.85; they are then fused to complement the output. Likewise, "surrounded" and "topped" have a dependency of 0.62, and the two are fused to complement the output.
VOLUME 11, 2023
Fig. 5 depicts the fine-grained information extracted by our model. We set Bi-LSTM as the control group. The subsidiary attention mechanism effectively complements the forward and backward output hidden states with progressive output to obtain fuller semantics, such as "very" and "red", and predicts fine-grained information such as "stainless steel stove"; actions are also described more comprehensively, such as "leaning against". Fig. 6 shows photos taken from real life, from which our model can also extract fine-grained information.
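The Max and Mean fusion strategies compared in Table 4 can be sketched as follows. This is a minimal illustration assuming the fusion is applied element-wise to the forward and backward hidden-state vectors, which the text does not state explicitly; the function names and toy vectors are hypothetical.

```python
def fuse_mean(h_forward, h_backward):
    """Element-wise average of the two hidden-state vectors."""
    return [(f + b) / 2.0 for f, b in zip(h_forward, h_backward)]

def fuse_max(h_forward, h_backward):
    """Element-wise maximum of the two hidden-state vectors."""
    return [max(f, b) for f, b in zip(h_forward, h_backward)]

# Toy hidden states for one time step.
h_f = [1.0, -2.0, 0.5]
h_b = [3.0, 4.0, -0.5]
mean_fused = fuse_mean(h_f, h_b)
max_fused = fuse_max(h_f, h_b)
```

In this toy case the maximum keeps each component from only one direction, whereas the average retains a contribution from both, which mirrors the explanation given for why Mean scores slightly better.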

V. CONCLUSION
At present, the existing mainstream models take into account only the impact of previous information on sentences. The Bi-LSTM-s model is therefore proposed to efficiently extract both past and future information and thus make full use of context. Specifically, Bi-LSTM-s encodes the sentence context as the hidden states of F-LSTM and B-LSTM, respectively. S-Att then obtains the word similarity between the hidden states via the attention mechanism, performing semantic alignment, semantic complementarity, and semantic fusion output. Extensive experimental analysis on the MSCOCO dataset shows that our model fully extracts contextual information together with fine-grained information. Furthermore, we demonstrate the superiority of this strategy using a range of evaluation metrics.
However, the bidirectional LSTM still has limitations. First, it has a large number of parameters, which may lead to prediction delays in real-time tasks. Second, two basic LSTM cells still work inside the bidirectional LSTM, and a GRU, which has fewer parameters, could be considered as a replacement for the LSTM during training. At present, most training schemes favor the output of high-frequency words, which restricts the semantic information. In future work, we will focus on generating different constraints to produce fine-grained semantic information from a global perspective.