Imageability- and Length-controllable Image Captioning

Image captioning methods show great performance when generating captions for general purposes, but it remains difficult to adjust the generated captions to different applications. In this paper, we propose an image captioning method which can generate both imageability- and length-controllable captions. The imageability parameter adjusts the level of visual descriptiveness of the caption, making it either more abstract or more concrete. In contrast, the length parameter adjusts only the length of the caption while keeping the visual descriptiveness at a similar degree. Based on a transformer architecture, our model is trained on an augmented dataset with diversified captions across different degrees of descriptiveness. The resulting model can control both imageability and length, making it possible to tailor the output towards various applications. Experiments show that we maintain a captioning performance similar to comparison methods, while being able to control the visual descriptiveness and the length of the generated captions. A subjective evaluation with human participants also shows a significant correlation between the target imageability and human perception. Thus, we confirm that the proposed method provides a promising step towards tailoring image captions to specific applications.


I. INTRODUCTION
Image captioning shows great performance in generating captions for general purposes and receives great attention in the research community [15], [22], [43]. However, the requirements of different applications, such as news articles, social media, and assistive technology, can differ largely. It remains difficult to tailor the generated image captions to a variety of such applications. The reasons are manifold: First, image captioning approaches usually aim to generate captions close to those in the training data, and are then evaluated based on their similarity to the testing data. Both the datasets and the evaluation metrics are built under the assumption of general-purpose image captioning. This generally results in a very low diversity of generated captions, a problem some research has tried to tackle [9], [39], [41]. Second, the perception and style of the generated captions are rarely considered, although some research has looked into captioning styles and sentiment [3], [11], [24] and the visual descriptiveness of captions [36]. Recent research towards caption diversification proposes introducing control parameters, such as length-controllable models [7].
In this paper, we explore the diverse generation of image captions with two controllable parameters: imageability and length. First, we use imageability, a concept derived from Psycholinguistics [27] which describes whether a word evokes a clear mental image. Its usage for image captioning has been explored in our previous work [36], yielding promising results for customized image captions. In the context of captioning, it can be used to adjust the visual descriptiveness of captions, making them either a more abstract or a more concrete description of the scene. Second, length provides another dimension of customizability for different applications. While a news article might prefer a short, abstract caption, a caption for assistive technology would ideally be longer and more descriptive. Further, by introducing two controllable variables, the proposed model can adjust both dimensions individually. The overall idea is illustrated in Fig. 1, showing how different settings for imageability and length can yield vastly different captions. We believe that this step towards customized captioning is a promising direction for application-tailored captioning. This research builds on our previous work published in conference proceedings [36]. That initial work showed promising results for imageability-aware captioning with an LSTM-based architecture, yet still yielded a mixed correlation to human perception and often unnatural captions. In this follow-up research, we employ a transformer-based captioning model [46] in order to greatly improve the naturalness of the results, making it more viable for actual use in different applications. A data augmentation method similar to our previous work is used to diversify captions regarding visual descriptiveness.
Furthermore, a length-controllable parameter [7] is newly introduced in order to allow adjusting the generated captions along a second dimension. With this, our combined model allows for customization across two dimensions independently. Note that imageability and length encode different things: changing imageability aims to change the visual descriptiveness of the caption at the same length, while changing length aims to change the wordiness while keeping the contents similar. As such, we believe the proposed method, being able to control them individually, is a promising first step towards tailoring captions to individual applications with different needs regarding contents and descriptiveness. The evaluations show a greatly improved performance when generating customized captions, beating comparison methods. In particular, a crowd-sourced subjective evaluation shows a significant improvement over our previous work [36], now closely correlating with the intended perception of the generated captions.
Our contributions can be summarized as follows:
• We propose an imageability- and length-controllable image captioning framework which can create diverse captions closely tailored to various applications.
• To the best of our knowledge, this is the first captioning framework which allows adjusting both imageability and length independently.
• The evaluation shows a significant improvement over our previous work for imageability-aware image captioning, partially due to the introduction of the transformer-based model.

II. RELATED WORK
In this section, we discuss related work regarding image captioning and imageability. The related work on image captioning can be categorized into general-purpose image captioning and affective image captioning. While the former simply tries to summarize an image in a short sentence, the latter puts focus on attributes like emotion/sentiment, style, user-feedback, or descriptiveness. A rough overview of the introduced work is visualized in Fig. 2.

General-purpose image captioning
With the rise of deep learning-based models such as Long Short-Term Memory (LSTM) [14], general-purpose image captioning [16], [40], [43] achieved a great boost in performance.
More recently, transformer models [10], [37] using an attention mechanism have attracted researchers' attention due to their very high performance in many natural language processing tasks. Following this trend, many recent state-of-the-art models for image captioning [18], [46], [47] make use of a transformer-based architecture.
Zhou et al. [46] combine a transformer model with attention on visual features extracted from images [18], [32] for image captioning, yielding very promising performance. Most recently, Cornia et al. [5] and Pan et al. [28] added more sophisticated attention modules to further improve the performance of transformer-based image captioning.

Affective image captioning
Rather than performing neutral contents-based image captioning for general-purpose usage, some research has focused on image captioning in the context of affective computing, covering aspects such as emotions and impressions [3]. These works can be loosely categorized into four kinds of affective output: First, Mathews et al. [24] propose a method which allows for customizing sentiment, yielding captions with positive or negative sentiment.
Second, Gan et al. [11], Guo et al. [13], and Zhao et al. [45] explore the generation of styles such as humorous or romantic, which is further extended in a transformer-based model [34] to concepts like sweet, dramatic, anxious, arrogant, and so on.
Third, a different approach has been investigated by Cornia et al. [4], which allows user-interactive captioning where the user can specify image areas to be explained in a caption as well as their order. Chen et al. [2] propose similar ideas where scene graphs are used to fine-tune customized image captions.
Lastly, some approaches [7], [36] target specifying the detail and amount of output. Deng et al. [7] propose a length-controllable transformer model which can generate captions with fixed contents but flexible length. In our previous work [36], we proposed an image captioning method which can control the imageability of the generated captions. Imageability is a concept derived from Psycholinguistics, first introduced by Paivio et al. [27], describing how easy it is to mentally imagine a word. It has received some attention in research on multi-modal analysis [25], [44], providing a promising opportunity to use it as a parameter for customized captioning. In this research, we target the last discussed category of affective image captioning, proposing a method which allows for a high degree of customizability in the descriptiveness of outputs. We build upon our previous work [36] on imageability-aware captioning using an LSTM-based model. We greatly improve the performance and naturalness of the generated captions by introducing a transformer-based captioning model [46]. As an additional parameter, we further introduce length-controllable captioning [7] to build a model which can generate captions with two independent parameters of customization.

III. IMAGEABILITY- AND LENGTH-CONTROLLABLE IMAGE CAPTIONING FRAMEWORK
In this section, we introduce the proposed framework for imageability- and length-controllable image captioning. For the imageability-controllable parameters, an augmented dataset with a high diversity in visual descriptiveness is needed. The augmentation and caption imageability estimation used in our method are largely based on our previous work [36], but are briefly introduced in Sec. III-A, as this task is specialized and has not yet received widespread attention. The proposed model itself is introduced in detail in Sec. III-B.
A flowchart of the method is illustrated in Fig. 3.

A. DATASET PREPARATION
In the following, we discuss the dataset needed for the proposed method. While the length embedding of the framework is based on length-aware caption decoders as proposed by Deng et al. [7], the knowledge used for the imageability embedding is trained on a diversified dataset. Thus, we first use a data augmentation technique to increase the number of captions in the dataset. The main focus lies on increasing the variety of visual descriptiveness of the captions. To this end, we substitute information with more abstract terms, making captions more abstract for training. Next, the caption imageability is calculated for each caption, which is used for the imageability embedding during training.

1) Data Augmentation
Existing image captioning datasets such as Microsoft COCO [20] and Flickr30k [30] usually come with multiple captions for each image. However, there is typically not much diversity in terms of visual descriptiveness and each existing caption describes the image in a roughly similar way. For imageability-controllable captioning, we are interested in a large variety of descriptions, from abstract to visually descriptive. Imageability as a concept derived from Psycholinguistics [27] describes whether a word gives a clear mental image. For this research, we assume a rough relationship between visual descriptiveness and imageability, and thus use it to approximate a metric for visual descriptiveness. For a low target imageability, an ideal description would be something rather abstract, not mentioning many visual details. In contrast, for a high target imageability, a very detailed description of visual details in the caption would be expected.
To emulate this idea, the augmentation process substitutes words in existing captions with more abstract terms. With the help of the transformer architecture, the augmented data can then help the network identify abstract language and how it changes captions. Similar to our previous work [36], each noun in a given caption is substituted by its hypernym according to the WordNet [26] hierarchy. We replace a noun with up to five levels of hypernyms in order to generate additional captions. Note that we avoid going too close to the WordNet root node by removing the top-most two layers, as terms like object or item become too abstract for meaningful training. For captions with multiple nouns, we generate augmented captions for each noun separately. The idea is visualized in Fig. 4: from WordNet [26], we extract a hierarchy of hypernym terms for each noun in the existing captions. We pick up to five replacements for each noun, e.g., replacing pasture with the terms {area, location, region, field, grassland}. Note that we avoid replacements too close to the WordNet root node, as they would become too abstract. As such, grass will only be augmented by {food, plant}, but not with item or object, which lie above. This process is repeated for all nouns in every caption to create an augmented dataset with more abstract wordings.
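The hypernym-substitution step above can be sketched in a few lines. The actual method queries the WordNet hierarchy; here, a small hand-coded hypernym table stands in so the example is self-contained, and the captions, the table, and the abstractness cutoff are illustrative assumptions:

```python
# Sketch of the hypernym-based caption augmentation (Sec. III-A1).
# TOY_HYPERNYMS stands in for WordNet [26] hypernym chains; entries
# and captions are illustrative assumptions.
TOY_HYPERNYMS = {
    "cat": ["feline", "carnivore", "mammal", "vertebrate", "animal",
            "organism", "entity"],
    "dog": ["canine", "carnivore", "mammal", "vertebrate", "animal",
            "organism", "entity"],
}
# Emulates removing the top-most layers of the WordNet tree:
TOO_ABSTRACT = {"organism", "entity"}
MAX_LEVELS = 5  # up to five hypernym levels per noun

def augment_caption(caption, nouns):
    """Return abstracted caption variants, substituting one noun at a time."""
    variants = []
    for noun in nouns:
        for hypernym in TOY_HYPERNYMS.get(noun, [])[:MAX_LEVELS]:
            if hypernym in TOO_ABSTRACT:  # stop before overly abstract terms
                break
            variants.append(caption.replace(noun, hypernym))
    return variants

print(augment_caption("a cat is sitting on a window sill", ["cat"]))
```

Each noun yields up to five increasingly abstract captions, which together form the augmented training set.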

2) Caption Imageability Estimation
In order to learn the relationship between an image and the visual descriptiveness of a caption, we calculate the caption imageability. The basic idea is to use the imageability values of the individual words composing the caption to calculate a value representative of the whole caption. Existing imageability dictionaries such as [6], [31], [33], [42] describe imageability on a Likert scale (e.g., on an interval of [1, 7] or [1, 5]) from very unimaginable to very imaginable.
For caption-imageability estimation, we follow the same approach as in our previous work [36]. We start with a caption from the dataset and assume available imageability labels for all its individual words. As this is a strong assumption, we skip stop-words, numerals, and similar. For our experiments, we target the English language, which also influences some design decisions discussed onwards, but an adjusted process is expected to work for other languages, too. We generate a parsing tree using the Stanford CoreNLP [23] framework. Next, we employ a bottom-up approach which calculates a sentence imageability score from all its words' imageability values along the parsing tree. We assume nouns to become more descriptive when being modified by adjectives (e.g., "black cat" being a less visually ambiguous description than "cat"). For multiple words on the same level of the parsing tree, we define a simple rule set for weighting: 1) If there are one or more nouns, the last noun is the most significant and weighted the highest (e.g., "cold apple juice" are modifications of "juice"). 2) If there is no noun, the first word is the most significant and weighted the highest (e.g., "run fast" is a modification of "run"). We calculate the imageability of sub-trees as the weighted combination

x = (w_s x_s + Σ_{i=1, i≠s}^{n} w_i x_i) / (w_s + Σ_{i=1, i≠s}^{n} w_i),

where x_i (i = 1, . . . , n; i ≠ s) is the score of each modifying word with weight w_i, and x_s is the score of the most significant word, whose weight w_s is set highest. This process is repeated bottom-up until reaching the root node of the parsing tree. Lastly, the results are normalized using f(x) = 1 − e^{−x}.
We employ this method and calculate the caption imageability values for all captions in the augmented dataset.
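The bottom-up scoring and normalization can be sketched as follows. The word imageability values, the flat "parse level", and in particular the concrete modifier weight are illustrative assumptions (the exact weighting follows [36]); the normalization f(x) = 1 − e^(−x) is taken directly from the text:

```python
import math

# Sketch of the caption-imageability estimation (Sec. III-A2).
# Word values and the modifier weight below are toy assumptions;
# the actual method walks a CoreNLP parse tree as in [36].
WORD_IMAGEABILITY = {"black": 0.61, "cat": 0.88, "cold": 0.55,
                     "apple": 0.86, "juice": 0.79}

MODIFIER_WEIGHT = 0.5  # assumed; the most significant word gets weight 1.0

def subtree_score(words, significant_idx):
    """Weighted combination of word scores on one parse-tree level,
    weighting the most significant word (rules 1/2 in Sec. III-A2) highest."""
    total = weight_sum = 0.0
    for i, w in enumerate(words):
        wgt = 1.0 if i == significant_idx else MODIFIER_WEIGHT
        total += wgt * WORD_IMAGEABILITY[w]
        weight_sum += wgt
    return total / weight_sum

def normalize(raw_score):
    # Normalization used in the paper: f(x) = 1 - e^(-x)
    return 1.0 - math.exp(-raw_score)

# "cold apple juice": the last noun, "juice", is most significant (rule 1)
score = subtree_score(["cold", "apple", "juice"], significant_idx=2)
print(round(normalize(score), 3))
```

For nested parse trees, `subtree_score` would be applied bottom-up, feeding each sub-tree's score into its parent level before the final normalization.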

B. CAPTIONING MODEL
For the captioning model, we employ a BERT-based transformer model [46]. Deng et al. [7] apply this model for length-controllable captioning, where they add a layer of length-embedding to the language features. Inspired by this, we add an extra layer of imageability-embedding based on the augmented dataset with caption imageability estimations. Our proposed model is illustrated in Fig. 5.
First, we introduce each type of embedding and the features used for the training.

1) Length embedding
The length embedding is implemented in the same fashion as proposed by Deng et al. [7].
For a caption C = {c_i}_{i=1}^{N}, we assign C a length level within the range [L_low, L_high] according to its length N. Then, the length-embedding matrix W_l ∈ R^{k×d} (with k being the number of length levels and d being the embedding dimension) is trained to differentiate image captions on different length levels.
A one-hot vector t_l ∈ R^k for the length level l is generated. The length embedding is then defined as

e_len = W_l^T t_l ∈ R^d.

2) Imageability embedding
Inspired by the length embedding discussed before, we implement an imageability embedding in the same way. For each caption, we generate an imageability embedding based on the caption imageability estimation obtained in Sec. III-A. We assign an imageability level i to a caption within a range of (I_low, I_high] according to its caption imageability I. Through this, the existing caption imageability annotations are binned into evenly-sized levels. The imageability-embedding matrix W_i ∈ R^{a×d} (with a being the number of imageability levels and d being the embedding dimension) is trained to differentiate image captions on different imageability levels. t_i ∈ R^a represents a one-hot vector for the imageability level. Finally, the imageability embedding becomes

e_imag = W_i^T t_i ∈ R^d.
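Both embeddings amount to binning a caption into a level and looking up a row of a trained matrix via a one-hot vector. The sketch below uses the level counts from the experimental setup (four length levels, five imageability levels), but the length ranges beyond L-1 and the embedding dimension are assumptions, and the matrices are random stand-ins for trained parameters:

```python
import numpy as np

# Sketch of the length/imageability embedding lookup (Sec. III-B1/2).
k, a, d = 4, 5, 8            # 4 length levels, 5 imageability levels; d assumed
rng = np.random.default_rng(0)
W_l = rng.standard_normal((k, d))  # length-embedding matrix W_l in R^{k x d}
W_i = rng.standard_normal((a, d))  # imageability-embedding matrix W_i in R^{a x d}

def one_hot(level, n):
    t = np.zeros(n)
    t[level] = 1.0
    return t

def length_level(n_words, bounds=((7, 9), (10, 14), (15, 19), (20, 25))):
    """Bin a caption length into one of k levels. Only the L-1 range
    [7, 9] is given in the paper; the remaining bounds are assumptions."""
    for lvl, (lo, hi) in enumerate(bounds):
        if lo <= n_words <= hi:
            return lvl
    return k - 1

lvl = length_level(8)                  # an 8-word caption falls into L-1
e_len = W_l.T @ one_hot(lvl, k)        # e_len in R^d
e_imag = W_i.T @ one_hot(2, a)         # e.g., imageability level I-3
print(e_len.shape, e_imag.shape)
```

In training, gradients flow into the selected rows of W_l and W_i, so each level learns its own d-dimensional embedding.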

3) Visual features
The model applies a Faster-RCNN [32] network pretrained on the Visual Genome dataset [17] to extract visual features. Using this object detection model, M objects are detected in the image, and a region-feature embedding e_v,j ∈ R^d as well as a location embedding e_p,j ∈ R^d are obtained for each object in the image.
The visual features are then defined as x_{v_j} = e_v,j + e_p,j.

4) Language features
For an input caption C = {c_i}_{i=1}^{N}, with c_i representing each word in the caption, we use a BERT-based model [46] to obtain a word embedding e_w,c_i ∈ R^d and a location embedding e_p,i ∈ R^d.
The length and imageability embeddings are added to the language features, which are defined as x_{c_i} = e_w,c_i + e_p,i + e_len + e_imag.

5) Model training
The proposed model is based on the language generation model by Ghazvininejad et al. [12]. For a correct caption C = {c_i}_{i=1}^{N}, tokens are randomly replaced by the token [MASK], and the transformer network is fed with the masked caption. Next, the pair of visual and language features is fed into the network, predicting the masked tokens. The model is trained by minimizing the cross-entropy loss between the correct token t_i of the ground-truth caption and the masked-in token c_i, as expressed by

L = − Σ_{i=1}^{N} 1(c_i = [MASK]) log p(t_i),

where p(t_i) is the predicted probability of the ground-truth token at position i, and 1(c_i = [MASK]) is an indicator function that is 1 only when c_i = [MASK], and 0 otherwise.
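A minimal sketch of this masked cross-entropy objective follows; the toy vocabulary size, the uniform predictions, and the mask pattern are all assumptions standing in for the transformer's output:

```python
import numpy as np

# Sketch of the masked-token training loss (Sec. III-B5): cross-entropy
# is accumulated only at positions whose input token was [MASK].
def masked_ce_loss(log_probs, targets, mask):
    """log_probs: (N, V) predicted log-probabilities per position.
    targets: (N,) ground-truth token ids t_i.
    mask: (N,) the indicator, 1 where the input token was [MASK], else 0."""
    picked = log_probs[np.arange(len(targets)), targets]
    return -(mask * picked).sum()

V = 6                                          # toy vocabulary size
log_probs = np.log(np.full((4, V), 1.0 / V))   # uniform toy predictions
targets = np.array([1, 3, 2, 5])               # toy ground-truth tokens
mask = np.array([0, 1, 1, 0])                  # two positions were masked
print(masked_ce_loss(log_probs, targets, mask))
```

With uniform predictions, each masked position contributes log V to the loss, so unmasked positions are correctly ignored by the indicator.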

6) Caption generation
Following Ghazvininejad et al. [12], we use the "Mask-Predict-Update" method to generate captions. Initially, the whole caption is masked with [MASK] tokens. The feature embeddings are fed into the transformer network in order to predict the masked positions and their most suitable words from the vocabulary. The process is repeated iteratively until the whole caption is generated.
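The iterative decoding loop can be sketched as follows. The toy predictor stands in for the transformer and, as a simplification, fills one masked position per iteration; which positions are filled per step in the real model follows the confidence-based scheme of [12]:

```python
# Sketch of iterative "Mask-Predict-Update"-style decoding (Sec. III-B6,
# following Ghazvininejad et al. [12]). The predictor is a toy stand-in.
MASK = "[MASK]"

def toy_predictor(tokens):
    """Assumed stand-in for the transformer: returns (position, word)
    for the first masked slot, drawn from a fixed toy caption."""
    target = ["a", "cat", "on", "a", "sill"]
    for pos, tok in enumerate(tokens):
        if tok == MASK:
            return pos, target[pos]
    return None

def mask_predict(length):
    tokens = [MASK] * length       # start from a fully masked caption
    while MASK in tokens:          # iterate until no masks remain
        pos, word = toy_predictor(tokens)
        tokens[pos] = word         # fill in the predicted token
    return tokens

print(" ".join(mask_predict(5)))
```

Because the caption length is fixed up front by the number of [MASK] tokens, this decoding scheme pairs naturally with the length-level control described earlier.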

IV. EVALUATION
In this section, we evaluate our proposed image captioning method. After discussing the environment in Sec. IV-A, we illustrate some generated captions of the proposed method in Sec. IV-B. We then evaluate the approach from three angles: First, Sec. IV-C discusses the performance of the model measured by general-purpose image captioning metrics. The length-controllable transformer-based method has already been extensively evaluated in [7]. Therefore, for the second and third experiments, we focus on a deeper evaluation of the imageability-controllable part of the transformer-based model and its differences from the previous LSTM-based work [36] when generating captions with different visual descriptiveness. As such, Sec. IV-D discusses the imageability diversity of the generated captions, and Sec. IV-E the performance in a crowd-sourced human evaluation.

A. ENVIRONMENT a: Datasets
We employ the Microsoft COCO [20] dataset as a baseline for the data augmentation. For training and testing, we use the Karpathy splits [16]. The extended dataset is generated as discussed in Sec. III-A1, aiming for twenty captions per image. For the imageability estimation of captions, we employ two imageability dictionaries by Ljubešić et al. [21] and Scott et al. [33]. As the former is a large estimated dictionary while the latter is a small crowd-sourced one, we favor the ground-truth imageability of the latter dictionary in case of overlaps. Images which did not yield sufficient numbers of captions through data augmentation or did not have enough available imageability word annotations were excluded from the experiments. We end up with 109,115 images for training, 4,819 images for validation, and 4,795 images for testing. For the controllable length parameter, we define four length levels L-1 to L-4, with L-1 covering captions of length [7, 9]; for the controllable imageability parameter, we define five imageability levels I-1 to I-5.

b: Evaluated variants
We evaluate all combinations of L-x and I-x regarding their qualitative and quantitative results. We furthermore evaluate a variant where we only use the imageability-controllable features I-x and exclude the length embedding. The reason for this is that the length-controllable transformer model has already been exhaustively evaluated in [7], while the imageability-controllable part of the transformer model is a contribution of this paper.

TABLE 2. Example captions for three images at different imageability levels (length embedding excluded), comparing our previous work Tell As You Imagine [36] and the proposed method.

Image 1:
  Tell As You Imagine [36] — Low: A placental is sitting on a window sill. Mid: A feline is sitting on a window sill. High: A cat is sitting on a window sill.
  Proposed method — Low: A close up of a cat near a glass window sill. Mid: A vertebrate is looking out of a window. High: A brown and white cat sitting on a window sill.
Image 2:
  Tell As You Imagine [36] — Low: A large brown canine laying on top of a beach. Mid: A large brown canine laying on top of a beach. High: A large brown dog laying on top of a beach.
  Proposed method — Low: A close up of a canine laying on a beach. Mid: A carnivore laying on the ground in the sand. High: A brown and white dog laying on a beach.
Image 3:
  Tell As You Imagine [36] — Low: An organism swinging a baseball bat at a baseball. Mid: An organism swinging a baseball bat during a baseball game. High: A baseball player swinging a bat at a ball.
  Proposed method — Low: A concoction getting ready to swing at a pitch. Mid: A male is up to bat during a baseball game. High: A baseball person holding a bat on a field.

c: Comparison methods
For comparison, we tested a selection of methods from related work on the same datasets. First, we want to understand how the performance of our imageability- and length-controllable captioning method compares to general-purpose captioning. Thus, in Sec. IV-C, we compare our results to a general-purpose method, "Show, Attend, and Tell" (SAT) by Xu et al. [43], the length-controllable approach LaBERT by Deng et al. [7] (using their best-performing variant L-2 for the comparison), as well as the general-purpose methods X-Transformer by Pan et al. [28] and M² by Cornia et al. [5].
Second, we include our previous work "Tell As You Imagine" (TAYI) [36], which generates imageability-aware captions using an LSTM-based approach. This work is not trained on grouped imageability levels, but can generate captions for individual imageability values I ∈ {0.5, 0.6, . . . , 0.9}. To yield comparable output, similar to the way we defined levels in the proposed method, we generate captions for Low (I = 0.5), Mid (I = 0.7), and High (I = 0.9). We use this as the main comparison method for the experiments in Sec. IV-D and IV-E, as it is, to the best of our knowledge, the only related work tailoring its output to imageability.

B. QUALITATIVE EVALUATION
Before looking into the quantitative metrics, we showcase some examples of the output of the proposed method. Table 1 shows the output for an example image where imageability and length parameters were adjusted at the same time. We can see that the customization works well in both dimensions, allowing for a promising way to tailor the model output to individual needs of applications. Note that this also results in a high caption diversity, which could be useful for many applications. To the best of our knowledge, there is no other method which can generate both imageability- and length-controllable captions.

TABLE 3. Evaluation through general-purpose image captioning metrics. The proposed method is compared to [36], which is the only other related work aiming at imageability-aware captioning, and to [5], [7], [28], [43] in order to compare performance against general-purpose captioning models. Due to the very different style of captions generated for different levels of imageability, the scores are split into three groups, highlighting the average performance for a low, mid, and high target imageability. The bold values correspond to the highest value among the imageability-aware methods.

C. QUANTITATIVE EVALUATION
TAYI [36] is the only related work targeting imageability-aware captioning. We compare it to our proposed model in Table 2. In this case, we excluded the length embedding, yielding results which roughly resemble those of length level L-2. As we can see, the output of our method vastly outperforms this comparison method, producing much more natural results. This is mostly a result of the switch to a transformer-based architecture compared to the LSTM used in the comparison method.
For length-controllable captions, LaBERT [7] provides an exhaustive analysis. As our architecture without the imageability embedding is largely identical to their setup, we thus skip a more detailed analysis of this parameter.
Overall, the imageability-aware models yield a reasonable performance across all metrics, despite the more recent general-purpose methods outperforming them. As the proposed method addresses the specialized task of imageability- and length-controllable captioning, we did not expect to achieve the best performance in these metrics. Rather than performing best, we aim for a reasonable performance while providing an additional dimension of customizability. Note that most of the evaluation metrics do not reward, but rather punish, diverse captions and style changes, as the evaluation is based on a direct comparison to a ground-truth annotation. As such, methods aiming for diversification or affective computing commonly degrade slightly in such metrics by their nature. LaBERT [7] outperformed our proposed method in most metrics, but the results are close enough to verify a similar performance. As we were interested in general-purpose performance, we used the best-performing variant (L-2) of their model.
Newer architectures such as [5], [28] further outperform the proposed method. Because of this, future research could investigate whether these architectures could also be beneficial for imageability-aware captioning.
Note that the nature of the approach, purposefully changing the contents of the output, naturally decreases performance in terms of these general-purpose image captioning metrics.
We can also see a great improvement over TAYI [36], which also aimed for imageability-aware captioning. Here, the proposed method outperformed the comparison method on all metrics.

D. EVALUATION OF IMAGEABILITY-CONTROLLABLE CAPTIONS
In this experiment, we evaluate the imageability-controllable captions. Here, we analyze the variety of the generated captions.
The results are shown in Table 4. We can see that the proposed method yields an overall increased variety of captions. While TAYI [36] aims at generating individual results for imageability values in {0.5, 0.6, . . . , 0.9}, most will actually result in very similar or identical captions. Similarly, the range of output imageability is rather compact. In contrast, the proposed method can generate a higher variety of diverse captions, yielding up to five distinct captions (i.e., usually having an individual result for each imageability level I-1 to I-5). Furthermore, the span of imageability is higher, leading to a perceptibly larger difference between the generated captions.

E. SUBJECTIVE EVALUATION
Lastly, in this section, we explore the human perception of the generated captions. As the imageability-controlled captions are expected to have a varying degree of visual descriptiveness, we are interested in whether this intended effect matches the perception of users when reading the captions. We therefore performed a crowd-sourced subjective evaluation in which we asked participants to judge pairs of captions regarding how easy they are to visually imagine. Note that we do not include other related methods such as SAT [43] in the comparison, as those methods provide no meaningful way to generate multiple captions with different perceptions (such as visual descriptiveness). As such, we compare our results only to TAYI [36], which is the only related work with such a parameter.

We generated three English captions each for 195 images, corresponding to the Low (I-1), Mid (I-2), and High (I-5) imageability levels as discussed before. Using Amazon Mechanical Turk, we asked participants to perform a Thurstone's paired comparison task [35], judging which caption is easier to visually imagine based on its textual contents. Note that we do not show the actual image, because we also want to see whether a high imageability might help make a caption more suitable for assistive technologies. For each pair, we asked fifteen US participants in order to obtain a meaningful majority decision. The human judgements were compared to the intended imageability values using Pearson correlation.

The results are shown in Table 5. The values in the right half of the table show the distribution of fully matching, half-matching, inverse-half-matching, and inverse-fully-matching pairs between our intended imageability and human perception. The avg. column shows the overall correlation for each method. The proposed method vastly outperformed the comparison method, resulting in an average correlation of 0.70 compared to a correlation of 0.36 for the comparison method.
TABLE 5. Results of the subjective evaluation, comparing the proposed method to [36], which is the only other related work aiming at imageability-aware captioning. In the survey, participants were asked to judge the mental image of a pair of captions. The results show the correlation between the human perception of generated captions and the target imageability. For this experiment, the length embedding is excluded, using only the imageability-controllable setting.

Note that the 95% CI column shows the 95% confidence interval for each method. As discussed before, TAYI uses an LSTM-based architecture while our method uses a transformer-based architecture, resulting in a well-improved performance. Together with the more natural results illustrated in Table 1, we believe that the proposed method provides a meaningful framework useful for many real-world applications.
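The correlation analysis above boils down to a plain Pearson correlation between intended imageability levels and aggregated human judgements. The sketch below uses toy numbers, not the study's data:

```python
import math

# Sketch of the correlation analysis in Sec. IV-E. All numbers below
# are toy assumptions standing in for the crowd-sourced judgements.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

intended = [1, 2, 5, 1, 2, 5]                 # I-1 / I-2 / I-5 per caption
perceived = [0.2, 0.5, 0.9, 0.3, 0.4, 1.0]    # toy majority-vote scores
print(round(pearson(intended, perceived), 2))
```

A value near 1 indicates that captions generated for higher imageability levels are indeed judged easier to visually imagine.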

V. CONCLUSION
In this paper, we proposed a transformer-based method to generate diverse image captions with two controllable dimensions: First, building upon our previous work on imageability-aware captioning [36], we use imageability as a parameter to change the degree of visual descriptiveness of a generated caption. Second, inspired by recent work on length-controllable captioning [7], we use length as another parameter to modify the length of a caption independent of the degree of visual descriptiveness. Imageability and length encode two different angles: changing imageability aims to change the visual descriptiveness of the caption at the same length, while changing length aims to change the wordiness while keeping the contents similar. The resulting model is, to the best of our knowledge, the first model which can generate a variety of differently-perceived captions tailored to various applications.
In the experiments, the proposed method showed a promising performance for generating captions across different lengths and imageability values. A subjective evaluation with human participants verified a vastly improved performance compared to an existing method. This shows that the transformer architecture in combination with imageability as a prior can successfully learn the human perception of sentences regarding the degree of visual descriptiveness. For future work, it could be interesting to look into other transformer-based architectures such as [5], [28].