Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?

Recent advances in visual representation learning have produced an abundance of powerful off-the-shelf features that are ready to use for numerous downstream tasks. This work aims to assess how well these features preserve information about objects, such as their spatial location, their visual properties and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module which is trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation using ground truth. Our main findings are two-fold. First, despite excellent performance on classical proxy tasks, such representations fall short of solving complex reasoning problems. Second, object-centric features better preserve the critical information necessary to perform visual reasoning. With our proposed framework, we show how to approach this evaluation methodically.


Introduction
Visual representation learning has gained a lot of attention thanks to advanced frameworks demonstrating unprecedented results without relying on explicit supervision. On the one hand, classical self-supervised methods [1,2,3,4,5] producing localized features (i.e., features that densely correspond to regions of the image) are exhaustively evaluated on standard tasks like image classification or object detection, where they perform on par with supervised ones. On the other hand, more recent unsupervised systems [6,7,8,9] aiming at learning object-centric representations (i.e., each feature is associated with an object in the image) are typically evaluated on instance segmentation, where benchmarks are saturated.
In this work, we investigate how well off-the-shelf features extract meaningful information about the objects in a given image. We propose to evaluate the features' ability to model objects through the performance of a reasoning module trained for Visual Question Answering (VQA). To this end, we introduce a new VQA evaluation protocol based on a simple attention-based reasoning module learned on top of the frozen visual features to be evaluated. Similar to feature evaluations that use shallow networks for predictions, we aim to decouple visual extraction from reasoning. For a fair comparison, we limit the size of the visual features to a small and fixed number. Additionally, we investigate the effect of the training set size to determine the minimum size required to learn reasoning patterns and whether some types of visual representations are better than others at preventing the learning of spurious correlations.
Our VQA evaluation enables us to compare off-the-shelf features, either densely extracted local features or object-centric representations, and to make several findings that are the main contributions of this paper. First, although such representations perform excellently on classical proxy tasks, their VQA performance in our constrained setup is far from that attained by the ground truth or by the state of the art obtained using dedicated architectures. With this observation, we stress the importance of using complex reasoning tasks, such as VQA, as a complementary evaluation protocol for testing the effectiveness of off-the-shelf image representations. Second, we find that object-centric features are better suited for visual reasoning than local features. Although this is conceptually expected, we provide an empirical way to exhibit this behaviour. Finally, a limited training set size prevents learning correct reasoning patterns with all visual representations, although explicit representations seem to be better at preventing the learning of spurious correlations than implicit ones.
Our goal is to examine to what extent different off-the-shelf image representations are capable of encoding the information needed for reasoning. We examine popular off-the-shelf features that we split into two groups: (i) the classical dense sets of features localized on a grid-like structure, which we refer to as LOC, and (ii) object-centric features which can be associated with objects in the scene, denoted here as OBJ. In the case of learned local features, we further distinguish 2 types of learning processes. We examine the standard approach of transferring features obtained from backbones pre-trained for classification on ImageNet [10], which we denote by IN. We also study features obtained through more recent self-supervised learning frameworks, denoted by SSL.

Evaluation framework
Pipeline overview. We build our pipeline upon the paradigm of disentangling reasoning from vision and language understanding [11]. Our framework serves as plug-and-play for unimodal encoders and consists of 3 separate modules, for text, vision and reasoning respectively. First, given a question-image pair, we use frozen text and visual encoders to extract features and map them to a common multimodal space. Then, the concatenated features are fed to the reasoning module, which predicts the answer. An overview of our pipeline is presented in Figure 1.
We use a predetermined and fixed text encoder to ensure a fair comparison of the visual features. Precisely, we use the RoBERTa language model introduced in [12]; however, we note that this could be any other language model that produces a question representation as a sequence of text tokens.
The visual module maps an input image to a set of visual tokens V in a two-step process. First, an image encoder extracts a visual representation which is seen as a sequence of feature tokens. Then, to ensure that all features are given the same complexity, we design a module called memory adapter whose goal is to convert the extracted representations to a fixed-size input.
Inspired by the state-of-the-art model MDETR [13], our reasoning module is a transformer encoder-decoder [14] which operates on the text and visual token sequences Q and V. To allow our reasoning module to distinguish between modalities, we add to each token a modality-specific segment embedding which is a learnable parameter. In the case of local features, we also add a learned positional encoding to incorporate spatial information, as they do not inherently have such information.
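The way the two token sequences are prepared for the reasoning module can be illustrated with a minimal NumPy sketch. It assumes that Q and V have already been projected to a common dimension d; the function name `assemble_reasoning_input`, the array shapes and the random stand-ins for the learnable embeddings are hypothetical and only serve the illustration.

```python
import numpy as np

def assemble_reasoning_input(q_tokens, v_tokens, seg_q, seg_v, pos_v=None):
    """Add modality-specific segment embeddings and concatenate Q and V.

    q_tokens: (N_q, d) text tokens, v_tokens: (N_v, d) visual tokens.
    seg_q, seg_v: (d,) segment embeddings (learnable in a real model).
    pos_v: optional (N_v, d) positional encoding, used for local features only.
    """
    q = q_tokens + seg_q              # broadcast over the text tokens
    v = v_tokens + seg_v              # broadcast over the visual tokens
    if pos_v is not None:             # LOC features carry no spatial information
        v = v + pos_v
    return np.concatenate([q, v], axis=0)

# Illustrative shapes only (assumptions, not the values used in the paper).
rng = np.random.default_rng(0)
d = 8
q_tokens = rng.normal(size=(5, d))
v_tokens = rng.normal(size=(9, d))
seg_q, seg_v = rng.normal(size=d), rng.normal(size=d)
pos_v = rng.normal(size=(9, d))

x_loc = assemble_reasoning_input(q_tokens, v_tokens, seg_q, seg_v, pos_v)  # LOC
x_obj = assemble_reasoning_input(q_tokens, v_tokens, seg_q, seg_v)         # OBJ
```

In this sketch the only difference between the LOC and OBJ inputs is the positional term added to the visual tokens.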
Visual memory adaptation. To fix the input memory size of V, we apply a two-step process. First, depending on whether the visual encoder is OBJ or LOC, the sequence length $N_v$ may differ. Thus, we restrict $N_v$ to be roughly equal to the maximum number of objects in a scene. Let $K$ denote the maximum number of objects that can appear in a scene. If $N_v > K$, which is the case for LOC due to the grid-like structure of the local features, we apply adaptive average pooling on the features until we roughly match $K$.
Second, to ensure the reasoning module operates on compact representations of similar sizes in memory, we constrain the dimension of the visual tokens. Let $d_v^{\min}$ be the minimum visual token dimension needed to solve the task, which corresponds to the number of object properties plus the object's position in the scene for relational reasoning. If $d_v > d_v^{\min}$, we compress the visual tokens using Principal Component Analysis (PCA) and decrease the dimensionality to match $d_v^{\min}$. Therefore, we define the minimum memory size $M$ required to solve the task as $M = K \times d_v^{\min}$. The memory adapter converts the output of the visual encoder to a fixed memory size within a few orders of magnitude of $M$ by relaxing the $d_v^{\min}$ constraint, since visual features are not expected to attain perfect compression of the visual information.
Training strategy. We train our reasoning module using question-image-answer triplets $(q, i, a)$ without any external supervision. Concretely, given a triplet, we predict the answer type $\hat{y}_t$ as well as the corresponding answer encoded by $\hat{y}_b$, $\hat{y}_{cnt}$ or $\hat{y}_{attr}$, respectively associated with True/False, count and attribute questions. Our final loss is defined as $L_{total} = L_t + L_b + L_{cnt} + L_{attr}$, where $L_t$, $L_{cnt}$ and $L_{attr}$ denote cross-entropy losses between the ground truth $y_t$, $y_{cnt}$, $y_{attr}$ and the predictions $\hat{y}_t$, $\hat{y}_{cnt}$, $\hat{y}_{attr}$, whereas $L_b$ is a binary cross-entropy loss between $y_b$ and $\hat{y}_b$.
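The sum of the four loss terms can be illustrated for a single sample with the pure-Python sketch below. The helper names, the use of logits as head outputs and the class counts in the example (3 answer types, 11 count classes, 15 attribute values) are assumptions made for illustration; in particular, whether heads that are irrelevant to a given question type are masked out is not specified here, so the sketch simply sums all four terms.

```python
import math

def cross_entropy(logits, target):
    """Cross entropy between softmax(logits) and an integer class index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def binary_cross_entropy(logit, target):
    """Numerically stable binary cross entropy on a single logit, target in {0, 1}."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def total_loss(type_logits, y_t, bool_logit, y_b, cnt_logits, y_cnt, attr_logits, y_attr):
    """L_total = L_t + L_b + L_cnt + L_attr for one sample (no masking, an assumption)."""
    return (cross_entropy(type_logits, y_t)
            + binary_cross_entropy(bool_logit, y_b)
            + cross_entropy(cnt_logits, y_cnt)
            + cross_entropy(attr_logits, y_attr))

# An untrained model with all-zero logits pays log(#classes) per head.
loss = total_loss([0.0] * 3, 0, 0.0, 1, [0.0] * 11, 2, [0.0] * 15, 7)
```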
Experiments
MDETR [13]: 99.7 99.3 99.9 99.4 99.9 99.9
Table 1: Results on the CLEVR dataset. We report scores on the validation set with a detailed split by question type. Overall, even in the larger memory setup, there is a significant gap between off-the-shelf features and the state-of-the-art MDETR [13], which performs on par with ground truth.
For our comparative study, we choose popular image representations trained with various levels of supervision. Models are always frozen during training and used solely as feature extractors. We use features extracted from two popular architectures: the convolution-based ResNet-50 [15] and the transformer-based ViT-S proposed in [16]. To study the influence of the level of supervision in image backbones, in the case of ResNet-50 we evaluate local features trained in a supervised manner using the classification task on ImageNet, denoted LOC-IN, and self-supervised DINO features, denoted LOC-SSL. For ViT-S, we use features trained with DINO. To evaluate the performance of OBJ, we consider two methods, Slot Attention [8] and DTI-Sprites [9], which demonstrated state-of-the-art segmentation results on the recent CLEVRTex benchmark [17].
We conduct our analysis on the CLEVR dataset [18], a common synthetic benchmark for Visual Question Answering. The CLEVR world consists of 3D objects of different shapes, colours, sizes and materials. The maximum number of objects in a scene is K = 10. Therefore, the minimum memory size of the visual tokens is M = 10 × 7 = 70, for 10 objects, each described by 4 properties and a 3D position (x, y, z).
Since all considered representations are much larger than M, we consider 2 memory sizes in our protocol: (i) a total memory of size 100, called 100 mem size, which approximates the minimum M for the CLEVR dataset, and (ii) a total memory of size 1000, called 1000 mem size, to account for richer representations than the bare minimum required for solving the VQA task.
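As a worked check of these budgets, the arithmetic below uses only the quantities stated above (K = 10 objects, 4 properties and a 3D position per object) and relates the two memory regimes to the minimum M; the ratios are rounded.

```latex
M = K \times d_v^{\min} = 10 \times (4 + 3) = 70,
\qquad
\frac{100}{M} \approx 1.4,
\qquad
\frac{1000}{M} \approx 14.3 .
```

The 100 mem size regime therefore leaves almost no slack over the minimum, whereas the 1000 mem size regime allows roughly one order of magnitude more.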
All the details on adaptation and implementations can be found in Appendix 5.1.

Results and discussion
We conduct 2 experiments. In the memory size constraint test, we study the effectiveness of different visual features in solving the VQA task when heavily constraining the input size of the reasoning module. The overall accuracy as well as the accuracy for each question category are shown in Table 1. Moreover, we study the effectiveness of visual features in solving VQA with a limited number of samples. The idea behind this experiment is to test the generalization capacity of the features, as we argue that accurate and discriminative features should generalize better to unseen samples. Figure 2 shows a comparison for varying fractions of the training set for both memory sizes.
Are visual representations anywhere close to ground truth performances? From the 100 mem size setup, it is clear that all of the methods except for DTI-Sprites fail to attain high accuracy, which suggests that they do not encode visual information in a compact way that is comparable to perfect visual information. We attribute the overall best performance of DTI-Sprites in this restrictive memory size regime to the inherently explicit nature of its representation. The original memory size of a DTI-Sprites visual token is the closest to the ground truth, which plays in its favour in this experiment.
Do visual representations contain sufficient information for solving VQA? In the more relaxed memory constraint setup (1000 mem size), all the studied visual representations enable the reasoning module to learn to solve VQA to some extent, with Slot Attention clearly outperforming the rest. This may suggest that even if the visual information is not encoded in a way that can be heavily compressed, these features nonetheless contain significant semantic information. To study the effect of introducing the memory constraint, we also train our reasoning module using the original feature sizes. In the case of DINO ResNet-50 features, we obtained 87.9% overall accuracy, which indicates that our memory adapter does introduce a bottleneck. Nevertheless, features at their best are still far from the ground truth performance.
Are object-centric representations more suitable for the task? For the exist questions, OBJ features outperform LOC ones. This is expected and quantitatively indicates that structuring the scene into a set of objects is much more suitable for reasoning. When it comes to comp num, query attr and comp attr, Slot Attention performs significantly better than the other methods. We argue this can be attributed to the object-centric nature of the representation, which facilitates comparisons among objects and focuses on describing their properties. We note that OBJ features were indeed trained on CLEVR, contrary to LOC features. However, we tried fine-tuning DINO features on the CLEVR dataset and obtained worse results. We hypothesize that they are not suited for synthetic datasets like CLEVR, which is smaller and much less diverse than standard SSL datasets.

Are visual representations able to reason from a few examples?
We observe that starting at 20% of the full training set and with perfect visual information corresponding to the ground truth, it is possible to infer reasoning patterns. However, at this training set size, none of the considered visual representations enables visual reasoning, regardless of the memory constraint.
Are compact and explicit visual representations better suited for learning to reason from a few examples? Looking at the 100 mem size setup (Figure 2a), we can see that DTI-Sprites is able to obtain higher accuracy with far fewer examples than the other methods, whereas all methods achieve comparable, very low training losses (see Appendix 5.3). This may indicate that explicit representations enable a quicker discovery of relevant information than implicit representations.
Do object-centric representations prevent learning spurious correlations? Figure 2b suggests that both OBJ and LOC representations exhibit similar behaviour when the memory is less constrained. Given that they all reach similar, very high training accuracy, this indicates that structuring the representation into objects may not be as good at preventing the learning of spurious correlations as the distinction between explicit and implicit representations.

Conclusions
We investigated to what extent off-the-shelf representations model the information necessary to perform visual reasoning. To that end, we designed a new feature evaluation protocol based on VQA which aims at disentangling the vision part from the reasoning part as much as possible. Using our evaluation protocol, we make three key findings: (i) off-the-shelf visual representations are far from being able to structure visual information in a compact manner, (ii) object-centric representations seem to be better at encoding the critical information necessary for reasoning, and (iii) limiting the training set size has a dramatic impact on the learning of spurious correlations. While these findings contrast with the excellent performance that off-the-shelf features usually obtain in simpler vision tasks, they also show that having representations that encode object properties is a promising first step towards unsupervised visual reasoning.

Dense local features
We use features extracted from two popular architectures: the convolution-based ResNet-50 [15] and the transformer-based ViT-S proposed in [16]. For ResNet-50, we use the local features after the last convolutional layer, right before the global average pooling, whereas for ViT-S, we use the output tokens of the last layer corresponding to the image patch positions (i.e., the CLS token is not used). To encode the position of local features, we use the corresponding 2D position in the feature map for ResNet-50 features and the 2D position of the corresponding patch for ViT-S features.
For ResNet-50 pre-trained on ImageNet, we use the model weights available in the torchvision package. For both DINO ResNet-50 and DINO ViT-S/16, we use the weights provided in the original DINO repository.

Object-centric representations
To evaluate the performance of unsupervised object-centric representations, we consider two methods, Slot Attention [8] and DTI-Sprites [9], which demonstrated state-of-the-art segmentation results on the recent CLEVRTex benchmark [17]. We first train both methods on the CLEVR dataset, which does not require any supervision, and then use the pre-trained models as feature extractors.

Slot Attention
In Slot Attention, we use the slots right before the decoder part as features. We use the implementation available in the CLEVRTex benchmark repository. We train the model on the original CLEVR VQA dataset, preserving the original train/validation splits. To keep a resolution similar to that of the multi-object segmentation task, we feed input images resized to 120 × 160. We do not apply any centre cropping, to make sure all the objects in the scenes are clearly visible. Following [17], we use 11 slots. We also maintain the original learning rate, batch size and optimizer settings, as well as 500k iterations of training.

DTI-Sprites
We use the original implementation of DTI-Sprites. Similar to Slot Attention, we train the model on images resized to 120 × 160 with no centre cropping. To account for the smaller resolution compared to the original multi-object segmentation task, we increase the expressiveness of the representation in the backbone by changing the adaptive average pooling to 4 × 4, instead of the originally proposed 2 × 2. We train the model with 10 layers, corresponding to the maximum number of objects in the CLEVR dataset. We also increase the number of prototypes to 8, since we observed that training with only 6 did not yield a complete set of prototypes for the dataset. We train the model with the original batch size and learning rate for 760k iterations.

Memory adaptation details
To match the memory constraints in the 2 setups for all visual encoders, we use the PCA implementation in scikit-learn [19]. We first extract features for the train and validation sets and then apply PCA offline.
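The two adaptation steps (adaptive average pooling to roughly K tokens, then PCA on the token dimension) can be sketched as follows. This is a minimal NumPy stand-in under stated assumptions, not the actual implementation: PCA is written with an SVD instead of scikit-learn, the PCA basis is assumed to be fitted on training tokens and reused for validation, and the 7 × 7 × 16 feature maps, the 3 × 3 pooling grid and the 9 × 11 factorisation of the 100 mem size budget are all hypothetical.

```python
import numpy as np

def adaptive_avg_pool_2d(fmap, out_h, out_w):
    """Average-pool an (H, W, d) feature map into out_h * out_w tokens of size d."""
    H, W, _ = fmap.shape
    tokens = []
    for i in range(out_h):
        r0, r1 = (i * H) // out_h, ((i + 1) * H + out_h - 1) // out_h
        for j in range(out_w):
            c0, c1 = (j * W) // out_w, ((j + 1) * W + out_w - 1) // out_w
            tokens.append(fmap[r0:r1, c0:c1].mean(axis=(0, 1)))
    return np.stack(tokens)

def fit_pca(train_tokens, n_components):
    """Fit PCA on stacked training tokens of shape (num_tokens, d_v)."""
    mean = train_tokens.mean(axis=0)
    # Rows of vt are principal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(train_tokens - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(tokens, mean, components):
    """Project (N_v, d_v) tokens onto the fitted principal directions."""
    return (tokens - mean) @ components.T

# Toy data standing in for extracted local features (hypothetical sizes).
rng = np.random.default_rng(0)
train_maps = rng.normal(size=(20, 7, 7, 16))
val_map = rng.normal(size=(7, 7, 16))

# Step 1: pool each 7x7 grid to 3x3 = 9 tokens, i.e. roughly K = 10.
train_tokens = np.concatenate([adaptive_avg_pool_2d(m, 3, 3) for m in train_maps])
# Step 2: fit PCA offline on training tokens and reuse it for validation.
mean, components = fit_pca(train_tokens, n_components=11)
val_tokens = apply_pca(adaptive_avg_pool_2d(val_map, 3, 3), mean, components)
```

With these illustrative sizes, each image ends up with 9 tokens of dimension 11, i.e. 99 values, which fits the 100 mem size budget.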

Training details
We implement our framework in PyTorch [20]. For the text encoder, we use the RoBERTa model available in the HuggingFace library [21]. We follow the same training strategy for all visual encoders. We train the reasoning module for 40 epochs with a batch size of 64. We use the AdamW [22] optimizer with a learning rate of 1e-4, a weight decay of 1e-4 and a linear warmup for the first 10k iterations. We then decrease the learning rate by a factor of 10 at epochs 30 and 35.
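The learning-rate schedule described above can be written as a small function of the global iteration and the current epoch. This is a sketch under assumptions: the function name is hypothetical, steps and epochs are taken to be 0-indexed, and the warmup is assumed to ramp linearly from near zero to the base rate.

```python
def learning_rate(step, epoch, base_lr=1e-4, warmup_steps=10_000,
                  milestones=(30, 35), gamma=0.1):
    """Learning rate at a given global iteration `step` and `epoch` (both 0-indexed)."""
    # Linear warmup over the first `warmup_steps` iterations, then constant.
    warmup = min(1.0, (step + 1) / warmup_steps)
    # Divide by 10 for every milestone epoch that has been reached.
    decay = gamma ** sum(epoch >= m for m in milestones)
    return base_lr * warmup * decay
```

For instance, `learning_rate(9_999, 0)` returns the base rate of 1e-4, and any post-warmup step in epoch 35 or later gives approximately 1e-6.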
Table 2: Memory size constraints imposed at the feature adaptation step. We conduct our studies in 2 memory size regimes: 100 mem size and 1000 mem size. We also provide the raw output sizes for each visual encoder to highlight the extent of compression and expansion applied at the adaptation stage.
method | input image size | raw output size | 100 mem size | 1000 mem size
Computational cost. Regardless of the visual encoder used as an input, a full training run takes approximately 20 hours on a single NVIDIA Tesla V100 GPU. In total, we used approximately 15k GPU hours, both for obtaining the visual features and for the evaluation phase.

Details on the training set size study
In the training set study, we limit the number of training samples by reducing the number of scenes. We observe that starting at 20% of the full training set with perfect visual information corresponding to the ground truth, it is possible to obtain high validation accuracy. Therefore, we study the effectiveness of visual encoders at this threshold.
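Reducing the training set by scenes rather than by individual questions can be sketched as below. The function name, the random choice of scenes and the fixed seed are assumptions for illustration; the `image_index` key is assumed to identify the scene each question belongs to.

```python
import random

def subsample_by_scene(questions, fraction, seed=0):
    """Keep every question that belongs to a random `fraction` of the scenes."""
    scenes = sorted({q["image_index"] for q in questions})
    n_keep = max(1, round(fraction * len(scenes)))
    kept = set(random.Random(seed).sample(scenes, n_keep))
    return [q for q in questions if q["image_index"] in kept]

# Toy example: 10 scenes with 3 questions each, keep 20% of the scenes.
questions = [{"image_index": s, "question": f"q{k}", "answer": "yes"}
             for s in range(10) for k in range(3)]
subset = subsample_by_scene(questions, fraction=0.2)
```

Sampling whole scenes keeps all questions about a retained image together, so no image from a discarded scene is ever seen during training.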
Figure 3 shows the learning curves at a 20% fraction of the training set for both memory size regimes.
Figure 3a depicts the learning curves for 100 mem size. For all of the methods, we observe that the model quickly starts to overfit, with DTI-Sprites performing relatively better.
In the case of 1000 mem size (Figure 3b), we observe the overfitting effect happening even earlier in the training process. This may indicate the occurrence of leakage from visual encoders to the reasoning part.
In both memory size regimes OBJ representations demonstrate relatively better generalization capabilities but are still far from the ground truth information.

Figure 1: Our evaluation framework overview. During training, only the parameters of the reasoning module (above the dashed line) are trained, while the rest of the pipeline remains frozen.

Figure 2: Results of the few-shot VQA experiments on CLEVR dataset.

Figure 3: Training and validation curves at a 20% fraction of the training set. We report the training loss on the 20% of the training set being used, while the validation loss is computed over the whole CLEVR validation set.