Multi-Label Contrastive Learning for Abstract Visual Reasoning

For a long time, the ability to solve abstract reasoning tasks was considered one of the hallmarks of human intelligence. Recent advances in the application of deep learning (DL) methods led to surpassing human abstract reasoning performance, specifically in the most popular type of such problems, Raven’s progressive matrices (RPMs). While the efficacy of DL systems is indeed impressive, the way they approach the RPMs is very different from that of humans. State-of-the-art systems solving RPMs rely on massive pattern-based training and sometimes on exploiting biases in the dataset, whereas humans concentrate on the identification of the rules/concepts underlying the RPM to be solved. Motivated by this cognitive difference, this work aims at combining DL with the human way of solving RPMs. Specifically, we cast the problem of solving RPMs into a multilabel classification framework where each RPM is viewed as a multilabel data point, with labels determined by the set of abstract rules underlying the RPM. For efficient training of the system, we present a generalization of the noise contrastive estimation algorithm to the case of multilabel samples and a new sparse rule encoding scheme for RPMs. The proposed approach is evaluated on the two most popular benchmark datasets [I-RAVEN and procedurally generated matrices (PGM)], and on both of them it demonstrates an advantage over the state-of-the-art results.


Introduction
Abstract visual reasoning tasks are widely accepted as a way of measuring human intelligence. The most popular example of such a task is Raven's Progressive Matrices (RPMs) [25,26], where one is required to identify abstract relations between visually simple objects and their attributes (see Figure 1). The importance of RPMs in measuring human intelligence is justified by the fact that solving them requires an incremental strategy for inducing regularities in each problem [3]. Previous works identified a major gap between the performance of Machine Learning (ML) algorithms and humans [28,37], which sparked interest in these problems within the deep learning (DL) community. As a result, the human performance advantage quickly vanished with the development of recent DL models [35].
However, the way ML algorithms solve RPMs leaves something to be desired. In order to solve these tasks, humans are required to come up with a strategy which correctly identifies all the underlying rules and differentiates them from distracting features, while the vast majority of ML approaches focus on selecting a correct answer, without an explicit explanation for the output. Such a direct formulation of the optimization problem encourages neural models to simplify the solution process and rely on biases instead of understanding the internal problem structure. Indeed, such a pathology was identified for the RAVEN dataset [37], where networks were able to arrive at a correct answer just by comparing the set of possible choices [14]. Similar problems were noted throughout the visual reasoning literature [39,16].
Figure 1: Solving RPMs requires identifying abstract relationships hidden behind random visual distractors and contrasting possible answers to select the one which fits best. The left RPM, although it has more visual details, is governed by only a single rule (row-wise AND applied to shape position), whereas the perceptually simpler right RPM contains 8 distinct relations applied to both outer and inner structures. The examples come from the PGM and Balanced-RAVEN datasets, respectively, and in both cases the correct answer, which is to be placed in the bottom-right panel, is A.
A large body of cognitive literature identified contrastive mechanisms and the ability to create analogies as key ingredients for adaptive problem solving [7,13,29], which allow previous experiences to be applied to novel domains. Naturally, there have been various attempts to replicate such behaviour in neural models, either in the form of an explicit contrastive module [38] or through the way the data is presented during training [11], which in both cases resulted in notable improvements. Moreover, it was demonstrated that when models were additionally trained to predict a symbolic explanation for their answers (so-called auxiliary training), their generalisation capabilities increased substantially [28]. Nonetheless, despite the notable positive impact of auxiliary training, the topic remains underexplored.
Motivation and Contribution. In order to build models which understand how to solve RPMs, we look for alternative training approaches. Encouraged by the effectiveness of both contrastive learning methods and symbolic explanations, we seek to develop a novel auxiliary training method for abstract visual reasoning tasks. We aim to exploit two recurring themes present in human approaches: the contrastive mechanism which differentiates between correct and wrong answers to the RPMs, and the inherent ability to first identify all the abstract relations defining a given RPM and then use them to select the answer. The main contribution of this work is four-fold:
• Approaching the problem of solving RPMs by casting it into a multi-label classification framework, where labels are determined by the underlying abstract rules. This viewpoint allows us to explore a novel training method for abstract visual reasoning tasks.
• Devising a new formulation of the Noise Contrastive Estimation (NCE) learning algorithm for the case of multi-label samples, whose application imitates the human approach to solving RPMs.
• Proposing a new sparse rule encoding scheme for RPMs which provides a more explicit rule representation compared to the method used in prior works.
• Integrating both contrastive and auxiliary training into a novel ML approach to solving abstract visual reasoning tasks, which sets new state-of-the-art results on two major benchmark datasets: Balanced-RAVEN [14] and PGM [28].

Related work
Raven's Progressive Matrices. Although RPMs are characterised by a rather simple visual representation, solving them is often a challenging task, as it requires correctly identifying all abstract relations between the component RPM images. In order to measure the generalisation ability of neural modules in relational reasoning problems, Santoro et al. (2018) introduced the dataset of Procedurally Generated Matrices (PGM), which contains RPM problems divided by the authors into train and test sets. An important part of the PGM dataset are the meta-target annotations, which encode the relations between objects and their attributes in a given RPM. Models trained to additionally predict these meta-targets by means of auxiliary training were shown to possess stronger generalisation capabilities compared to those that did not employ such auxiliary training.
The topic of meta-annotations was further extended in the RAVEN dataset [37], which contains additional structural annotations and RPMs with highly compositional structure. However, as reported in previous works, in the case of RAVEN data the auxiliary training seems to have little to no positive impact [37,38]. Our investigations on the Balanced-RAVEN dataset show that a more explicit rule encoding scheme can mitigate this problem.
A recent paper [14] called into question the abstract visual reasoning capabilities of reported models by highlighting a major flaw in the process of generating candidate answers in RAVEN. Due to an unintended bias in the generated set of possible choices, models trained purely on answers were still able to select a correct solution. To mitigate this problem, the Balanced-RAVEN dataset was proposed, in which the answers are generated in an unbiased way [14].
Initial reports of machine RPM solving demonstrated a notable gap between humans and ML algorithms, which stimulated active research in this area. A number of methods were designed for solving RPMs, which aim to identify relations between sets of objects [28,40] using a Relation Network [27], reason with a multi-layer multiplex graph neural network [33] or discover compositional representations using a scattering transformation [35]. Other works investigate human-inspired approaches, by measuring feature differences [21] or exploiting the hierarchical structure of RPMs [14].

Contrastive representation learning.
Previous studies have shown that the ability to make analogies and contrast experiences is a key ingredient of human intelligence [7,13,15,20]. Consequently, these concepts were also implemented in ML systems solving abstract reasoning problems. Hill et al. (2019) discussed how to make analogies by contrasting abstract relational structure, while Zhang et al. (2019) incorporated contrast directly in the model architecture. Motivated by these findings, we propose another approach to abstract reasoning, by incorporating a contrastive mechanism directly in the objective function, similarly to contrastive learning implementations in other domains.
Among contrastive learning algorithms, the most related to our work is the Noise Contrastive Estimation (NCE) method [8,23], which was successfully utilized across various domains, such as image recognition [4], natural language processing [22] or reinforcement learning [19,31]. NCE aims at building congruent representations for semantically related samples (positive pairs) and dissimilar representations for unrelated observations (negative pairs). It was shown that the concept of learning representations with NCE (as well as with other contrastive learning methods) is related to mutual information (MI) maximization principles [24,12,32].
Contrastive learning is particularly useful in semi-supervised pre-training methods, where labeled data for downstream tasks is scarce [4,5,6,10]. A recent work [17] has shown that contrastive learning can also outperform the classical approach (training with cross-entropy) in the fully supervised setting and is more robust and stable with respect to hyperparameter selection. This superior performance is a result of employing multiple positive pairs, which was further shown to increase the MI lower bound by considering MI estimation as multi-label classification [30]. Despite this multi-label viewpoint, the framework proposed by Song and Ermon, as well as the vast majority of prior works, focuses on multi-class classification and does not discuss the applicability of contrastive methods to multi-label problems. Our work aims at bridging this gap by defining a general contrastive learning framework which supports multi-label samples.

Multi-Label Contrastive Learning
Motivated by successful applications of contrastive learning methods in other domains, we aim to extend them to the multi-label setting and investigate their usage in learning representations of abstract visual reasoning problems. We start by proposing a general multi-label learning framework and then discuss its applicability to solving RPMs.

Preliminaries
Our method builds on the foundations of Supervised Contrastive Learning proposed by Khosla et al. (2020) and extends it to support multi-label data. Given a randomly sampled batch $\{x_i, y_i\}_{i=1 \ldots N} \in \{\mathcal{X} \times \mathcal{Y}\}$ of size $N$, the base method consists of the following integral components:
• A data augmentation module, which transforms image $x_i$ into two randomly augmented views $\tilde{x}_{2i}$ and $\tilde{x}_{2i-1}$, leading to an extended batch $\{\tilde{x}_i, \tilde{y}_i\}_{i=1 \ldots 2N}$ such that $\tilde{y}_{2i} = \tilde{y}_{2i-1} = y_i$. It should be noted that while this step is critical for Supervised Contrastive Learning, its usage in our framework is optional.
• An encoder network $f_\theta$ which forms latent representations of the augmented views, defined as $h_i = f_\theta(\tilde{x}_i)$. The representations obtained from the encoder are $\ell_2$-normalized, which encourages learning from hard negatives and hard positives [17] and simplifies the final linear classification task by aligning the features from positive pairs and uniformly distributing them on the hypersphere [34].
• A projection network $g_\phi$, which maps the feature representation into a lower-dimensional vector $z_i = g_\phi(h_i)$ suitable for computation of the contrastive loss. The representations obtained from the projection network are $\ell_2$-normalized, which in this case allows the similarity between two vectors to be measured by their dot product. The projection network is realized by a non-linear function, an MLP with a single hidden layer, which was shown to be of critical importance [4,5,6].
Both $f_\theta$ and $g_\phi$ are optimized jointly with respect to the Supervised Contrastive Loss, defined as follows:
$$\mathcal{L}^{sup} = \sum_{i=1}^{2N} \mathcal{L}^{sup}_i \tag{1}$$
$$\mathcal{L}^{sup}_i = \frac{1}{2N_{\tilde{y}_i} - 1} \sum_{j=1}^{2N} \mathbb{1}_{i \neq j} \cdot \mathbb{1}_{\tilde{y}_i = \tilde{y}_j} \cdot \mathcal{L}^{sup}_{i,j} \tag{2}$$
$$\mathcal{L}^{sup}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{i \neq k} \cdot \exp(\mathrm{sim}(z_i, z_k) / \tau)} \tag{3}$$
where $N_{\tilde{y}_i}$ is the number of samples in a given mini-batch with the same label as the anchor $i$, $\mathbb{1}_B \in \{0, 1\}$ is an indicator which evaluates to 1 iff $B$ is true, $\mathrm{sim}(z_i, z_j) = z_i \cdot z_j$ is a similarity measure defined as the dot product and $\tau > 0$ is a constant temperature parameter. After this pre-training stage, the weights of the encoder $f_\theta$ are frozen and the projection network $g_\phi$ is replaced with a randomly initialized linear classification head, which is trained with cross-entropy on the downstream task. This procedure, commonly referred to as the linear evaluation protocol [24,1], provides a simple way to measure the quality of learned representations.

Multi-Label Contrastive Loss
In the default setting, the Supervised Contrastive Loss supports multi-class samples, i.e. $y_i \in \mathcal{Y}$. In order to extend it to multi-label samples $\{x_i, Y_i\}$ such that $Y_i \subset \mathcal{Y}$, we propose a novel objective function, the Multi-Label Contrastive Loss, which is defined as follows:
$$\mathcal{L}^{mlc} = \sum_{i=1}^{2N} \mathcal{L}^{mlc}_i \tag{4}$$
$$\mathcal{L}^{mlc}_i = \frac{1}{\left|\{j \neq i : \tilde{Y}_i \cap \tilde{Y}_j \neq \emptyset\}\right|} \sum_{j=1}^{2N} \mathbb{1}_{i \neq j} \cdot \mathbb{1}_{\tilde{Y}_i \cap \tilde{Y}_j \neq \emptyset} \cdot \mathcal{L}^{mlc}_{i,j} \tag{5}$$
where the normalizing term $\left|\{j \neq i : \tilde{Y}_i \cap \tilde{Y}_j \neq \emptyset\}\right|$ is the number of samples in a given mini-batch which share at least one label with the anchor $i$ and $\mathcal{L}^{mlc}_{i,j} = \mathcal{L}^{sup}_{i,j}$. The difference when compared to eq. (2) lies in the definition of positive pairs for an anchor $i$. While the base formulation defines the set of positives as those samples with exactly the same label, eq. (5) defines it as those samples which share at least one label. Our formulation preserves key properties of the base objective, i.e. it 1) aggregates an arbitrary number of positive samples in the numerator and 2) increases contrastive strength by using all negative samples in the denominator. At the same time, the modified definition of positive pairs makes it possible to handle samples with multiple labels. Analogously to the base method, our framework relies on the inner product of $\ell_2$-normalized vectors as a measure of similarity.
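The positive-pair rule described above (two samples are positives when their label sets intersect) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the temperature value `tau=0.1`, and the choice to let anchors without any positive contribute zero are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def multi_label_contrastive_loss(z, labels, tau=0.1):
    """Sketch of a multi-label contrastive loss.

    z      -- list of projection vectors (one per sample)
    labels -- list of label sets; positives for anchor i are all j != i
              whose label set intersects labels[i]
    tau    -- temperature (hypothetical default, not taken from the paper)
    """
    z = [l2_normalize(v) for v in z]
    n = len(z)
    total = 0.0
    for i in range(n):
        # Dot-product similarity of the anchor with every sample, scaled by tau.
        sims = [sum(a * b for a, b in zip(z[i], z[k])) / tau for k in range(n)]
        # The denominator runs over every other sample in the batch (k != i).
        log_denom = math.log(sum(math.exp(sims[k]) for k in range(n) if k != i))
        positives = [j for j in range(n) if j != i and labels[i] & labels[j]]
        if not positives:
            continue  # assumption: anchors without positives contribute zero
        # Average of -log(exp(sim_ij) / denominator) over all positives.
        total += sum(log_denom - sims[j] for j in positives) / len(positives)
    return total
```

With embeddings that align samples sharing a rule, the loss is close to zero; pairing each anchor with an orthogonal positive makes it large, which is the behaviour the objective is meant to reward and penalise, respectively.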

Adaptation to RPMs
In order to utilise the above-proposed Multi-Label Contrastive Learning (MLCL) framework for solving RPMs, let us first observe that each RPM can be naturally viewed as a multi-label sample, where labels correspond to the rules governing the RPM. Let us consider a mini-batch $\{M_i\}_{i=1 \ldots N}$ of size $N$, where $M_i = \{X_i, Y_i, k_i\}$ represents the whole RPM instance composed of a set of 16 images $X_i$, i.e. 8 context panels and 8 choice panels (out of which only one correctly completes the matrix). $Y_i \subset \mathcal{Y}$ is a set of associated rules such that $1 \le |Y_i| \le N_{\mathcal{Y}}$ and $k_i \in \{1, \ldots, 8\}$ is the index of the correct answer. Here, $\mathcal{Y}$ is the set of all possible rules and $N_{\mathcal{Y}}$ is a dataset-dependent maximal number of rules for a single RPM.
Data augmentation. Previous works on solving RPMs do not report the usage of data augmentation. On the other hand, augmentation was highlighted as a fundamental component of NCE-based learning in other domains, e.g. by Chen et al. (2020). In order to conduct a meaningful comparison with prior literature, both setups, with and without augmentation, are considered in the experiments. In the former case (with augmentation), for each RPM in the mini-batch we apply two randomly selected augmentations, which gives a mini-batch of $2N$ RPMs $\{\tilde{M}_i\}_{i=1 \ldots 2N}$ with $\tilde{M}_i = \{\tilde{X}_i, \tilde{Y}_i, \tilde{k}_i\}$, where $\tilde{X}_{2i}$ and $\tilde{X}_{2i-1}$ are both obtained by augmenting $X_i$. All augmentations preserve both the underlying rules and the index of the correct answer, hence:
$$\tilde{Y}_{2i} = \tilde{Y}_{2i-1} = Y_i, \qquad \tilde{k}_{2i} = \tilde{k}_{2i-1} = k_i.$$
Since RPMs consist of greyscale images, we cannot rely on augmentation methods popular in image recognition. Instead, simple transformations which rearrange or rotate images are used, as showcased in Figure 2. If augmentation is not applied, the base mini-batch $\{M_i\}_{i=1 \ldots N}$ is used.
Contrastive pre-training. For each RPM, we construct 8 different matrices by filling in the remaining context panel with each of the choice panels. For each sample $M_i$, this gives us a single RPM which satisfies the rules $Y_i$ and 7 RPMs which satisfy fewer rules, because of the incorrectly chosen answer. We arrive at a batch divided into two components, one containing correct RPMs with their rules $\{x_i, Y_i\}_{i=1 \ldots 2N}$ and the other one composed of incorrect RPMs $\{x_{i,l}\}_{i=1 \ldots 2N,\, l=1 \ldots 7}$. This division allows us to consider the incorrectly completed RPMs as additional negative samples, which contribute extra terms to the denominator of the contrastive loss. We optimize both $f_\theta$ and $g_\phi$ with respect to the Multi-Label Contrastive Loss defined in eqs. (4)-(7).
Figure 2: Augmentation of RPMs from the Balanced-RAVEN dataset (panels: Base, Horizontal flip, Vertical flip, Vertical roll, Shuffle 3x3). The selected transformation is applied in the same way to all images in a given RPM. The data augmentation module applies a randomly selected combination of the presented methods with additional random rotation and transposition, using the implementation from [2]. For clarity, we only depict different views of a single row for two RPMs belonging to configurations 2x2Grid (left part) and 3x3Grid (right part), respectively.
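The construction of one correctly completed matrix and seven incorrectly completed ones can be sketched as follows. This is an illustrative helper, not code from the paper: the function name is hypothetical, panels are treated as opaque objects, and the answer index is 0-based here whereas the text uses $k_i \in \{1, \ldots, 8\}$.

```python
def complete_matrices(context, choices, answer_idx):
    """Fill the missing ninth panel with each of the 8 choice panels.

    context    -- list of 8 context panels
    choices    -- list of 8 choice panels
    answer_idx -- 0-based index of the correct choice (assumption: the
                  paper's 1-based k_i minus one)
    Returns the correctly completed matrix and the 7 incorrect ones.
    """
    if len(context) != 8 or len(choices) != 8:
        raise ValueError("expected 8 context panels and 8 choice panels")
    candidates = [context + [choice] for choice in choices]
    correct = candidates[answer_idx]
    # The remaining 7 completions serve as additional negative samples.
    incorrect = [c for l, c in enumerate(candidates) if l != answer_idx]
    return correct, incorrect
```

In a training loop, the embedding of `correct` would act as the anchor or positive sample, while the embeddings of the matrices in `incorrect` would only ever appear as negatives.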
Auxiliary training. We support the contrastive pre-training with an auxiliary loss $\mathcal{L}^{aux}$ introduced in [28]. For this purpose, we employ a rule discovery network $\rho$ (implemented as an MLP with a single hidden layer), which transforms outputs of the encoder $f_\theta$ into a $d$-dimensional meta-target prediction, where $d$ is the dimension of the meta-target encoding. We compare two setups for the rule encoding:
• Dense: the multi-hot encoding scheme introduced in [28] and [37]. It encodes each rule as a binary string of fixed length ($d = 12$ for PGM and $d = 9$ for Balanced-RAVEN) and performs a logical OR operation on the whole set of rules.
• Sparse: our proposed scheme, which encodes each rule as a one-hot vector of fixed length equal to the number of unique relations in the dataset ($d = 50$ for PGM and $d = 38$ for Balanced-RAVEN). Similarly to the dense encoding, it performs a logical OR operation on the whole set of rules.
Intuitively, the advantage of the proposed sparse encoding over the dense encoding used in prior works stems from the OR operation being lossless: from the sparse representation one can recover which rules were encoded, which is not always the case for the dense encoding (e.g. when a single object is governed by multiple rules, or the same relation is applied to different objects). The difference between the two encodings is further discussed in supplementary material. Since sparse encoding provides a more explicit training signal, it is the default encoding method for MLCL. Rule predictions are activated using a sigmoid unit and the loss is calculated using binary cross-entropy. We define the Multi-Label Contrastive Loss for solving RPMs as:
$$\mathcal{L} = \gamma \mathcal{L}^{mlc} + \beta \mathcal{L}^{aux}, \tag{8}$$
where $\gamma$ and $\beta$ are balancing factors. For simplicity, we set $\gamma = 1$ and $\beta = 10$ in all main experiments. Additional results are presented in supplementary material.
Figure 3: In all training setups, we start by filling in the context panels with each choice panel. In each case, we generate an embedding with an abstract reasoning encoder network (SCL, HriNet or CoPINet). The embeddings are used as an input to 1) a scoring module which predicts the answer (supervised training, denoted as CE), 2) a rule prediction module which predicts rules as an encoded meta-target (auxiliary training, denoted as AUX) and 3) a projection network which maps them into a lower-dimensional representation suitable for computing the contrastive loss. The proposed approach, MLCL, combines both auxiliary and contrastive training into a joint learning framework.
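The weighted combination of the contrastive and auxiliary terms, with sigmoid-activated rule predictions scored by binary cross-entropy, can be sketched as below. The function names are hypothetical, the contrastive term is passed in as an already computed number, and averaging the binary cross-entropy over the $d$ meta-target dimensions is an assumption (the text does not state the reduction).

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy between sigmoid(logits) and 0/1 targets.

    Uses the numerically stable form max(x, 0) - x*t + log(1 + exp(-|x|)).
    """
    terms = [
        max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
        for x, t in zip(logits, targets)
    ]
    return sum(terms) / len(terms)


def joint_loss(contrastive_loss, rule_logits, meta_target, gamma=1.0, beta=10.0):
    """Weighted sum of a contrastive term and the auxiliary rule-prediction term.

    gamma and beta default to the values reported for the main experiments.
    """
    aux = bce_with_logits(rule_logits, meta_target)
    return gamma * contrastive_loss + beta * aux
```

Setting `beta=0.0` leaves only the contrastive term and `gamma=0.0` leaves only the auxiliary term, which corresponds to the two ablations discussed later in the paper.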
Linear evaluation. After the contrastive pre-training step, we discard the projection head $g_\phi$, freeze the parameters of the encoder $f_\theta$ and attach a simple linear scoring head $s_\psi$ with a single output neuron. For each RPM problem $M_i$, we employ $f_\theta$ to generate the matrix representations $\{h_i\} \cup \{h_{i,l}\}_{l=1 \ldots 7}$, calculate a score for each of them using the scoring head $s_\psi$ and apply softmax to produce a probability distribution over the set of possible answers. Using the estimated probabilities, we optimize the scoring head with a standard cross-entropy loss and keep the weights of the encoder network frozen.
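The scoring step of the linear evaluation can be sketched in a few lines. This is a simplified illustration under stated assumptions: embeddings are plain lists of floats produced by a frozen encoder, the scoring head is a single weight vector plus bias, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_candidates(embeddings, weights, bias=0.0):
    """Linear scoring head with one output neuron, applied to each embedding."""
    return [sum(w * h for w, h in zip(weights, emb)) + bias for emb in embeddings]


def answer_cross_entropy(embeddings, weights, answer_idx, bias=0.0):
    """Cross-entropy of the softmax over the 8 completed matrices.

    embeddings -- 8 frozen-encoder representations, one per completed matrix
    answer_idx -- 0-based index of the correctly completed matrix
    """
    probs = softmax(score_candidates(embeddings, weights, bias))
    return -math.log(probs[answer_idx]), probs
```

Only `weights` and `bias` would receive gradient updates in this stage, since the encoder that produced `embeddings` stays frozen.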

Experiments
We compare our method (MLCL) with a fully supervised framework for solving RPMs used throughout the literature [28,37], referred to as CE. In CE, both the encoder network $f_\theta$ and the scoring head $s_\psi$ are used and optimized in an end-to-end manner with respect to the cross-entropy objective function $\mathcal{L}^{ce}$. Since our contrastive training method is inherently based on the description of the RPM rules, we support the supervised training with the same auxiliary loss $\mathcal{L}^{aux}$ and rule discovery network $\rho$, obtaining:
$$\mathcal{L} = \mathcal{L}^{ce} + \beta \mathcal{L}^{aux}, \tag{9}$$
where $\beta$ is a balancing coefficient. This enhanced training is referred to as CE+AUX. Following the setup from [28], we set $\beta = 10$ in all experiments. In order to comprehensively compare our method with the baseline, in the experimental evaluation we consider three different state-of-the-art models for abstract visual reasoning and use two abstract reasoning RPM benchmarks.

Datasets.
Our main findings are demonstrated on the newest RPM benchmark set, Balanced-RAVEN [14], which contains visually diverse RPMs with highly compositional structure. Balanced-RAVEN is a modification of the original RAVEN dataset [37] which fixes the defect of biased choice panels. It contains novel relation types which are not present in PGM. Each RPM in Balanced-RAVEN can be governed by up to 8 different rules. Balanced-RAVEN is divided into seven distinct visual configurations. In the default setting each of them contains 10K problem instances. Additionally, we support our studies with experiments on the large-scale PGM dataset [28], containing RPM problems with between 1 and 4 abstract rules. The main purpose of PGM is to test the generalisation skills of ML models across various regimes. Each regime is characterised by explicitly defined differences between training and test data, which address various types of generalisation, e.g. interpolation or the ability to reason about relations not seen in the training set. The dataset contains 1.42M RPM problems per regime.
Models. We use three different state-of-the-art models for solving RPMs as the encoder network $f_\theta$: SCL [35], HriNet [14] and CoPINet [38]. In order to reduce the training time and memory consumption of HriNet, we replaced its ResNet backbone [9] in all three hierarchies with a simpler CNN architecture, analogous to the one used in the Wild Relation Network [28]. The details are presented in supplementary material.
Implementation details. Each model was trained for at most 50 epochs with a batch size of 256 and a learning rate of 0.003 on the PGM dataset, and for at most 100 epochs with a batch size of 128 and a learning rate of 0.002 on Balanced-RAVEN. Model parameters were optimized using the ADAM optimizer [18] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ until the maximal number of epochs was reached or the loss stopped improving on a validation set. All experiments were conducted using mixed-precision training on a single worker with 4 NVIDIA Tesla P100 GPUs, each with 16 GB of memory.

Results
We start the experimental evaluation by comparing the proposed MLCL method with other training setups for three different ML abstract reasoning models on the Balanced-RAVEN dataset (see Table 1). For all three models, MLCL with data augmentation exceeds the best results from the literature, and in the case of CoPINet by a significant margin. Furthermore, for all three models MLCL significantly outperforms the base setup (CE), which does not utilize the rule-related information in the training signal. This observation confirms that MLCL is able to absorb and efficiently utilize this additional structural information, which is not a common property, as shown by CoPINet's performance deterioration when trained with dense encoding.
For SCL and HriNet, MLCL (without augmentation) matches the performance of supervised training supported by an auxiliary loss with sparse encoding, and for CoPINet it exceeds that performance by more than 10 p.p. We hypothesize that the contrastive nature of our objective function (eqs. (4)-(7)) amplifies the benefits of the architectural contrastive mechanisms in CoPINet.
Moreover, auxiliary training with sparse encoding (CE+AUX+sparse) always improves performance over the baseline CE setup (with $\beta = 0$ in eq. (9)). At the same time, CE+AUX+dense results in worse performance than CE for CoPINet, which aligns with the outcomes reported in [38], and yields only a slight improvement for both SCL and HriNet. Hence, for the Balanced-RAVEN dataset, the more explicit, sparse rule encoding scheme is overall beneficial, regardless of the particular ML model.
Table 2 presents results on the PGM dataset. Due to the huge number of RPM instances, we only compare results for the best-performing encoder network, SCL, with the best MLCL configuration, i.e. with augmentation. In the most demanding regime (H.O. Shape-Colour) all training methods achieve close-to-random results. MLCL+AUG achieves superior results in 2 regimes, and in the remaining 5 cases the best outcomes are accomplished by SCL supervised training supported with an auxiliary loss with either dense or sparse encoding. MLCL outperforms the base setup (CE) in 5 regimes. The results show that for the PGM dataset, none of the evaluated training setups clearly stands out. On the contrary, each method seems to exhibit a different kind of generalisation ability. It should be noted, however, that MLCL is tested under the linear evaluation protocol, without fine-tuning of the encoder network $f_\theta$; reaching results comparable to fully supervised setups (or even outperforming them in 2 regimes) under this constraint therefore indicates its strength and potential.

Ablation study
In the ablation study we further validate the role of MLCL framework as an auxiliary training method and analyze its contrastive properties on the Balanced-RAVEN dataset.
Data augmentation. Depending on the choice of the encoder network, MLCL either achieves comparable performance (SCL and HriNet) or outperforms (CoPINet) supervised approaches even without augmentations (cf. Table 1). However, since most contrastive methods rely heavily on data augmentation, we analyze its influence on MLCL for solving abstract visual problems. The  [14] even with a much weaker perceptual backbone.
Joint optimization. MLCL combines both contrastive and auxiliary losses, as shown in eq. (8). We have verified that using either one of its individual components alone is not sufficient for building strong representations. When using the contrastive loss only, that is with $\beta = 0$, we observed a serious performance drop for all models: the accuracy of SCL dropped to 35.3%, of HriNet to 20.7% and of CoPINet to 19.0%. We hypothesize that the lack of auxiliary training information makes it difficult to extract abstract rules and encourages the model to focus too much on the visual similarity of RPMs, which is detrimental to the final downstream task. Similarly, setting $\gamma = 0$ decreased the accuracy of SCL, HriNet and CoPINet to 46.7%, 31.9% and 24.2%, respectively. This suggests that with purely auxiliary training, models are unable to relate the same abstract relationships to different visual figure configurations. Moreover, the absence of the contrastive loss significantly hinders the ability to discriminate between correctly and incorrectly completed RPMs. These observations stress the importance of using the joint loss, which is realized by setting both $\beta > 0$ and $\gamma > 0$.
Contrast strength. The quality of representations learned with contrastive frameworks benefits from applying bigger contrast to positive samples, which is realized by including additional negative pairs while calculating the loss [4]. Analogously, methods which support multiple positive samples tend to benefit from a higher number of positive pairs in a given batch [17]. These two properties of contrastive approaches result in a reliance on large batch sizes, which some works tried to tackle by storing additional negative samples in a memory bank [36] or using a dynamically updated queue with a moving-average encoder [10,6]. Contrary to the above findings, empirical evaluation of MLCL shows that the method does not require large batch sizes and surpasses the performance of traditional supervised training under the same experimental protocol. In fact, for larger batches we observed a slight drop in the final performance across all models, which suggests that they negatively influence the auxiliary training. A further analysis of this phenomenon is presented in supplementary material.
In the final linear classification stage, the main goal is to select a correct answer, which requires discriminating between correctly and incorrectly completed RPMs. We additionally analyze the importance of using RPMs with incorrect answers as additional negative samples, by removing the term $\Sigma_{i,k}$ from the denominator of eq. (6). Without these RPMs, performance on the downstream task drops notably, to 58.0% for SCL, 37.2% for HriNet and 26.5% for CoPINet. In fact, when this additional signal is ignored, the encoder network only learns how to differentiate between correctly completed RPMs governed by different sets of rules, whereas the final classification task requires differentiating between a correct RPM and a set of incorrectly completed RPMs. We conclude that this additional training signal obtained from RPMs with incorrect answers is necessary for high downstream task performance.

Conclusion
In this work we propose a novel NCE algorithm suitable for multi-label samples and integrate it with auxiliary training to devise a new ML approach (MLCL) to abstract visual reasoning tasks. The efficacy of MLCL is tested on the challenging task of solving RPMs, which is formulated in this paper as a multi-label classification problem with a 1-1 correspondence between labels and the abstract rules underlying a given RPM. The proposed approach establishes new state-of-the-art results on the Balanced-RAVEN dataset and demonstrates superior performance in 2 regimes from PGM. The MLCL framework is additionally supported by a sparse rule encoding scheme for RPMs introduced in the paper, which consistently outperforms the encoding method used in prior works on the Balanced-RAVEN dataset and is the preferred scheme for half of the PGM regimes.

B Sparse vs dense rule encoding
Let us now revisit the two rule encoding schemes discussed in the paper. Santoro et al. (2018) proposed to encode the abstract rules as binary strings of length 12 (called meta-targets) according to the following syntax: (shape, line, color, number, position, size, type, progression, XOR, OR, AND, consistent union). In order to support RPMs with multiple rules, the meta-target for a given RPM is obtained by performing an OR operation on the set of encoded individual rules.
For example, for an RPM instance with rules S = {[OR, shape, type], [AND, line, color]}, the method, referred to as dense encoding in the paper, yields the following meta-target: 111000100110. Based on the resultant string, one is able to conclude that the underlying structure consists of OR and AND relations, shape and line objects, type and color attributes. However, recovering the exact relations governing the considered RPM is not possible: for instance, the rule set {[OR, line, type], [AND, shape, color]} produces the same string.
The new rule encoding scheme proposed in the paper, referred to as sparse encoding, is more explicit and makes it possible to unambiguously retrieve the encoded relations, with the aim of providing a more accurate training signal. Since the set of all possible abstract structures in PGM is composed of 50 elements, i.e. there are $|R| \times |O| \times |A| = 5 \times 2 \times 5 = 50$ unique rules, our method encodes each of them as a one-hot vector of length 50. Similarly to the dense encoding, we perform an OR operation on the set of all rules for a given RPM. However, due to the nature of one-hot encoding, all information about the underlying abstract structure is preserved.
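The contrast between the two schemes for PGM can be made concrete with a short sketch. The 12 dense positions follow the syntax quoted above; the ordering of the 50 sparse positions (a Cartesian product of relation, object and attribute) and all function names are assumptions chosen for illustration, not the authors' layout.

```python
from itertools import product

OBJECTS = ["shape", "line"]
ATTRIBUTES = ["color", "number", "position", "size", "type"]
RELATIONS = ["progression", "XOR", "OR", "AND", "consistent union"]

# Dense layout: one position per object, attribute and relation (12 in total).
DENSE_SLOTS = OBJECTS + ATTRIBUTES + RELATIONS
# Sparse layout: one position per (relation, object, attribute) triple (50 in total).
# The ordering of the triples is an assumption made for this sketch.
SPARSE_RULES = list(product(RELATIONS, OBJECTS, ATTRIBUTES))


def dense_encode(rules):
    """OR together the multi-hot encodings of all rules."""
    bits = [0] * len(DENSE_SLOTS)
    for rule in rules:
        for token in rule:
            bits[DENSE_SLOTS.index(token)] = 1
    return bits


def sparse_encode(rules):
    """OR together the one-hot encodings of all rules."""
    bits = [0] * len(SPARSE_RULES)
    for rule in rules:
        bits[SPARSE_RULES.index(tuple(rule))] = 1
    return bits


def sparse_decode(bits):
    """Recover the exact rule set from a sparse encoding."""
    return {SPARSE_RULES[i] for i, b in enumerate(bits) if b}
```

For example, `[("OR", "shape", "type"), ("AND", "line", "color")]` and `[("OR", "line", "type"), ("AND", "shape", "color")]` map to the same dense vector, while their sparse vectors differ and `sparse_decode` returns the original rules in both cases.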
Analogously to PGM, the rule encoding method for RPMs from the Balanced-RAVEN dataset, proposed by Zhang et al. (2019), represents each rule as a multi-hot vector of length |R| + |A| = 9 and combines the set of individual rule encodings by means of an OR operation. In effect, similarly to the case of PGM, recovering individual rules constituting the final encoded representation is not possible. This problem is further exacerbated by the generally higher number of rules in Balanced-RAVEN instances. RPMs in the Balanced-RAVEN dataset contain 6.29 rules on average, compared to only 1.37 rules (on average) in RPMs from the PGM dataset [37]. Consequently, for RPMs from the Balanced-RAVEN dataset, the resultant representation obtained with dense encoding is highly ambiguous.
Sparse rule representations of Balanced-RAVEN RPMs are obtained analogously to the PGM ones. In this dataset there are |R| × |A| = 4 × 5 = 20 unique configurations of rules and attributes. However, Zhang et al. (2019) pointed out that applying Arithmetic to Type is counterintuitive, which leaves 19 valid combinations. Additionally, in some configurations the rules can be applied to both component structures. Namely, each of the configurations L-R, U-D, O-IC, O-IG consists of two distinct substructures: left/right, up/down, outer/inner, outer/inner, respectively. This gives a total of 2 × 19 = 38 unique rules, which is the length of our one-hot vector representation. Again, since each rule occupies its own position in the vector, all information related to the abstract structure is preserved after applying the OR operation.
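The count of 38 can be reproduced with a short sketch, assuming it arises as 19 valid rule-attribute pairs replicated for two substructures. The component labels `first` and `second`, the index ordering and the example structure are hypothetical.

```python
from itertools import product

RULE_TYPES = ("Constant", "Progression", "Arithmetic", "Distribute Three")
ATTRIBUTES = ("Number", "Position", "Type", "Size", "Color")
COMPONENTS = ("first", "second")  # hypothetical names for the two substructures

# Drop the (Arithmetic, Type) pair that the text describes as counterintuitive.
PAIRS = [p for p in product(RULE_TYPES, ATTRIBUTES) if p != ("Arithmetic", "Type")]
SLOTS = [(c, r, a) for c in COMPONENTS for r, a in PAIRS]
SLOT_INDEX = {slot: i for i, slot in enumerate(SLOTS)}


def sparse_encode(rules_by_component):
    """OR together the one-hot vectors of all rules, keeping components apart."""
    vec = [0] * len(SLOTS)
    for component, rules in rules_by_component.items():
        for rule_type, attribute in rules:
            vec[SLOT_INDEX[(component, rule_type, attribute)]] = 1
    return vec


# Illustrative two-component structure (e.g. outer rules first, inner rules second).
example = {
    "first": [("Constant", "Type"), ("Constant", "Color")],
    "second": [("Progression", "Type"), ("Distribute Three", "Color")],
}
vec = sparse_encode(example)
print(len(PAIRS), len(SLOTS), sum(vec))  # 19 38 4
```

Because the same rule-attribute pair occupies different positions for the two substructures, the encoding keeps track of which component each rule governs.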
In summary, although the advantage of sparse encoding for PGM data is demonstrated only in certain regimes, it consistently outperforms the previous (dense) encoding method on Balanced-RAVEN, due to the significantly higher average number of rules per RPM instance in this dataset.

C Data augmentation
To the best of our knowledge this work is the first to investigate the use of data augmentation strategies in solving RPMs. For each sample, our data augmentation module chooses a random subset of the available transformations and applies them in the same way to all RPM panels (images). We consider only image-level augmentations, which do not change the order of RPM panels. The following augmentation methods are used: vertical flip, horizontal flip, rotation by a random angle, transposition, grid shuffle and roll. The grid shuffle transformation splits an image into a 2x2 or 3x3 grid and randomly shuffles its components, whereas the roll operation rolls an image along the vertical axis, horizontal axis or both axes simultaneously.
Examples of augmented RPMs are presented in Figs. F.1-F.9. Each figure, from left to right, presents (a) the base RPM from either Balanced-RAVEN or PGM, (b) the RPM after applying the
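The key property of this module, namely that one randomly chosen set of transformations is shared by all panels of an RPM, can be sketched as follows. This is a simplified illustration on nested lists rather than the paper's implementation: rotation and grid shuffle are omitted for brevity, and the selection probability of 0.5 and all function names are assumptions.

```python
import random


def vflip(img):
    return img[::-1]


def hflip(img):
    return [row[::-1] for row in img]


def transpose(img):
    return [list(col) for col in zip(*img)]


def roll(img, dy, dx):
    """Cyclically shift an image by dy rows and dx columns."""
    h, w = len(img), len(img[0])
    return [[img[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def sample_augmentation(rng, max_shift=8):
    """Pick a random subset of transforms and freeze their random parameters."""
    dy, dx = rng.randrange(max_shift), rng.randrange(max_shift)
    candidates = [vflip, hflip, transpose, lambda img: roll(img, dy, dx)]
    chosen = [t for t in candidates if rng.random() < 0.5]

    def apply(img):
        for t in chosen:
            img = t(img)
        return img

    return apply


def augment_rpm(panels, rng):
    """Apply the same sampled transformation to every panel, preserving panel order."""
    apply = sample_augmentation(rng)
    return [apply(panel) for panel in panels]


rng = random.Random(0)
panel = [[4 * y + x for x in range(4)] for y in range(4)]
augmented = augment_rpm([panel] * 16, rng)
print(all(a == augmented[0] for a in augmented))  # True: identical panels stay identical
```

Sampling the transformation once per RPM, rather than once per panel, keeps the relations between panels intact, which is what the abstract rules depend on.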

D Implementation details
For the sake of speeding up the experiments, in each hierarchy of HriNet we have replaced the ResNet backbone with a simpler CNN. This CNN was composed of 4 layers, each with 32 convolutional kernels of size 3x3 and stride equal to 2. Each layer was followed by batch normalization and ReLU activation. All hyperparameters reported in the paper were chosen without any extensive search, based on manual convergence analysis of a limited number of runs.
In all experiments on both Balanced-RAVEN and PGM datasets, we have trained models on images rescaled to the size 80x80, following the setup from [28]. This allowed us to overcome hardware memory limitations, train with larger batches and conduct extensive comparisons. For the ease of reproducibility, in the source code attachment the experiments are structured as PyTorch Lightning modules.
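As a rough illustration of what this backbone does to an 80x80 input, the spatial size after each of the 4 stride-2 layers can be computed with the standard convolution output-size formula. The padding is not stated in the text, so both values below are assumptions, as is the helper name `conv_out`.

```python
def conv_out(size, kernel=3, stride=2, padding=0):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_sizes(size, layers=4, padding=0):
    sizes = []
    for _ in range(layers):
        size = conv_out(size, padding=padding)
        sizes.append(size)
    return sizes


for padding in (0, 1):  # padding is an assumption, not given in the text
    sizes = feature_map_sizes(80, padding=padding)
    features = 32 * sizes[-1] ** 2  # 32 kernels in the last layer
    print(padding, sizes, features)
# 0 [39, 19, 9, 4] 512
# 1 [40, 20, 10, 5] 800
```

Under either assumption the four layers reduce each panel to a few hundred features, which is consistent with the stated goal of speeding up the experiments.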

E Batch size
We have further validated the influence of batch sizes on the results of MLCL training with and without data augmentation (see Figure B.1). We have varied batch sizes in the contrastive pretraining stage and used a constant size of 128 for linear evaluation.
The best performance for all models was achieved with small and medium-sized batches, ranging from 32 to 128. We hypothesize that although larger batches allow for increasing the contrast strength of the proposed Multi-Label Contrastive Loss (Eq. (8) in the paper), at the same time they hinder the convergence of the auxiliary loss.
This empirical evaluation stresses the importance of balanced interplay between the auxiliary and contrastive losses in our learning framework.

F Loss coefficients
In all experiments reported in the paper we set γ = 1 and β = 10 as the balancing factors used for the Multi-Label Contrastive Loss (Eq. (8)). In order to verify the accuracy of this choice, we

Figure F.9: Augmented RPM from the PGM dataset, with two rules applied to two different objects: shape and line. Firstly, note that the first two rows have consistent unions of shape numbers. That is, the first row contains panels with 9, 8 and 0 shapes, whereas the second row with 0, 9 and 8 shapes. This suggests that the missing panel should contain 9 shapes. However, there are multiple choices which satisfy this condition (B, C, D and F). Therefore, it is necessary to further notice the consistent union of line colors in both completed rows. This results in an abstract structure for this RPM defined as S = {[shape, number, consistent union], [line, color, consistent union]} and D being the correct answer.

Figure: Additional ablation studies on the Balanced-RAVEN dataset. The figures present variations in the final classification performance averaged across 4 random seeds depending on: 1) a batch size used for the contrastive pre-training stage of MLCL a) without and b) with data augmentation; 2) balancing factors in the definition of Multi-Label Contrastive Loss c) without and d) with data augmentation.

Figure F.1: Augmented RPM from the Balanced-RAVEN dataset with configuration Center. In each of the two top rows, there is a single shape in each image ([Constant, Number]), located in the same position ([Constant, Position]); the shapes are of the same type ([Constant, Type]) and size ([Constant, Size]). In each row there are shapes in 3 different colors ([Distribute Three, Color]). This leads to an underlying abstract structure S = {[Constant, Number], [Constant, Position], [Constant, Type], [Constant, Size], [Distribute Three, Color]}, which is realized by completing the matrix with answer D.

Figure F.4: Augmented RPM from the Balanced-RAVEN dataset with configuration 2x2Grid. The number of objects in the third column can be calculated by subtracting the number of objects in the second column from the number of objects present in the first column ([Arithmetic, Number]). In each row, every panel contains objects with one out of three unique types, sizes and colors ([Distribute Three, Type], [Distribute Three, Size], [Distribute Three, Color]). These relations lead to an underlying abstract structure defined as S = {[Arithmetic, Number], [Distribute Three, Type], [Distribute Three, Size], [Distribute Three, Color]}. The only correct answer which satisfies all the above rules is D.

Figure F.7: Augmented RPM from the Balanced-RAVEN dataset with configuration O-IG. Similarly to the O-IC configuration, the rules of O-IG are applied to both inner and outer structures. Here, the outer structure is defined as S_outer = {[Constant, Number], [Constant, Position], [Constant, Type], [Constant, Size], [Constant, Color]} and the inner structure as S_inner = {[Constant, Number], [Constant, Position], [Progression, Type], [Distribute Three, Size], [Distribute Three, Color]}. E is the correct answer.

Figure F.8: Augmented RPM from the PGM dataset. In each row, the number of shapes increases by one in each consecutive panel from left to right. Namely, the first row contains images with 4, 5 and 6 shapes, respectively, whereas the second row with 0, 1 and 2 shapes, respectively. Following this pattern, we expect the missing image in the bottom row to be composed of four shapes, which is realized by choosing the answer A. The underlying abstract structure is defined as S = {[shape, number, progression]}.

Table 1: Test accuracy on the Balanced-RAVEN dataset averaged across 4 random seeds. Results are reported for three different encoder networks and the following training setups: supervised with cross-entropy (CE), supervised augmented with auxiliary training with dense (CE+AUX-dense) or sparse (CE+AUX-sparse) encoding and Multi-Label Contrastive Learning without (MLCL) or with (MLCL+AUG) data augmentation. Average denotes the mean accuracy for all configurations, L-R denotes Left-Right, U-D Up-Down, O-IC Out-InCenter and O-IG Out-InGrid. For each model, the last row presents the best result reported in the literature.

Table 2: Accuracy in all regimes of the PGM dataset with SCL as the encoder network. We use the same notation for training methods as in Table 1. For each regime, we report results on the validation set (Val.), test set (Test.) and their difference (Diff.). Training and validation sets have the same distribution, whereas the test set has a different distribution, specific to a given regime.

results of using data augmentation reported in Table 1 (denoted as MLCL+AUG) show consistent improvement for all encoders. This aligns with observations reported throughout the contrastive representation learning literature. Most notably, combining our contrastive framework for solving RPMs with data augmentation reduces the error rate on Balanced-RAVEN to 3.2%, which sets the new state-of-the-art result. Additionally, we see a large performance gain for HriNet, which surpasses the original result reported in