JIST: Joint Image and Sequence Training for Sequential Visual Place Recognition

Visual Place Recognition aims at recognizing previously visited places by relying on visual cues, and it is used in robotics applications for SLAM and localization. Since a mobile robot typically has access to a continuous stream of frames, this task is naturally cast as a sequence-to-sequence localization problem. Nevertheless, obtaining sequences of labelled data is much more expensive than collecting isolated images, which can be done in an automated way with little supervision. To mitigate this problem, we propose a novel Joint Image and Sequence Training (JIST) protocol that leverages large uncurated sets of images through a multi-task learning framework. With JIST we also introduce SeqGeM, an aggregation layer that revisits the popular GeM pooling to produce a single robust and compact embedding from a sequence of single-frame embeddings. We show that our model outperforms the previous state of the art while being faster, using eight times smaller descriptors, having a lighter architecture, and supporting sequences of various lengths.


I. INTRODUCTION
Localization is a fundamental functionality for autonomous mobile robots, and one of its key ingredients is Visual Place Recognition (VPR) [1], i.e., the task of matching a current visual observation (an image or video stream) to previously visited places. For example, VPR is used for loop closure detection in SLAM [2], for re-localization in the kidnapped robot problem [3] and also for pure localization when a map is already available [4] and when GNSS measurements are precluded [5], [6]. Additionally, VPR is used to select rough candidates for precise 6-DoF pose estimation (i.e., visual localization) [4], [7].
Across these robotics applications, VPR is typically performed using methods that process short sequences of images acquired by cameras onboard the robot (what is called sequence-to-sequence or seq2seq place recognition [8]). A recent trend in this sense is to frame the seq2seq problem as a retrieval task on learnt embeddings (sequence descriptors) that represent entire sequences rather than individual frames [9]-[12]. This new paradigm not only intrinsically captures the temporal information in the video stream, but it is also more efficient than individually matching each frame with previous observations [11], [12]. However, the accuracy and robustness achieved by sequence descriptors are bounded by the limited availability of large datasets of sequences. Indeed, for the classic image-to-image VPR (im2im [8]) the availability of massive datasets has been instrumental in setting the latest state of the art [13], [14], producing descriptors that generalize better and are very compact. Yet, due to difficulties in curating sequences [8], [16], the largest dataset currently available for the seq2seq task (Mapillary Street Level Sequences [8]) is 40× smaller than the largest datasets for image-to-image VPR [13], [17].
Given the correlation between the seq2seq and the im2im tasks, we argue that it is possible to produce more effective sequence descriptors by jointly training a model not only on sequences, but also on the readily available massive datasets for image-to-image VPR: on one hand, the im2im training from huge-scale datasets would improve the model's generalizability; on the other, sequence-to-sequence learning would give the model robustness to sequentially changing scenes and teach it how to temporally aggregate frame-level information. To this end, we propose a new training methodology that jointly uses images and sequences and exploits a state-of-the-art architecture originally developed for im2im VPR to first extract discriminative embeddings from individual frames and then aggregate them. While this new training method enables the model to also learn effectively from large datasets for the im2im task, it does not automatically solve the issue of the large-dimensional embeddings required by previous SOTA [12]. To address this issue, in section III-C we introduce a new aggregation layer called SeqGeM, which revisits the popular generalized mean pooling [18] by applying it along the temporal axis, resulting in very compact descriptors and, consequently, speeding up the matching time (see fig. 1). The combination of this training method and SeqGeM takes the name of Joint Image and Sequence Training, or JIST.
To summarize, we bring the following contributions:
• We propose a novel multi-task training framework to leverage existing large-scale datasets of image-to-image VPR and improve upon the seq2seq task [8];
• We introduce the SeqGeM aggregation layer, which revisits the popular generalized mean pooling [18] by aggregating individual frame descriptors along the temporal axis, resulting in compact and robust descriptors regardless of the input sequence length;
• We show that, compared to previous SOTA, our pipeline achieves better results and faster inference thanks to its reliance on smaller dimensional descriptors.

II. RELATED WORKS
Sequence matching. Sequence matching, or frame-by-frame matching, represents an established approach to seq2seq [19], [20], and it operates by building a similarity matrix wherein descriptors of single query frames are compared to database ones. The best match is then determined by aggregating the scores in the matrix under simplifying assumptions, such as constant velocity or no stops [21], which makes it hard to generalize to real-world applications. There is a rich literature on sequence matching that tries to relax these assumptions by exploiting ego-motion information or using complex methods [22] and graph-based frameworks [23]-[25]. Recently, SeqMatchNet [26] has also addressed the fact that these methods rely on learned image-to-image descriptors trained without considering the downstream procedure of score aggregation. Despite these improvements, sequence matching can generally be expensive to perform, as it requires each frame from the query to be matched to each frame of all database sequences, as discussed in [11], [12].
Sequence descriptors. Sequence descriptor methods summarize each sequence with a single embedding which can be used for retrieving the most similar matches. This makes it possible to incorporate temporal cues directly into the descriptors and to perform the similarity search directly on sequences rather than frames, thus greatly reducing the matching time. Facil et al. [9] first introduced the idea of sequence descriptors in VPR using simple aggregation techniques such as concatenation, sum, or processing via an LSTM network. [8] extended their benchmark on the Mapillary Street Level Sequences (MSLS) dataset. A non-learnable aggregation via discrete convolution was explored by Garg et al. [10]. Alternatively, 1D temporal convolutions were employed in SeqNet [11] to obtain a learnable aggregation of frame descriptors. Recently, [27] demonstrated a hyperdimensional computing approach to systematically combine information from multiple single-image descriptors. Considering the architectural differences among these methods, [12] provides a benchmark and taxonomy for seq2seq methods depending on how the frame-level features are fused together, and then it introduces the SeqVLAD aggregation layer that achieves SOTA performance. A follow-up work is found in [28].
Image-to-image place recognition on large databases. There is a parallel body of literature in computer vision on image-to-image place recognition, addressing it as a retrieval task using global image descriptors. For years, the de-facto standard method has been NetVLAD [29], which also introduced the training procedure with mining and triplet loss. However, recently [15] has pointed out that the cost of mining triplets is a major bottleneck that prevents these methods from scaling to large datasets. This consideration inspired a few recent papers to pursue mining-free methods in order to enable training on massive datasets. Firstly, CosPlace [13] provides a method to split large dense datasets into non-overlapping classes, which then allows for training to be performed with scalable loss functions. Using a different approach, Ali-Bey et al. [30] provide a dataset that is already split into well-defined classes, making it possible to use standard retrieval losses without the need for mining. MixVPR [14] uses a similar training approach, and shows that well-designed architectures can provide a boost in recall. Most recently, [31] introduces a novel reward function, named Generalized Contrastive Loss, to dispense with hard-pair mining.
This trend in the literature shows that a method that is able to efficiently leverage large-scale datasets can bring great benefits for performance. In seq2seq VPR this has not been possible because it is hard to obtain such large datasets. In this paper we propose a training protocol for sequence descriptors that is able to leverage the large amount of data readily available for the im2im task, even though it does not contain sequences of frames. Moreover, we show that our approach is able to improve upon previous SOTA while reducing the cost of deployment.

A. Problem setting
We tackle the task of seq2seq VPR that is formally defined in [8]: given a query sequence, the system has to output a sequence from the available database that matches it. Since the database sequences have GPS labels, this makes it possible to infer an estimate of the query's position. A match is deemed correct if any of the retrieved frames is within 25 meters [8] of any of the query frames. The common recall@N metric [11], [12], [15], [20] is used as an aggregate evaluation, and it represents the percentage of queries that have at least one correct match in the top-N candidates.
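The correctness criterion and the recall@N metric described above can be sketched as follows. This is a minimal illustration, assuming each sequence is given as a list of planar frame positions in meters (e.g. UTM coordinates); the function names and the toy data are hypothetical and not taken from any released code.

```python
import math

def is_correct(query_frames, candidate_frames, threshold_m=25.0):
    """True if any candidate frame lies within threshold_m of any query frame."""
    return any(
        math.dist(q, c) < threshold_m
        for q in query_frames
        for c in candidate_frames
    )

def recall_at_n(queries, ranked_candidates, n):
    """Percentage of queries with at least one correct match in the top-n."""
    hits = sum(
        any(is_correct(query, cand) for cand in candidates[:n])
        for query, candidates in zip(queries, ranked_candidates)
    )
    return 100.0 * hits / len(queries)

# Toy data (hypothetical): two query sequences of two frames each.
queries = [
    [(0.0, 0.0), (5.0, 0.0)],
    [(1000.0, 1000.0), (1005.0, 1000.0)],
]
# For each query, database sequences sorted from most to least similar.
ranked_candidates = [
    [[(100.0, 0.0), (105.0, 0.0)], [(20.0, 0.0), (25.0, 0.0)]],
    [[(0.0, 0.0), (5.0, 0.0)], [(500.0, 500.0), (505.0, 500.0)]],
]

r_at_1 = recall_at_n(queries, ranked_candidates, 1)
r_at_2 = recall_at_n(queries, ranked_candidates, 2)
```

In this toy case the first query only finds a correct match at rank 2 and the second query never does, so recall@1 is 0% and recall@2 is 50%.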
Our method builds on the idea that the task of seq2seq VPR can be split into two learning objectives: (i) learn to extract features that are distinctive for localization (i.e., ignore transient objects, focus on static components, their style and relative position) and (ii) model the temporal evolution of these salient features within a sequence. In the spirit of deep learning, it is possible to jointly acquire both capabilities in an end-to-end fashion from a dataset of sequences [12]. However, we observe that for the first objective we do not necessarily need sequences, but we can exploit existing large-scale non-sequential VPR datasets to make our model robust to a large variety of scenarios. Following this intuition, we devise our multi-task learning framework.

B. Multi Task Framework: Overview
The typical descriptor extractor architecture for im2im retrieval is composed of a backbone and an aggregator of feature maps (or tokens). For retrieval on sequences, there is an additional step to aggregate frame-level information [12]. In order to leverage both im2im and seq2seq datasets, we need a unified architecture with a frame-level aggregation layer able to process both individual images and sequences. Thus, we propose a novel double-branched architecture: one branch takes sequential data as input, while the other takes single images (see fig. 2). At each iteration we feed each branch with one batch of its corresponding input, compute the respective losses, backpropagate through the entire model and sum the gradients computed for each loss, which are then used for optimization. In doing so, we ensure that both branches share the same gradients and weights: in practice this makes the backbone and fully connected layer (FC) of the two branches identical, and allows joint optimization on both losses in a Siamese-like fashion. In the following, we explain how the two branches work at inference and training time.
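The idea of summing the gradients of two losses on shared weights can be illustrated with a deliberately tiny toy model. The sketch below assumes a single shared scalar weight `w`, squared-error losses with analytic gradients, and plain gradient descent; all of these are stand-ins chosen for brevity (the actual model is a deep network and is optimized differently), and every name here is hypothetical.

```python
def grad_squared_error(w, x, y):
    """Analytic gradient d/dw of the toy loss (w * x - y) ** 2."""
    return 2.0 * (w * x - y) * x

def joint_step(w, seq_batch, im_batch, lam_seq=1.0, lam_im=1.0, lr=0.01):
    """One update of the shared weight using the weighted sum of both gradients."""
    # Gradient coming from the (toy) sequence branch, averaged over its batch.
    g_seq = sum(grad_squared_error(w, x, y) for x, y in seq_batch) / len(seq_batch)
    # Gradient coming from the (toy) image branch, averaged over its batch.
    g_im = sum(grad_squared_error(w, x, y) for x, y in im_batch) / len(im_batch)
    # Both gradients act on the same weight, so they are summed before the step.
    return w - lr * (lam_seq * g_seq + lam_im * g_im)

# Both toy tasks are solved by w = 2, so joint training converges there.
seq_batch = [(1.0, 2.0)]
im_batch = [(2.0, 4.0)]

w_after_one_step = joint_step(0.0, seq_batch, im_batch)

w = 0.0
for _ in range(200):
    w = joint_step(w, seq_batch, im_batch)
```

Because the weight is shared, a single update reflects both objectives at once, which is the property the double-branched design relies on.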

C. Sequence-to-sequence branch
The sequence-to-sequence branch has the objective of exploiting all the frame-wise information extracted by the im2im branch, via the shared backbone and FC layer, while learning to aggregate temporal information from sequences into compact descriptors. The input to this branch is formed by sequences $x_{seq}$ of frames, with $x_{seq} \in \mathbb{R}^{L \times H \times W \times C}$ (where $L$ is the sequence length). The sequences are passed to a backbone $B$, which extracts $L \times D$ dimensional features, where $D$ depends on the backbone. These features are then passed through an FC layer $F$, which acts as a whitening transformation [18] and produces $L$ descriptors of dimension $D'$ (i.e., one descriptor per frame), where $D'$ can be set to a chosen output dimension. At this point, we need two more ingredients: firstly, a sequence aggregation module that combines these frame descriptors into a single vector (the sequence descriptor); secondly, a loss function for the seq2seq task.
Sequence aggregation: SeqGeM. The current SOTA sequence descriptor from [12] is built using the SeqVLAD aggregator. This module reinterprets the classic NetVLAD module [29] to make it suitable for sequences. In a nutshell, given a set of $D'$-dimensional input descriptors, SeqVLAD produces a single sequence descriptor vector of size $K \cdot D$, where $K$ is a parameter indicating the number of clusters used to summarize the input vectors. In practice, the implementation from [12] uses $K = 64$, which significantly increases the size of the sequence descriptor and, as a consequence, the matching time of the retrieval. To mitigate this problem, [12] uses a PCA compression operation, which nevertheless adds a post-processing computational overhead.
For all these reasons, we propose a new aggregation module called Sequential Generalized Mean (SeqGeM), which revisits the popular GeM layer [18] to operate on sequences, by applying its pooling operation along the temporal axis given a sequence of single-image embeddings (see fig. 3). Formally, the SeqGeM layer is defined as
$$\mathrm{SeqGeM}(d_1, \dots, d_L) = \left( \frac{1}{L} \sum_{i=1}^{L} d_i^{\,p} \right)^{1/p} \tag{1}$$
where $p$ is a learnable parameter, $d_i$ is the descriptor of the $i$-th frame, and the power is applied element-wise. Therefore, the sequence descriptor extraction process is
$$f_{seq} = \mathrm{SeqGeM}\big(F(B(x_{seq}))\big) \tag{2}$$
SeqGeM is implemented with differentiable operations, and it has a few desirable properties: i) it natively produces low-dimensional descriptors without requiring a PCA compression; ii) it is learnable; iii) it has few parameters; iv) it is flexible w.r.t. the length of input sequences, so that sequences of different length can be compared to each other. Finally, SeqGeM is purposefully designed to aggregate only the final descriptors of each frame, instead of the frame's feature maps, as in this way it is able to (i) take advantage of the entire im2im branch, which is trained on a large amount of images, and (ii) take as input small descriptors and produce small outputs, whereas methods that take the feature maps as input (e.g., SeqVLAD) usually produce large-dimensional sequence descriptors, which increases memory and time requirements.
Seq2seq loss. Following best practices from the literature, we use the popular weakly supervised margin triplet loss [29], which takes a query, its positive (a sequence from the same place), and a negative. For best results, negatives need to be mined by selecting those closest to the queries in feature space, because selecting random negatives would lead to trivial triplets (i.e., with loss 0). Given triplets of query, positive and negative, the weakly supervised triplet loss, used to train the seq2seq branch, is defined as:
$$\mathcal{L}_{seq2seq} = \max\big( d(f^{q}_{seq}, f^{p}_{seq}) - d(f^{q}_{seq}, f^{n}_{seq}) + m,\; 0 \big) \tag{3}$$
where $f^{q}_{seq}$, $f^{p}_{seq}$, $f^{n}_{seq}$ represent the features of a query, its positive and negative, $m$ is the margin of the triplet loss, and $d(\cdot, \cdot)$ is the Euclidean distance between two features.
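A minimal sketch of generalized-mean pooling along the temporal axis is given below, in plain Python for readability. It assumes frame descriptors are lists of floats, treats `p` as a fixed number rather than a learnable parameter, and clamps values to a small positive `eps` before raising them to the power `p`; the clamp is a common convention in GeM implementations and is an assumption here, not something stated in the text.

```python
def seq_gem(frame_descriptors, p=3.0, eps=1e-6):
    """Pool L frame descriptors of size D' into one descriptor of size D'."""
    length = len(frame_descriptors)
    dim = len(frame_descriptors[0])
    pooled = []
    for j in range(dim):
        # Mean over the temporal axis of the p-th power of each component.
        mean_of_powers = sum(max(d[j], eps) ** p for d in frame_descriptors) / length
        # The p-th root brings the result back to the original scale.
        pooled.append(mean_of_powers ** (1.0 / p))
    return pooled

# Hypothetical 2-D frame descriptors for a sequence of three frames.
sequence = [[1.0, 2.0], [3.0, 4.0], [2.0, 2.0]]

descriptor = seq_gem(sequence)                 # full sequence
reversed_descriptor = seq_gem(sequence[::-1])  # same frames, reversed order
shorter_descriptor = seq_gem(sequence[:2])     # shorter sequence, same output size
```

The output size depends only on the descriptor dimension, not on the number of frames, and reversing the frames leaves the result unchanged up to floating-point rounding; with `p = 1` the layer reduces to average pooling over time.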

D. Image-to-image branch
The second branch processes single images instead of sequences: given input images $x_{im} \in \mathbb{R}^{H \times W \times C}$, the image branch produces $D'$-dimensional local feature descriptors which can be fed to the image loss. The local feature descriptors are computed as
$$f_{im} = F(B(x_{im})) \tag{4}$$
where the backbone $B$ and fully connected layer $F$ are shared with the sequence-to-sequence branch (see fig. 2). Finally, we attach a loss $\mathcal{L}_{im2im}$ for the image-to-image task, which backpropagates through $B$ and $F$.
Im2im loss. Since our goal for this branch is to exploit huge datasets of single images to learn robust representations, we resort to the CosPlace training protocol and loss [13], which is the current state of the art for large-scale im2im VPR and was designed to be used on the massive San Francisco eXtra Large (SF-XL) dataset. Below we provide a summarized explanation of the CosPlace training protocol, although we note that this is not meant to be a thorough description and we refer the reader to the original CosPlace paper [13] for a more detailed explanation.
The CosPlace training protocol is divided into two steps. In the first step, the SF-XL dataset, which contains images labeled with UTM coordinates and heading angles, is partitioned into classes based on their position and orientation. This process, which is performed once prior to the actual training, divides the geographical area into small square cells (10×10 meters) and splits each cell into 12 classes along the orientation/heading (i.e., each class is 30° wide), thus ensuring that all the images in a single class view the same scene (by having similar position and orientation). This division of the continuous label space into a finite number of classes enables the usage of highly scalable losses for large-scale image retrieval, such as the CosFace loss [32].
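The partitioning step described above can be sketched as a simple discretization of position and heading. This is a simplified illustration, assuming UTM easting/northing in meters and heading in degrees; the function name and the example coordinates are hypothetical, and the sketch is not the CosPlace implementation.

```python
def cosplace_class(utm_east, utm_north, heading_deg, cell_m=10.0, heading_bin_deg=30.0):
    """Map an image's UTM position and heading to a discrete class identifier."""
    cell_east = int(utm_east // cell_m)      # index of the 10 m cell along easting
    cell_north = int(utm_north // cell_m)    # index of the 10 m cell along northing
    # Wrap the heading into [0, 360) and split it into 360 / 30 = 12 bins.
    heading_bin = int((heading_deg % 360.0) // heading_bin_deg)
    return (cell_east, cell_north, heading_bin)

# Two images taken a few meters apart with similar headings share a class...
class_a = cosplace_class(549873.2, 4180001.9, 95.0)
class_b = cosplace_class(549878.0, 4180005.0, 110.0)
# ...while the same position seen with a very different heading does not.
class_c = cosplace_class(549873.2, 4180001.9, 275.0)
```

Images that fall in the same cell and heading bin receive the same label, which is what turns the continuous position/orientation space into a classification problem.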
Therefore, the second step consists in training the model using the CosFace loss on the obtained classes. However, naively using all the classes would be problematic, because images in two adjacent classes may have a very high visual overlap, thus potentially containing the same scene seen from slightly different points of view. Since this would lead to unstable gradients during optimization, the training protocol only considers images from a subset of classes chosen so that no two adjacent classes are used at the same time. This subset is not fixed, but it is changed iteratively during training, to allow the model to see all the images in the dataset.
Summarizing, in this paper we denote as $\mathcal{L}_{im2im}$ the CosFace loss applied according to the CosPlace protocol. However, we want to remark that in principle our multi-task framework is loss-agnostic, so the im2im loss can be easily swapped with another one, for example if a better-performing loss becomes available.

E. Total multi-task loss
Overall, the total loss of our multi-task framework is
$$\mathcal{L} = \lambda_{seq2seq} \, \mathcal{L}_{seq2seq} + \lambda_{im2im} \, \mathcal{L}_{im2im} \tag{5}$$
where $\lambda_{seq2seq}$ and $\lambda_{im2im}$ are hyperparameters. The combination of this multi-task loss with the architecture that includes the SeqGeM aggregation makes up our multi-task framework, which we name Joint Image and Sequence Training, or JIST.
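As a small worked example, the margin triplet loss and the weighted combination of the two losses can be written as below. The sketch assumes the standard margin form of the triplet loss with Euclidean distance; the margin value is hypothetical, and the default weights are taken from the experimental setup, reading "10.000" there as ten thousand.

```python
import math

def triplet_loss(f_query, f_positive, f_negative, margin=0.1):
    """Margin triplet loss with Euclidean distance (margin value is hypothetical)."""
    d_pos = math.dist(f_query, f_positive)
    d_neg = math.dist(f_query, f_negative)
    return max(d_pos - d_neg + margin, 0.0)

def total_loss(loss_seq2seq, loss_im2im, lambda_seq2seq=10_000.0, lambda_im2im=100.0):
    """Weighted sum of the sequence loss and the image loss."""
    return lambda_seq2seq * loss_seq2seq + lambda_im2im * loss_im2im

# Easy triplet: the negative is already far from the query, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=1.0)
# Hard triplet: the negative is closer to the query than the positive.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.5)

# Combine with a made-up value of the image loss.
combined = total_loss(hard, 2.0)
```

The easy triplet illustrates why random negatives are uninformative: once the negative is far enough from the query, the loss and its gradient vanish.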

IV. EXPERIMENTS
A. Experimental setup
Datasets. To assess the soundness of the JIST multi-task training framework, we use the following datasets:
• Mapillary Street-Level Sequences (MSLS) [8] is built from various cities around the world, split into non-overlapping training, validation and test sets, and consists of 393k query sequences and 733k database sequences (if we consider 5-frame sequences). As the original test set labels are not released by the authors, we follow the splits defined in [12].
  • Test set: Copenhagen, San Francisco
• MSLS Melbourne is the subset of MSLS from the city of Melbourne, and it is commonly used [11], [12], [26] to understand the effect of training only on a single city as opposed to the entire MSLS train set. When the model is trained on Melbourne, the validation and testing are performed on the standard MSLS val and test sets.
• San Francisco eXtra Large (SF-XL) [13] is a large-scale (41M images) im2im dataset covering the whole city of San Francisco, and it is used as a training set for the CosPlace component of the loss. Note that CosPlace requires camera heading labels, meaning that most other datasets (MSLS included) cannot be used for training CosPlace.
• Oxford RobotCar [35] is a small dataset containing roughly 4k queries and database sequences in each split. It contains multiple traversals of the same path around the city of Oxford. Laps are recorded at different times of the day and year, as well as under changing weather conditions, targeting robustness to domain shifts. In the literature there is little consistency on which splits to adopt [10], [11], [26], thus as with MSLS we follow the one proposed in [12].
For training, we set $\lambda_{im2im} = 100$ and $\lambda_{seq2seq} = 10{,}000$. The learning rate is set to 0.00001 and we use Adam [36] as optimizer. We train our model for a fixed number of iterations, namely 12.5k. To speed up convergence and reduce the carbon footprint of our training runs, we initialize the backbone with the open-source pretrained weights from CosPlace. Regarding our architecture, we use a ResNet-18 [37] backbone, which has an output dimensionality $D = 512$. We keep the same dimension after the linear projection, $D' = 512$, except for experiments in table III where we show that our method also works well with smaller descriptors. The parameter $p$ of SeqGeM is initialized to 3.
Evaluation. We use a standard kNN to find the predictions for each query. As metric, we use the Recall@N, defined as the percentage of queries that have at least one correct positive within the first N predictions. A prediction is deemed correct if at least one of its frames is less than 25 meters away from at least one of the query's frames, following the definition of seq2seq in [8]. Unless otherwise specified we use a sequence length of 5 following previous work [12], although in fig. 5 we show that SeqGeM is able to produce robust descriptors even with different sequence lengths. Given that in VPR it is logical either to train and test on different (non-overlapping) geographical areas [29], or to consider the train and test sets to be geographically overlapping [13], we compute results for both cases: results on MSLS use geographically disjoint sets, whereas results on RobotCar use the same area for training and testing.
Methods. We report results from a large number of methods on the task of seq2seq VPR. Wherever available, we made use of the authors' official code for our comparisons. For methods based on the traditional sequence matching, we compare against three popular implementations: SeqSLAM [20], HVPR [11], and SeqMatchNet [26]. We also compare to existing methods based on sequence descriptors. Starting from the work of [8], we test standard concatenation (CAT) of the popular im2im descriptors NetVLAD [29] and GeM [18] using different backbones. We also compute results with Delta Descriptors [10], a non-learnt pooling in this category. We compare against Fully-Connected layers on top of flattened frame descriptors [9], varying the feature extractor. Additionally, we test the learnable pooling of SeqNet [11] and the previous SOTA represented by SeqVLAD [12]. Finally, we test a method that processes all frames as a single entity from the first layers, namely the TimeSformer [34].
For methods that produce huge descriptors, mostly due to NetVLAD applied on each frame of the sequences, we followed [12] and applied PCA for dimensionality reduction.It is noteworthy in this sense that our proposed pipeline naturally outputs compact descriptors (512-D) freeing ourselves from the extra cost of applying PCA, while also achieving higher results despite the lower dimensionality.
A few methods (HVPR, SeqMatchNet and SeqNet) could not be trained on the whole MSLS due to the large memory requirements of their implementation (more than 256 GB of RAM), which is why some results are missing. Finally, we clarify that the official code for Delta Descriptors and SeqSLAM does not train frame-level descriptors and relies on pre-trained networks. In the table they are highlighted with *.
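The kNN retrieval step mentioned in the evaluation protocol can be sketched as an exhaustive search over database descriptors. This is a minimal brute-force illustration in plain Python, assuming Euclidean distance between descriptors; the function name and the toy descriptors are hypothetical, and a real system would typically rely on an optimized nearest-neighbor library.

```python
import math

def knn(query_descriptor, database_descriptors, k):
    """Return the indices of the k database descriptors closest to the query."""
    ranked = sorted(
        range(len(database_descriptors)),
        key=lambda i: math.dist(query_descriptor, database_descriptors[i]),
    )
    return ranked[:k]

# Hypothetical 2-D sequence descriptors for three database sequences.
database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
predictions = knn([0.9, 1.2], database, k=2)
```

Each query costs one distance computation per database entry, and each distance costs one operation per descriptor dimension, which is why matching time grows with both database size and descriptor size.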

B. Results and discussion
To empirically assess the effectiveness of our proposed models against previous literature, we report a wide set of experiments in table I, and precision-recall curves for the most relevant methods in fig. 4.
We summarize the findings from experiments as follows:
• JIST achieves excellent results with small-dimensional descriptors, even when trained on less sequential data (i.e., training on Melbourne);
• SeqVLAD achieves overall good results, but its recalls are poor when trained on less data;
• Despite its strong results, JIST is extremely fast and uses a simple model for inference;
• Extraction time depends mostly on the backbone, and only slightly depends on the aggregation layer (e.g., CAT, SeqGeM, FC);
• On all considered testing datasets, extraction is the bottleneck, although for a bigger dataset matching would be slower, as its speed linearly depends on dataset size;
• We empirically verified that matching time is linearly correlated to descriptor dimension for sequence descriptor (i.e., pure retrieval) methods.
Computational cost. Besides being fast to train (less than 10 hours on a single GPU), JIST provides very efficient inference, due to small descriptors and a lightweight architecture. Specifically, we rely on a ResNet-18, which has only 11M parameters, leading to fast feature extraction time.
Matching time is also small (8 times smaller than previous SOTA), due to SeqGeM's compact output: in fact the matching time (i.e., the time it takes to find the descriptors matching the query's through a kNN) depends only on the descriptors' dimension and the size of the database. Note that, as we scale to larger datasets (with more sequences in the database), the bottleneck of a VPR system at inference shifts from extraction to matching, making compact descriptors and fast matching an important characteristic for large-scale deployment [15].
Ablation on the loss. In this paragraph we aim at understanding how each component of the loss affects results, to justify their use in training. In table II we report results computed with different weights for $\lambda_{seq2seq}$ and $\lambda_{im2im}$, with a ResNet-18 and our proposed SeqGeM layer. We find that when either of the two has a null effect on the backpropagated gradients, the results are evidently lower, showing that both learning objectives are beneficial to the task. Note that using $\lambda_{seq2seq} = 0$ means that only the im2im loss is used (therefore the SeqGeM parameter $p$ is not trained, but simply kept at its initial value of 3). The best results are obtained with $\lambda_{seq2seq} = 10{,}000$ and $\lambda_{im2im} = 100$. Finally, we note that $\mathcal{L}_{seq2seq}$ has a stronger effect than $\mathcal{L}_{im2im}$, as not using $\mathcal{L}_{seq2seq}$ leads to a reduction of 3 percentage points with respect to the best model. This effect shows that while it is possible to learn to extract salient features for localization using only single images, a loss that instructs the model how to aggregate temporal information is necessary.
Clearly, all methods benefit from longer sequences: with more frames, descriptors become more informative, limiting perceptual aliasing. Models trained with JIST outperform all competitors, especially with very short sequences: this is expected behaviour, as the image loss makes it possible to extract informative features even from a single frame.
Effect of reversing frames. Robustness to frame ordering is a desirable property in some realistic use-cases, because it makes it possible to reduce the number of sequences stored in the database. Following [9], [11], [12] we assess each model's robustness to reversing the frame ordering for query sequences, while keeping the database untouched, and report results in table IV. SeqGeM is inherently robust to frame ordering, as are SeqVLAD and TimeSformer, which processes the sequence in its entirety. On the other hand, methods based on FC layers, CAT or sequence matching are the ones that suffer most in this scenario.
Ablation on aggregation layer. Given the importance of aggregation layers in combining features from multiple frames, in table V we report experiments performed with a number of pooling/aggregation layers. This shows a number of desirable properties that are satisfied by SeqGeM, as well as its superior results. In particular, we note that the SeqGeM aggregator provides the following characteristics: (i) learnable, (ii) flexible w.r.t. the length of input sequences, (iii) invariant to frame ordering, (iv) lightweight, besides producing compact output and having few parameters. Note that the results from table V are obtained within the JIST framework/pipeline, making these aggregations achieve superior recalls w.r.t. most of the baselines from table I.

C. Considerations for real-world deployment
As the use of deep models for seq2seq VPR becomes widespread, we investigate the feasibility of deploying such models in the real world. We perform experiments on a Jetson Nano platform. Considering the scenario of a large city like San Francisco, with 1600 kilometers of road, it would require roughly 800k sequences to map the whole city. The previous state-of-the-art model, namely CCT384 [33] with SeqVLAD [12], needs ≈ 36GB (#sequences * descriptor dimension * #bytes) of memory to store all the descriptors. More compact representations (commonly compressed with PCA) usually rely on 4096-D features [12], at the cost of a performance penalty. With SeqGeM, however, we are able to outperform the previous state of the art with 512-D descriptors, which need only 800k * 512 * 4B ≈ 0.75GB, and can be handled by a Jetson Nano.
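The storage estimate above follows the formula #sequences * descriptor dimension * #bytes, which is easy to evaluate for different settings. The helper below is a hypothetical sketch that reports sizes in GiB (2^30 bytes). As a caveat on the arithmetic: with 4 bytes per value the 512-D case comes to roughly 1.5 GiB, while the quoted ≈ 0.75 GB and ≈ 36 GB figures are reproduced when assuming 2 bytes per value (e.g. half precision), which is an assumption made here and not something the text states.

```python
def descriptor_storage_gib(num_sequences, descriptor_dim, bytes_per_value):
    """Memory needed to store all database descriptors, in GiB (2**30 bytes)."""
    return num_sequences * descriptor_dim * bytes_per_value / 2**30

NUM_SEQUENCES = 800_000  # rough number of sequences for a city-scale map

# 512-D descriptors (SeqGeM) with 4-byte and 2-byte values.
seqgem_fp32 = descriptor_storage_gib(NUM_SEQUENCES, 512, 4)
seqgem_fp16 = descriptor_storage_gib(NUM_SEQUENCES, 512, 2)

# 24576-D descriptors (the previous SOTA dimension quoted in the text).
seqvlad_fp32 = descriptor_storage_gib(NUM_SEQUENCES, 24576, 4)
seqvlad_fp16 = descriptor_storage_gib(NUM_SEQUENCES, 24576, 2)

# The ratio between the two footprints equals the ratio of the dimensions.
ratio = seqvlad_fp16 / seqgem_fp16
```

Whatever the precision, the 48× gap in descriptor dimension translates directly into a 48× gap in memory.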
Given this setting, we analyzed the inference time on a Jetson Nano: we found that extracting the descriptor of a sequence takes 276 ms (with our ResNet-18; this does not depend on the size of the database). Matching takes 3.1 seconds with a vanilla kNN (on the whole city of San Francisco). We note that previous works on im2im VPR found that kNN can be sped up by up to 64 times with negligible loss of recall [15] when using approximate/efficient versions of it, like Inverted File Index with Product Quantization [38], [39], leading to a potential processing speed of roughly 3 sequences per second (276ms + (3100/64)ms = 324ms), whereas previous SOTA (with descriptor dimension 24576) would process only 0.4 sequences per second. Even with PCA, the throughput would still be limited to 1.4 sequences per second.
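The throughput figure above can be reproduced with a few lines of arithmetic. The sketch below uses only the timings quoted in the text (276 ms extraction, 3100 ms exact matching, 64× speed-up from approximate search); the function names are hypothetical.

```python
def latency_ms(extraction_ms, matching_ms, matching_speedup=1.0):
    """Per-query latency: descriptor extraction plus (possibly accelerated) matching."""
    return extraction_ms + matching_ms / matching_speedup

def throughput_per_second(latency_in_ms):
    """Number of query sequences processed per second at the given latency."""
    return 1000.0 / latency_in_ms

# Exact kNN: matching dominates the latency.
exact_latency = latency_ms(276.0, 3100.0)
# Approximate search with a 64x speed-up: extraction dominates instead.
approx_latency = latency_ms(276.0, 3100.0, matching_speedup=64.0)

exact_throughput = throughput_per_second(exact_latency)
approx_throughput = throughput_per_second(approx_latency)
```

With exact search the system would process well under one sequence per second; with the accelerated search the latency drops to about 324 ms, i.e. roughly 3 sequences per second.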

V. CONCLUSION
This work proposes a novel training algorithm that efficiently exploits existing data sources to boost performance in sequence-based VPR. We introduce a trainable temporal aggregation layer designed to be flexible to input length and frame ordering, all while guaranteeing compact descriptors. Through extensive experimental evaluation we showcase the improvements that JIST achieves over previous SOTA, as well as robustness to different conditions such as changes in frame ordering, sequence length and different datasets. We empirically demonstrate that our model is able to not only achieve better results, but also be faster and lighter (in terms of RAM and GPU memory).
Limitations. Although sequence descriptors are a competitive solution to efficiently obtain a coarse global localization estimate even in very large environments, their use is intended for when there is the need to search in a large number of sequences (e.g., for loop closure or to bootstrap the localization when lost). Furthermore, we note that a limitation of the current JIST framework is that the two losses require different dataset formats: the im2im branch is trained on large-scale single-image datasets, whereas the seq2seq branch requires continuous sequences.
Future works. Possible directions for follow-up works may explore different strategies for extracting knowledge from large pre-trained models (e.g., distillation), generalizing our multi-task framework to other tasks, or using more than two branches to gather knowledge from other data sources.

Fig. 1. Our multi-task training framework makes it possible to surpass previous SOTA in performance. Thanks to our novel layer SeqGeM, we are able to cut down the matching time by an order of magnitude.

Fig. 2. Overview of the JIST framework. At training time (left) we use two branches, one for sequences and one for single images. Each branch has a separate loss, while sharing part of their weights. The multi-task training makes it possible to obtain discriminative frame-wise embeddings by exploiting the powerful representations learned by the backbone and fully connected layer from single images. At test time (right) we only use the sequence branch, and we follow the standard image retrieval pipeline: embeddings are extracted for both database and query sequences, and then the database sequence that is most similar to the query is predicted through a kNN. Note that in a real-world scenario, the potentially expensive embedding extraction for database sequences can be performed offline, making the framework fast (more information on efficiency in section IV-C).

Fig. 3. Sketch of our proposed SeqGeM layer. Given D-dimensional feature vectors from L frames, SeqGeM produces a single descriptor/embedding of dimensionality D, which contains information from the whole sequence.

Fig. 4. Precision-Recall curves computed on the MSLS test set for the most relevant methods. All models are trained with a ResNet-18 backbone except TimeSformer, which uses a custom backbone.

Fig. 5. The plot shows how different methods react to changes in test-time sequence length (i.e., number of frames). All methods are trained with a fixed sequence length of 5.

TABLE I
EVALUATION OF SEQUENTIAL DESCRIPTORS AND SEQUENCE MATCHING, ON SEQUENCES OF LENGTH 5: RECALL@1 ON VARIOUS DATASETS. SL STANDS FOR SEQUENCE LENGTH, CAT INDICATES CONCATENATION OF DESCRIPTORS, FC STANDS FOR FULLY CONNECTED LAYER. EXTRACTION TIME IS THE TIME TO EXTRACT DESCRIPTORS/EMBEDDINGS, AND MATCHING TIME IS THE TIME TO FIND THE PREDICTIONS GIVEN THE DESCRIPTORS AND THE TEST DATABASE OF MSLS (WITH 13584 SEQUENCES). BOTH TIMES REFER TO A SINGLE QUERY. * DENOTES A NON-TRAINED METHOD. BEST RESULTS IN BOLD, SECOND BEST ARE UNDERLINED.

TABLE II
ABLATION ON THE TWO COMPONENTS OF THE MULTI-TASK LOSS, ON MSLS. BEST RESULTS IN BOLD.

Effect of sequence length. In fig. 5 we investigate the effect of changing the number of frames within sequences (sequence length) at test time, without re-training the model,

TABLE IV
ROBUSTNESS TO THE INVERSION OF THE FRAMES, AS R@1. BEST (LOWEST) DIFFERENCES WHEN INVERTING FRAMES IN BOLD.

TABLE V
COMPARISON OF AGGREGATION LAYERS. RECALL@1 IS COMPUTED WITH THE SAME TRAINING CONFIGURATION ON MSLS SPLITS.