Activities of Daily Living Monitoring via a Wearable Camera: Toward Real-World Applications

Activity recognition from wearable photo-cameras is crucial for lifestyle characterization and health monitoring. However, to enable its widespread use in real-world applications, a high level of generalization needs to be ensured on unseen users. Currently, state-of-the-art methods have been tested only on relatively small datasets consisting of data collected by a few users that are partially seen during training. In this paper, we built a new egocentric dataset acquired by 15 people through a wearable photo-camera and used it to test the generalization capabilities of several state-of-the-art methods for egocentric activity recognition on unseen users and daily image sequences. In addition, we propose several variants of state-of-the-art deep learning architectures, and we show that it is possible to achieve 79.87% accuracy on users unseen during training. Furthermore, to show that the proposed dataset and approach can be useful in real-world applications, where data can be acquired by different wearable cameras and labeled data are scarce, we employed a domain adaptation strategy on two egocentric activity recognition benchmark datasets. These experiments show that the model learned with our dataset can easily be transferred to other domains with a very small amount of labeled data. Taken together, these results show that activity recognition from wearable photo-cameras is mature enough to be tested in real-world applications.


I. INTRODUCTION
Activity recognition through wearable devices has been extensively investigated in the past fifteen years [1]. While early works were mostly based on the use of simple wearable sensors such as accelerometers and heart monitors, during the last decade, a wide variety of sensors have been incorporated into different and more sophisticated types of wearable devices, ranging from motion to radar sensors.
The use of wearable cameras in the context of activity recognition began only very recently. Being small and lightweight, wearable cameras are ubiquitous and can autonomously record data without human intervention during long periods of time. Unlike other wearable sensors, they capture external and directly interpretable information, such as places, objects, and people around the user. With respect to fixed cameras, wearable ones can gather large amounts of human-centric data daily in a naturalistic setting, hence offering rich contextual information about the activities of the user. As a consequence, activity recognition from wearable cameras has several important applications as assistive technology, in particular in the fields of rehabilitation and preventive medicine. Examples include self-monitoring of ambulatory activities of elderly people [2], [3], monitoring patients suffering from dementia [4], [5], and determining the sedentary behavior of a user based on the time they spend watching TV [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Peng Liu.
However, the opportunities for activity recognition from wearable cameras come along with several challenges as well. The main one is to predict the user's activities based not on the observation of the camera wearer (body pose, gestures, etc.), but on his/her context: the objects he/she is manipulating, the people he/she is interacting with, and the environment itself. Additionally, first-person (egocentric) images suffer from huge intra-class variation, because the camera wearer is not static and acts in a large variety of real-world scenarios. Moreover, the lighting conditions are not fixed since the camera can be worn in indoor and outdoor settings at different times of the day. Lifelogging photo-cameras present yet another specific challenge with respect to wearable video-cameras. They continuously take pictures at regular intervals of 20-30 seconds instead of videos with a high number of frames per second, generating image sequences with a low frame rate, typically called visual lifelogs or photo-streams. Therefore, motion estimation, which is useful to describe the scene [7] and disambiguate actions/activities [8], becomes infeasible on such data.
FIGURE 1. Our activity of daily living monitoring system via a wearable camera: A deep neural network architecture is trained on images coming from a specific camera and, more importantly, from a given group of people (ADLEgoDataset). We show that this deep neural network can be successfully used on pictures captured by distinct cameras and/or unseen people with different lifestyles (jobs/hobbies/cultures), needing just a very small amount of new labeled data; consequently, the system might be deployed in real-world applications, where typically the training data distribution differs substantially from the target data distribution.
Despite these technical challenges, recent work has shown very good performance on the task of activity recognition from visual lifelogs. This has been achieved mainly by leveraging deep learning architectures that aim to capture the temporal evolution of semantic features, together with their contextual information [9].
However, as noted in [10], these methods need more extensive validation on a larger-scale dataset and on unseen users before being deployed in real-world applications. Indeed, in real scenarios, the distribution of the training data, also called the source, typically differs from the distribution of new data, also called the target. For instance, this is always the case when the target data are acquired by a different wearable camera than the source data, or when the target data have been collected by people having a very different lifestyle than those who collected the source data, i.e. having different jobs/hobbies and living in different countries. In addition, new data can be unlabeled or scarcely labeled. Therefore, ensuring performance on unseen users from the same domain does not guarantee that the model can be employed in real-world applications. In addition, to guarantee the robustness of the method, performance should remain stable on larger and more varied datasets. To the best of our knowledge, there is currently no large-scale dataset for activity recognition from visual lifelogs. This is mainly due to several difficulties: bystander and user privacy concerns during data collection, the huge effort of the tedious manual annotation process, the lack of a standardized action/activity vocabulary, and the inherent ambiguity of the data annotation itself.
To cope with all these needs for real-world deployment, we first collected a large egocentric dataset acquired through a wearable photo-camera and used it to validate, for the first time, the generalization capabilities of five existing methods for egocentric activity recognition on unseen users. In addition, we quantified the effectiveness of jointly using images from different domains in the same training/test setup. Furthermore, we show that the model trained on our dataset can be easily transferred to other domains (i.e. datasets collected by other wearable cameras, different users, etc.), achieving competitive performance with a small amount of labeled images. An overview of the above-described capabilities of our system is given in Fig. 1.
More specifically, our contributions in this paper are three-fold:
(i) The collection, annotation, and release of a large egocentric dataset of Activities of Daily Living (ADLEgoDataset, footnote a) consisting of 102,227 images, from 15 users, with an average of 6,682 images per user.
(ii) The ranking of state-of-the-art algorithms in dealing not only with unseen full day sequences but also unseen users during training. This ranking also provides a strong baseline for our newly introduced dataset.
(iii) A set of experiments using the correlation alignment (CORAL) adaptation method [11], [12] showing that the model learned with our dataset can be easily and successfully transferred to other existing datasets acquired by two different wearable cameras (i.e. NTCIR-12 [13], [14], and Castro et al.'s [15] datasets), providing competitive results with a very small amount of labeled data.
The rest of the paper is organized as follows. First, in Section II we review the related work. Next, in Section III, we introduce our ADLEgoDataset. In Section IV, we present a classification baseline using state-of-the-art algorithms on our dataset. In Section V, we present our model and experiments on how to transfer the learned model to other domains. Finally, we present our conclusions in Section VI.

II. RELATED WORK
A. ACTIVITY RECOGNITION FROM FIXED CAMERAS
The standard pipeline of human action recognition was introduced in the seminal work of Yamato et al. [16]. This pipeline consists of first extracting feature vectors from a sequence of frames and then predicting an action based on them by using a classifier. This general approach has been extensively used in the past by varying the hand-crafted features and the type of classifier, and it is still in use nowadays [17]. This Computer Vision task, along with several others, has made great strides since the introduction of deep convolutional neural networks (CNNs) [18]. These networks learn feature representations from images and their classification in an end-to-end fashion. Over the last seven years, new architectures that improve their efficiency and accuracy have been presented [19]-[23]. Although CNNs do not model the temporal order of frames from the sequence, temporal learning mechanisms have been used on top of them, e.g. fusion mechanisms [24], three-dimensional convolutional layers [25], and long short-term memory (LSTM) units [26], [27]. Specific deep architectures for action recognition have combined optical flow as an additional stream [28], and, later on, multimodal information such as audio [29]. In this work, our attention is set on activity recognition from wearable cameras. Its main difficulty is that the camera wearer is only partially visible in the images, through his/her hands. Although the approaches detailed above have been adapted to this kind of camera, other methods have been proposed that rely solely on the user interactions with objects, other people, and the scene. These methods are described below.
a. The dataset is publicly available at http://www.ub.edu/cvub/adlegodataset

B. ACTIVITY RECOGNITION FROM EGOCENTRIC VIDEOS
Several works on first-person action recognition from videos have focused on exploiting egocentric features. These features include the location of hands [30]-[32], the interaction with active/passive objects [33]-[38], the head motion [39], [40], the gaze [41]-[45], or a combination of them [46]-[48]. Other methods have explored egocentric contexts like social interactions [49] and the temporal structure of the activities [50]-[52]. Additionally, some approaches have adapted deep third-person action recognition methods [53]-[55] and developed new ones based on reinforcement learning [7]. In this work, we focus on activity recognition from visual lifelogs. In contrast with egocentric videos, they cover longer time periods with a low temporal resolution, hence being suitable for several applications of assistive technology [2]-[6]. Nevertheless, most of the approaches described above cannot be used on visual lifelogs because motion and gaze based features cannot be reliably estimated on such data.

C. ACTIVITY RECOGNITION FROM VISUAL LIFELOGS
Initial work on first-person action recognition from visual lifelogs was presented by Castro et al. [15]. Their approach, based on a late fusion strategy applied at frame-level, combines the output of a CNN with color histograms and timestamps. These additional contextual features are justified by the fact that a person typically performs activities such as cooking in the same place and at about the same time each day. However, this approach has been tested on a dataset acquired by a single user and makes sense only for a single user or several users having the same lifestyle (similar working hours, same job, etc.). A generalized version of this method was proposed in [60], where the outputs of different layers from a CNN were combined to extract more general contextual information. More recent work [9] modeled lifelogs as sequences instead of a set of unrelated images and proposed two methods based on LSTMs for exploiting the temporal evolution of contextual features. Recently, information from different wearable devices, including a camera, was integrated using multimodal approaches for activity recognition. While these methods are promising and are tested on unseen users, they typically rely on off-the-shelf architectures for the visual modality [61], [62]. In this work, we provide solid evidence of the generalization capabilities of several state-of-the-art architectures for activity recognition from visual lifelogs by validating them on a new, large visual lifelog dataset.

D. DOMAIN ADAPTATION
Domain adaptation (DA), also known as the dataset shift problem [63] and mathematically formalized in [64], deals with scenarios where a model trained on a source distribution does not generalize well in the context of a different (but related) target distribution. Two of the currently predominant approaches to address the DA problem are based on the two-stream deep architecture first presented in [65]. The two streams represent the source and target models, respectively. A carefully designed domain regularization loss is employed to adapt the source to the target domain. One approach is to reduce the shift between domains using a discrepancy metric such as the maximum mean discrepancy (MMD) [65]-[68], the central moment discrepancy [69], [70], the correlation alignment (CORAL) function [12], and the Wasserstein metric [71]. Inspired by [72], another successful approach is to find a common feature space using adversarial training [73]-[76]. For example, in [74], a source encoder CNN is trained and its weights are subsequently fixed to train a target encoder. The adversarial training of the target encoder aims to deceive a domain discriminator that distinguishes between samples from both domains. Following the same approach, Ganin and Lempitsky [73] simultaneously trained a generator and a discriminator by inverting the gradients using a special layer.
In this work, with the aim of measuring the effectiveness of our model on data acquired by different cameras and people, and hence having a different distribution with respect to the training data, we use a DA technique on our proposed dataset, i.e. the source domain, and two other available datasets [13]- [15], i.e. the target domains. Although DA is characterized by not having labeled data on the target domain, we consider it in a semi-supervised context, where different amounts of labeled target examples are taken into account.
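To make the discrepancy-based family of methods more concrete, the following is a minimal sketch of a CORAL-style loss. It assumes the commonly cited formulation, i.e. the squared Frobenius distance between the source and target feature covariance matrices scaled by 1/(4d^2), where d is the feature dimension. The function names and the toy feature vectors are illustrative assumptions, not the implementation used in our experiments.

```python
def covariance(rows):
    """Sample covariance matrix (d x d) of a list of n feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    sq = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return sq / (4 * d * d)


# Toy 2-D features (hypothetical values, for illustration only).
source_feats = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
target_feats = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]

print(coral_loss(source_feats, source_feats))  # identical statistics give zero loss
print(coral_loss(source_feats, target_feats))  # differing second-order statistics
```

In a two-stream setting, a term of this kind would typically be added to the classification loss so that the second-order statistics of both domains are pulled together during training.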

III. ACTIVITIES OF DAILY LIVING EGOCENTRIC DATASET
A. RELATED DATASETS
Although several egocentric datasets for action recognition have been published in recent years [77], [78], most of them were recorded using video cameras. Since these devices have much higher energy consumption than lifelog cameras, the videos in these datasets do not cover actions from whole days but capture up to a few hours. Furthermore, considering the obtrusiveness of the cameras, which are typically mounted on the head, most of these datasets only include actions in specific, often indoor, environments. For instance, several existing datasets have focused on tasks like cooking [41], [43], [50], [78], [79], interacting with a toy in a laboratory [49], working [80], [81], or performing indoor daily activities [35], [82]. Only a few datasets captured outdoor activities such as basketball [2] or ambulatory activities [83].
During the last five years, a small number of egocentric visual lifelog datasets for action recognition have been introduced. Unlike the egocentric video datasets described above, these datasets cover full-day activities performed in a larger variety of settings. Both characteristics make lifelogging data more difficult to acquire. First, it requires longer recording times, which also makes the process more expensive. Second, recording several locations and people during a day is subject to more privacy restrictions than recording indoor locations. One of the first datasets was introduced in [15] and released in [59]. It describes the life of only one graduate student using 19 different activities; therefore, it does not allow testing generalization capabilities on other users. Several other datasets have been presented in the context of image retrieval challenges [13], [56]-[58]. Although they capture images from several weeks, the number of originally annotated classes and images is low and mostly describes transportation and ambulation activities. The life of three unrelated subjects was presented in the NTCIR-12 challenge [13]. Although it was independently annotated with 21 daily activity labels by [14], it only considers three people. Another dataset for image retrieval consisting of the annotated moments of two people [56] was released and further labeled in terms of four different activities. This dataset was further used for another image retrieval task in [57]. Finally, a dataset consisting of two subjects and two distinct activities was introduced for the NTCIR-14 challenge [58]. The characteristics of the above-described datasets and ours are summarized in Table 1. This table considers not only the number of people, annotated classes, and images, but also the number of day lifelogs. This latter number is relevant since a robust performance evaluation on frame-sequence data must be done on full sequences and not only on frames.
Moreover, Table 1 also highlights the diversity of the activities of our dataset, which has classes belonging to more activity groups among the ones proposed in [1]. The main difficulties with existing visual lifelog datasets are: (i) the small number of users and lifelog sequences, which prevents thorough testing of the generalization capabilities of machine learning methods for activity recognition; (ii) the limited number of activity categories and their diversity.
VOLUME 8, 2020
Here, we introduce our ADLEgoDataset, a collection of 105,529 images describing the lifestyle of fifteen postgraduate students. In comparison with previous visual lifelog datasets, the activities are not constrained to a specific domain and occurred in a wide variety of indoor and outdoor locations of a city. The set of activity labels is based on previous works [14], [15], [59] and further expanded to 35 activities, thus adding 14 more categories. Moreover, the number of users is greater than in existing datasets by 12 people, as seen in Table 1, which makes it possible to perform a generalization test.

B. DATA COLLECTION
The data was collected by fifteen computer science postgraduate students who wore a lifelogging camera on a lanyard hanging around the neck. The number of female and male participants was 3 and 12, respectively. The collected pictures depict different outdoor and indoor locations across one city. The common place for all the participants was the university where they work or study.
The participants were instructed to perform their daily activities while wearing the camera during whole days. However, they were allowed to put away the camera in situations that they considered private, e.g. using the toilet. They were asked to use the camera for a minimum of 10 days. Because of privacy concerns, all participants were allowed to discard pictures that they considered sensitive, even images from whole days.
We used the first and second versions of the Narrative Clip camera, but only two people wore the first version. Both cameras automatically take a picture approximately every 30 seconds; their main difference is that the latter has a wider field of view and a resolution of 8 megapixels instead of 5. They can operate for 10 to 12 hours without a battery recharge, thus allowing the capture of between 1,200 and 1,900 images per day.
The selected categories for the dataset are general activities from five different egocentric groups [1], as seen in Table 2. The activity labels were based on previous works [14], [15], [59], but they were not specifically targeted to model the student lifestyle and were selected after the recording. As an illustration, the original set of categories proposed for annotation included a broader range of activities such as child rearing, praying, painting, or meditating. However, these were not chosen by any participant during the annotation process.

C. ANNOTATION PROCESS
Most of the participants were also involved in the annotation process since they are the best judges to determine not only what they were doing, but also when an activity started and ended. The correct activity boundaries in a lifelog sequence are important because the temporal context between frames provides more information when single frames suffer from occlusions. For instance, in a cycling sequence a frame might not show the bicycle handlebar and could be classified as walking outside. We used the batch-based annotation tool introduced in [60]. Finally, each recognizable face of people not directly involved in the data collection was manually blurred.

D. DATASET DETAILS
We collected 105,529 pictures from 15 college students and researchers, covering in total 191 days and 35 activities. These activities belong to five of the seven categories presented in [1], as seen in Table 2. The young student lifestyle is implicitly reflected in the number of instances of each activity; for instance, the labels for using a computer and going to a bar appear frequently. The only location in common for all volunteers was the university, and most of the time they did not meet while wearing the camera. Fig. 2 illustrates different settings and activity sequences of three of the users. Each participant wore the camera for a different number of days and at different times, and the mean number of days that the participants wore the camera was 14.6, resulting in 6,816.73 images on average.

IV. ACTIVITY RECOGNITION FROM LIFELOGS
In this section, we aim at ranking the generalization capability of state-of-the-art algorithms for activity recognition from visual lifelogs. Since our focus is only on visual information, we selected algorithms that do not rely on additional data from other sensors, such as those in [61], [62]. Our baseline considers two classification approaches for visual lifelogs: still image and image sequence based approaches. The first scenario consists of determining the activity a person is doing from a single frame, whereas the second scenario takes as input images from a full-day sequence that typically covers several daily activities.
We selected two still image classification methods as a baseline. The first is a convolutional neural network (CNN) that serves as a backbone for the rest of the algorithms. Specifically, we used ResNet-50 as backbone network. The other method is a late fusion ensemble that was introduced by Castro et al. [15] and further generalized by Cartas et al. [60]. Their approach consists of combining different output layers from a CNN using a random forest (RF) as a final classifier, thus named CNN+RF. Concretely, we combined the outputs of the average pooling and the fully-connected layers.
In the case of image sequences, we evaluated the two temporal training approaches presented in [9]. These approaches extract the contextual features from a CNN and use LSTMs as a sequence learning mechanism. The difference between these approaches consists in their training strategy. The first approach trains directly over the full-day image sequence. The second approach trains using a fixed number of LSTM units and sampling a day sequence in a sliding window fashion. Specifically, we tested both LSTM training strategies using as input feature extractors the CNN and CNN+RF methods described above. In order to make a fair comparison between the features extracted from CNN and CNN+RF, the CNN weights were frozen during the training of LSTM.
In addition to these image sequence approaches, we also consider an LSTM variant as a temporal learning mechanism. Namely, we combined the encoding produced by a CNN with a Bidirectional LSTM (BLSTM) [84]. This kind of Recurrent Neural Network (RNN) evaluates a sequence in forward and backward order and merges the results. Thus, it captures patterns that might have been missed by the unidirectional version and that can potentially lead to more robust representations. We implemented the CNN+BLSTM and CNN+RF+BLSTM methods using the same training approaches described above. All our ranking baseline models are depicted in Fig. 3. In order to have a fair comparison, after fine-tuning the CNN baseline model, it was used as a feature extractor for the rest of the evaluated methods.
With the aim of having a more realistic testing setting than previous works [9], [15], [60], we performed a special split on the ADLEgoDataset. Specifically, the test split not only considers multiple seen users and their unseen day sequences, but also different unseen users during training.
We first detail the dataset split in Section IV-A. Next, in Section IV-B, we outline the implementation details of the state-of-the-art algorithms and their bidirectional counterparts. Finally, we discuss the evaluation metrics and results of our experiments in Section IV-C.

A. DATASET SPLIT
Our goal in creating the training and test partitions was to enable the evaluation of the generalization capabilities of several state-of-the-art algorithms on the ADLEgoDataset. In contrast with previous works [15], [59], we did not randomly and proportionally split each category of the data. Indeed, this kind of training/testing partition is not reliable on sequential data since consecutive frames depicting similar information might be present in both partitions. Therefore, instead of hiding single random frames from the training split, we placed in a test split full-day sequences from users seen during training. This selection was made as proportional as possible with respect to the categories since it had to be representative of the dataset. In contrast with [14], we considered that this kind of partition is not enough to assess the generalization performance, because similar days might depict similar activities in the same context of a person. Consequently, we made another test split consisting of users unseen during training. This test split was not constrained to be representative of the training split. The data percentage of the seen and unseen test users was around 10% and 5%, respectively. Moreover, in this experiment we discarded the activity categories that had fewer than 200 instances or that were performed by only one user, except for four categories (airplane, cleaning, gym, and pets). These categories were also considered for further comparisons in the experiments in Section V.
We first created the unseen users split because it reduced the complexity of the seen users split. The procedure is detailed as follows:

1) UNSEEN USERS SPLIT
First, we enumerated all the possible combinations of unseen users from the 15 users (i.e. 32,767) by using the Twiddle algorithm [85]. Then we calculated the total number of images for each combination, and filtered out the ones that did not contain between 4.5% and 5% of the total number of images in the ADLEgoDataset. Finally, we selected the combination with the lowest number of participants.
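The selection procedure above can be sketched as follows. This is only an illustration under stated assumptions: it uses `itertools.combinations` from the Python standard library instead of the Twiddle algorithm [85], the per-user image counts are hypothetical, and ties between equally small combinations are broken by returning the first one found, a detail that the description above does not specify.

```python
from itertools import combinations
from math import comb

# Number of non-empty subsets of 15 users: 2**15 - 1.
n_subsets = sum(comb(15, k) for k in range(1, 16))
print(n_subsets)  # 32767

# Hypothetical per-user image counts (NOT the real ADLEgoDataset figures).
images_per_user = {"u1": 900, "u2": 300, "u3": 3000, "u4": 1250,
                   "u5": 2500, "u6": 180, "u7": 1870}


def pick_unseen_users(counts, low=0.045, high=0.05):
    """Return the smallest user combination holding 4.5%-5% of all images."""
    total = sum(counts.values())
    users = sorted(counts)
    # Iterating by subset size means the first hit has the fewest users.
    for size in range(1, len(users) + 1):
        for combo in combinations(users, size):
            share = sum(counts[u] for u in combo) / total
            if low <= share <= high:
                return combo, share
    return None, 0.0


unseen, share = pick_unseen_users(images_per_user)
print(unseen, round(share, 3))
```

With the toy counts above, no single user falls inside the target range, so the search moves on to pairs of users.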

2) SEEN USERS SPLIT
This split is focused on separating complete days of images (or full-day sequences) from users, rather than separating users. A full-day sequence is composed of several images with different activity labels from one user. The objective of this test split is to separate from the training set full-day sequences that maintain a category distribution similar to that of the whole dataset, thus being representative of what is intended to be learned. We measure the similarity between category distributions using the Bhattacharyya distance.
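As an illustration of this similarity measure, the sketch below computes the Bhattacharyya distance between two category histograms, using the standard definition D_B = -ln(sum_i sqrt(p_i q_i)) over normalized histograms p and q. The histograms and the function names are hypothetical and only serve to show how a candidate split would be compared against the whole dataset.

```python
import math


def normalize(hist):
    """Turn raw category counts into a probability distribution."""
    total = sum(hist)
    return [h / total for h in hist]


def bhattacharyya_distance(p_hist, q_hist):
    """D_B = -ln(BC), with BC = sum_i sqrt(p_i * q_i)."""
    p, q = normalize(p_hist), normalize(q_hist)
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    if bc == 0.0:
        return math.inf  # no overlapping categories at all
    return -math.log(min(1.0, bc))  # clamp guards against rounding above 1


# Hypothetical category counts (three activity classes).
whole_dataset = [50, 30, 20]
candidate_a = [5, 3, 2]   # same proportions as the whole dataset
candidate_b = [1, 1, 8]   # heavily skewed towards the third class

print(bhattacharyya_distance(whole_dataset, candidate_a))
print(bhattacharyya_distance(whole_dataset, candidate_b))
```

A candidate whose distance is closer to zero has a category distribution closer to that of the whole dataset, so `candidate_a` would be preferred here.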
After removing the unseen users from the dataset, the remaining number of users is 9 and their number of full-day sequences is 103. By counting the number of images from each full-day sequence, 10% of the dataset for the split is obtained by selecting between 6 and 32 full-day sequences. We considered that the most representative full-day sequences are the ones with the closest category distribution with respect to the whole dataset. Consequently, finding them involves comparing the category histograms between the whole dataset and all possible combinations of full-day sequences. Although the number of test days is low, the search is prohibitively expensive as it is characterized by combinatorial growth. For instance, the number of test sets considering 6 days out of the 103 is ≈ 1.42 × 10^9, but for 32 days out of the 103 it is ≈ 4.42 × 10^26. Finding the best full-day sequences for the split was performed as a two-step optimization search using heuristics. With the goal of reducing the search space, instead of dealing with single full-day sequences, we first grouped them into bins of one or more full-day sequences. This was modeled as a bin packing problem, where the objects were full-day sequences and the weight of each object was its number of images. To further reduce the search space by half, these bins were matched in pairs with similar category distributions. The idea is that one bin was destined for the training set and the other for the test set. The resulting number of bin pairs was 32, each bin containing between one and two full-day sequences.
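The combinatorial growth and the grouping step can be illustrated with the short sketch below. The binomial coefficients are computed with `math.comb`. The grouping uses first-fit decreasing, a common bin packing heuristic that is assumed here purely for illustration, since the specific heuristic is not detailed above. The day identifiers, image counts, and bin capacity are hypothetical.

```python
import math

# Number of ways to choose 6 or 32 test days out of 103 full-day sequences.
print(f"{math.comb(103, 6):.2e}")
print(f"{math.comb(103, 32):.2e}")


def first_fit_decreasing(day_sizes, capacity):
    """Group days into bins whose total image count stays within capacity.

    Assumed heuristic (first-fit decreasing), not necessarily the one
    used to build the actual split.
    """
    bins = []  # each bin is a list of (day_id, n_images)
    for day, size in sorted(day_sizes.items(), key=lambda kv: -kv[1]):
        for b in bins:
            if sum(s for _, s in b) + size <= capacity:
                b.append((day, size))
                break
        else:
            bins.append([(day, size)])  # no existing bin fits: open a new one
    return bins


# Hypothetical number of images per full-day sequence.
day_sizes = {"d1": 1500, "d2": 700, "d3": 650, "d4": 1400, "d5": 300}
bins = first_fit_decreasing(day_sizes, capacity=1600)
for b in bins:
    print(b)
```

Working with bins instead of individual days reduces the number of objects to combine, which is what makes the exhaustive evaluation of candidates in the second step tractable.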
The second step evaluated all test split candidates to find the most similar to the ADLEgoDataset distribution. A test split candidate is a combination of bin pairs that contains all activity categories and its number of images is approximately 10% of the data. The distributions of all our final splits are depicted in Fig. 4. It shows that all split distributions have a similar shape, except for the unseen users split because it considered random users as described above.

B. RANKING IMPLEMENTATION
1) STILL IMAGE LEVEL
We trained the following two models for static image level classification:
1) CNN. We used ResNet-50 [20] as the CNN and replaced the top layer with a fully-connected layer of 28 outputs. The fine-tuning procedure used Stochastic Gradient Descent (SGD) and a class-weighting scheme based on [86] to handle class imbalance. Moreover, the last ResNet block and the only fully connected (FC) layer were unfrozen. The CNN was initialized with the weights of a network pre-trained on ImageNet [87]. It was trained for 7 epochs using a learning rate α = 1 × 10^−2, a learning rate decay of 5 × 10^−4, a momentum µ = 0.9, and a weight decay of 1 × 10^−3.
2) CNN+RF. Two random forests were trained using the output of different layers from the previously described ResNet-50 network. Specifically, the first RF was trained using as input the features extracted from the average pooling layer. The other RF uses the concatenation of the outputs of the average pooling layer and the FC layer. The number of trees was set to 500, and the Gini impurity criterion [88] was used.

2) IMAGE SEQUENCE LEVEL
The models outlined below take temporal information into account and use the previously trained models as backbones. We used the LSTM and BLSTM networks as temporal architectures. Following [9], each model was trained in two ways that treat an input day lifelog sequence differently. The first training strategy operates directly over a day lifelog, i.e. over the full day sequence. The second training strategy truncates a day lifelog sequence into fixed-size subsequences in a sliding window fashion. To make a fair comparison, the weights and outputs of the backbone models were frozen during training. All the models were trained using the SGD optimization algorithm with different learning rates but the same momentum $\mu = 0.9$, a weight decay of $5 \times 10^{-6}$, a batch size of 1, and a timestep of 5.
1) CNN+LSTM and CNN+BLSTM. These models removed the top layer of the ResNet-50 network and added an LSTM and a BLSTM layer, respectively, having 256 units, followed by a fully-connected layer of 28 outputs. For both models, the learning rates of the full sequence and the sliding window training were $\alpha = 1 \times 10^{-2}$ and $\alpha = 1 \times 10^{-3}$, respectively.
2) CNN+RF+LSTM and CNN+RF+BLSTM. Both models were trained using as input the prediction of the best CNN+RF model, namely the combination of the average pooling and the FC layers. These models added an LSTM and a BLSTM layer, respectively, having 30 units, followed by a fully-connected layer of 28 outputs. The learning rate for both models and types of training was $\alpha = 1 \times 10^{-3}$.
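The sliding window training strategy can be illustrated with a minimal sketch. It assumes a day lifelog is an ordered list of frames and uses the timestep of 5 stated above; the function name `sliding_windows` and the stride of 1 are hypothetical choices, since the stride is not reported here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(day_sequence: Sequence[T], timestep: int = 5,
                    stride: int = 1) -> List[List[T]]:
    """Truncate one day lifelog into fixed-size, overlapping subsequences.

    Assumption: the stride is 1 (not stated in the text); sequences shorter
    than one window produce no subsequence in this sketch.
    """
    if timestep <= 0 or stride <= 0:
        raise ValueError("timestep and stride must be positive")
    last_start = len(day_sequence) - timestep
    return [list(day_sequence[start:start + timestep])
            for start in range(0, last_start + 1, stride)]


# Toy day lifelog with 8 frames (hypothetical identifiers).
day = [f"frame_{i:02d}" for i in range(8)]
windows = sliding_windows(day, timestep=5)
```

With 8 frames and a timestep of 5, the sketch yields 4 overlapping subsequences, each of which would be fed to the temporal layer as one training sample.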

C. RANKING EVALUATION
The model performance was evaluated using the accuracy, the mean average precision (mAP), and macro metrics for precision, recall, and F1-score. Using the accuracy as the only classification metric might be misleading under the class imbalance present in both test splits. The purpose of using these macro metrics is to offer a more solid comparison baseline. Table 3 shows the performance of all the static and temporal models on the seen and unseen test partitions. The best models for the seen and unseen test splits were CNN+BLSTM (80.64%) and CNN+LSTM (79.87%), respectively. In both test splits, the sliding window training resulted in better performance. Although both models achieved a similar accuracy on the test splits, the rest of the metrics remained significantly different. This indicates that the CNN+BLSTM model suffers from overfitting on unseen users. Overall, the best model for both test splits was the CNN+LSTM achieving an 80.12% accuracy, as it had a similar performance on the seen users split, and better performance on the unseen users split.
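The following sketch shows why accuracy alone can mislead under class imbalance and how the macro metrics expose it. The helper name `accuracy_and_macro_metrics` and the toy labels are hypothetical, and treating an undefined per-class ratio as zero is an assumption of this example rather than a documented choice of the evaluation.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy_and_macro_metrics(y_true: Sequence[str],
                               y_pred: Sequence[str]) -> Dict[str, float]:
    """Return accuracy plus macro-averaged precision, recall and F1-score."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        # Assumption: an undefined ratio (no predictions or no samples) is 0.
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        precisions.append(precision)
        recalls.append(recall)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1_scores) / n,
    }


# Toy imbalanced example: a classifier that always predicts the majority class.
y_true = ["walking"] * 8 + ["cooking"] * 2
y_pred = ["walking"] * 10
metrics = accuracy_and_macro_metrics(y_true, y_pred)
```

In this toy case the accuracy is 0.8 although the minority class is never recognized, which the macro recall of 0.5 makes visible.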
In contrast with the results previously obtained in [14], our experiments indicate that the CNN+RF models decreased the overall accuracy of the ResNet-50 network. Considering both test splits, the macro precision improved whereas the macro recall decreased. This indicates that the CNN+RF models are confident in their predictions, but they miss a large number of class samples. Consequently, both temporal models trained on top of this configuration (CNN+RF+LSTM and CNN+RF+BLSTM) score lower than the CNN baseline in all the considered metrics. This is likely due to the fact that here we use a different dataset from the one used in that work (NTCIR-12 [13], [14]) and an unseen users split in our test set.
The confusion matrices of the best CNN+BLSTM and CNN+LSTM models for the seen and unseen test splits are illustrated in Fig. 5. A direct comparison of all classes between the test splits cannot be made, as the number of test samples differs and the comparison might be misleading. For instance, some categories, such as airplane or watching TV, do not appear in the unseen test split. Additionally, some classes, e.g. stairclimbing, have proportionally fewer test samples.
Nevertheless, the results of each temporal model can be compared with those of the CNN model by calculating their difference, as shown at the right of each confusion matrix row in Fig. 5. Since the accuracy improvement with respect to the baseline is higher on the unseen than on the seen test split, there are more changes in its difference. Moreover, the plots show low performance of the CNN model for the categories Cleaning, Relaxing, Drinking, and Writing. This might be due to the large intra-class variability of the category (Relaxing), the social context ambiguity (Formal and Informal meeting), and the fact that some activities occur in very similar places (Cleaning, Cooking and Dishwashing). Further results containing the recall scores for each class on both test splits are reported in Appendix A.

V. GENERALIZATION TO OTHER DOMAINS
In real-world applications, a system pretrained on a large-scale dataset is typically used on new visual lifelogs that were unseen during training and belong to previously unknown users. The images composing these lifelogs might have been recorded with cameras different from the one used to capture the training dataset. For instance, Fig. 6 shows egocentric images of different people washing the dishes in their houses captured with three different wearable cameras. Besides the visual variability of taps and sinks in different kitchens, one can notice the differences in field of view and the angle distortion produced by different lenses. Due to the different nature of the source and target domains, performance on the target domain typically experiences a drop.
FIGURE 5. Normalized confusion matrices of the best models for the seen and unseen test sets and their difference with respect to the CNN model. The increase and decrease of confidence is represented by the intensity of red and blue colors. Note that the classes Airplane, Cleaning, Going to a bar, Gym, and Watching TV do not appear on the unseen users test set.
In this section, we aim at mitigating the performance drop by applying a semi-supervised learning technique, namely domain adaptation (DA). Our goal is to assess the performance between egocentric domains with and without transfer learning, rather than to propose a new adaptation method tailored to egocentric image sequences. Therefore, we strictly focus on a simple image-based DA method, the Deep Correlation Alignment (CORAL) regularization loss [12]. We perform two experiments using the ADLEgoDataset as the source domain, and the NTCIR-12 [13] and Castro et al. [15] datasets as target domains. These datasets were selected as target domains because they are the closest to our dataset in number of activity categories and annotated images, and were recorded with different cameras, as can be appreciated in Table 1. In the first experiment, we measure the performance of adding annotated images from different domains for training without using DA, and we quantify the difference between the target and the source domains. In the second experiment, we use the CORAL loss function as the DA method on the target datasets and determine the amount of labeled target data needed to achieve a good classification performance.
FIGURE 7. Domain adaptation training pipeline. During training, two CNNs with shared weights are used for the source and target data domains, respectively. Since the target domain labels are unknown, only the classification loss for the source CNN is evaluated. The adaptation from the source to the target domain comes from penalizing the discrepancy of their predictions using the domain adaptation loss. In this example, the discrepancy of both images should be high, because the source and target images correspond to the classes eating and driving.
In Section V-A, we detail the domain adaptation technique we used, namely the CORAL regularization loss. Next, in Section V-B we outline the datasets we used and their splits on the two experiments. In Section V-C, we thoroughly describe the implemented models. The experimental results evaluation and discussion are presented in Section V-D.

A. DOMAIN ADAPTATION USING A REGULARIZATION LOSS
Let $L_S = \{y_i\}$, $i \in \{1, \dots, L\}$, be the labels from the source domain, and let us assume that the target domain has only unlabeled examples. During training, both domains have their own CNN architecture with shared weights, but only the source domain has a classification loss $\ell_{CLASS}$. In order to adapt the learned model from the source to the target domain, a regularization loss $\ell_{DA}$ is used. This domain regularization loss penalizes the discrepancy between the output distributions of two single feature layers of dimension $d$. This is a common setting used in [12], [65], [70] and it is illustrated in Fig. 7, where a single DA loss penalizes the output of the fully-connected (FC) layers. The training loss function can be expressed as:
$$\ell = \ell_{CLASS} + \sum_{i=1}^{n} \lambda_i \, \ell_{DA_i}$$
where $n$ is the number of DA regularization layers in the network and $\lambda_i$ denotes the hyperparameter that trades off the adaptation with classification accuracy for the $i$-th layer. Since our CNNs only had one FC layer, we only used one DA loss. Specifically, we used the CORAL regularization loss [11], [12]. One of its advantages is that only the hyperparameter $\lambda$ needs to be set. In this context, the output features of the source and target layers are said to come from the source domain $D_S = \{x_i\}$, $x \in \mathbb{R}^d$, and the target domain $D_T = \{u_i\}$, $u \in \mathbb{R}^d$, respectively. Then the CORAL regularization loss can be defined as:
$$\ell_{CORAL} = \frac{1}{4d^2} \left\| C_S - C_T \right\|_F^2$$
where $\|\cdot\|_F^2$ denotes the squared matrix Frobenius norm and $C$ is the covariance of $D$ given by:
$$C_D = \frac{1}{m-1} \left( D^\top D - \frac{1}{m} (\mathbf{1}^\top D)^\top (\mathbf{1}^\top D) \right)$$
where $m$ is the number of data in the domain $D$ and $\mathbf{1}$ is a column vector with all elements equal to 1. The CORAL loss thus penalizes the discrepancy between the second-order statistics of the source and target features; as in the example of Fig. 7, the penalty is expected to be high when the source and target images correspond to different classes.
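As an illustration of the CORAL regularization loss, the sketch below computes the covariance of a batch of $d$-dimensional features and the squared Frobenius distance between the source and target covariances, scaled by $1/(4d^2)$. It is a pure-Python toy written for readability, not the training implementation; the function names and the toy batches are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def covariance(features: Matrix) -> Matrix:
    """Covariance of an m x d batch D: (D^T D - (1^T D)^T (1^T D) / m) / (m - 1)."""
    m, d = len(features), len(features[0])
    if m < 2:
        raise ValueError("at least two samples are needed")
    col_sums = [sum(row[j] for row in features) for j in range(d)]
    cov = []
    for j in range(d):
        cov_row = []
        for k in range(d):
            dtd_jk = sum(row[j] * row[k] for row in features)
            cov_row.append((dtd_jk - col_sums[j] * col_sums[k] / m) / (m - 1))
        cov.append(cov_row)
    return cov


def coral_loss(source: Matrix, target: Matrix) -> float:
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    c_s, c_t = covariance(source), covariance(target)
    sq_frobenius = sum((c_s[j][k] - c_t[j][k]) ** 2
                       for j in range(d) for k in range(d))
    return sq_frobenius / (4 * d * d)


# Toy feature batches (hypothetical values, d = 2, m = 4).
source_batch = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
scaled_batch = [[2 * v for v in row] for row in source_batch]   # different spread
shifted_batch = [[v + 10 for v in row] for row in source_batch]  # same spread

loss_scaled = coral_loss(source_batch, scaled_batch)
loss_shifted = coral_loss(source_batch, shifted_batch)
```

The shifted batch gives a loss of zero because the covariance ignores the mean, whereas the rescaled batch is penalized. This reflects that the loss compares batch-level second-order statistics and requires no target labels.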

B. SOURCE AND TARGET DATASETS DETAILS
In our experiments, we used the ADLEgoDataset as the source domain dataset, and the NTCIR-12 [13] and Castro et al. [15] datasets as target domain datasets. Both datasets were selected as target domains since they used different cameras and have more annotated categories and images than other lifelogging datasets, as can be appreciated in Table 1. Additionally, the domain visual difference with respect to our dataset can be appreciated in Fig. 6. We did not consider using the NTCIR-12 [13] and Castro's datasets as source domains since they have fewer people, half of the images, and fewer activity categories. Since their labels correspond to a different set of activity categories than ours, we manually mapped the matching categories. More categories would have required an automatic matching between words. The resulting categories and data distributions are shown in Fig. 8. The numbers of source and target images were 96,632 and 44,902 for the NTCIR-12 dataset, and 68,507 and 39,166 for Castro's dataset. The specific data splits for each experiment are detailed below.

1) TRAINING WITHOUT DOMAIN ADAPTATION
The goal of this experiment was to measure the performance of adding images from two different domains only during training without using DA. Therefore, we combined the source dataset with each target dataset for the training and validation splits, but the testing split only considered images from the source domain. Explicitly, we used the same splits for the source images as described in Section IV-A. The images from the target domains were randomly stratified in a 90/10% proportion for the training/validation splits.

2) DOMAIN ADAPTATION ON THE TARGET DATASETS
The objective of this experiment was to (i) use transfer learning in a practical setting and (ii) determine the required amount of labeled data from the target domain to obtain a good classification performance. The initial setting of the experiment considered that only the source domain data was labeled, but later different proportions of labeled data from the target domain were added. First, we randomly stratified the source data into training and validation sets, and the target data into training and testing sets. Throughout the experiment, the proportion of training and validation data of the source images was fixed and set to 90/10%, whereas the proportion of training and testing data of the target images was initially set to 85/15%. Subsequently, different proportions (10, 20, . . . , 50%) of images were randomly and incrementally removed from the target training split. These images were added to the training/validation splits of the source domain while maintaining their original 90/10% proportion.
FIGURE 9. Sensitivity of the CORAL distance due to different learning rates using the Xception network. These results were obtained by only measuring the CORAL distance and not penalizing it (i.e. by having a fixed λ = 0).
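The incremental transfer of labeled target images to the source splits can be sketched as follows. This is a simplified illustration: it uses a plain random shuffle instead of the stratified sampling described above, and it assumes that the percentages are taken with respect to the target training split. The function name and dictionary keys are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def incremental_target_transfer(target_train: Sequence[str],
                                percentages: Sequence[int] = (10, 20, 30, 40, 50),
                                val_fraction: float = 0.10,
                                seed: int = 0) -> Dict[int, Dict[str, List[str]]]:
    """Move growing portions of the target training split to the source splits.

    One shuffle is shared by all percentages, so the images moved at 10% are
    also among those moved at 20%, and so on (incremental removal).
    """
    rng = random.Random(seed)
    shuffled = list(target_train)
    rng.shuffle(shuffled)
    splits = {}
    for pct in percentages:
        n_moved = round(len(shuffled) * pct / 100)
        moved, remaining = shuffled[:n_moved], shuffled[n_moved:]
        # Moved images keep the 90/10% training/validation proportion.
        n_val = round(n_moved * val_fraction)
        splits[pct] = {
            "to_source_val": moved[:n_val],
            "to_source_train": moved[n_val:],
            "target_train": remaining,
        }
    return splits


# Toy target training split of 100 image identifiers (hypothetical).
toy_target_train = [f"img_{i:03d}" for i in range(100)]
toy_splits = incremental_target_transfer(toy_target_train)
```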

C. GENERALIZATION EXPERIMENTS IMPLEMENTATION
The following paragraphs describe the training settings for each experiment.

1) TRAINING WITHOUT DOMAIN ADAPTATION
We used a ResNet-50 [20] network as the CNN model and replaced its top layer with a fully-connected (FC) layer of 28 outputs. To obtain results comparable with the classification baseline of Section IV, we deliberately used the same network. It was trained using Stochastic Gradient Descent (SGD) with its weights initialized from a network pre-trained on ImageNet [87]. The last ResNet block and the FC layer were unfrozen during the fine-tuning procedure. The training parameters were a learning rate $\alpha = 1 \times 10^{-2}$, a learning rate decay of $5 \times 10^{-4}$, and a momentum $\mu = 0.9$. Since we used two validation splits, the training was stopped when their epoch losses no longer improved. The numbers of epochs for the NTCIR-12 and Castro's target datasets were 6 and 9, respectively.
Domains Discrepancy: As a means to quantify the difference between the source dataset and the target datasets, we calculated the maximum mean discrepancy (MMD) [89] between them for each shared category. First, we sampled between 500 and 1,000 images per category that were both in the source and the target datasets. These sampled images took into account all users and all days. Then, for each sampled image, we extracted a feature vector from the last pooling layer of a ResNet-50 CNN pre-trained on ImageNet [87]. Finally, we calculated the MMD between the sets of feature vectors of the source and target datasets using a Gaussian kernel with a σ = 0.1.
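A minimal sketch of the discrepancy computation is given below. It assumes the biased empirical estimate of the squared MMD and the kernel parametrization $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, since neither the exact estimator nor the kernel convention is specified here. The toy two-dimensional vectors stand in for the pooled ResNet-50 features.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, sigma: float = 0.1) -> float:
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) (assumed parametrization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mmd_squared(source: Sequence[Vector], target: Sequence[Vector],
                sigma: float = 0.1) -> float:
    """Biased empirical estimate of the squared MMD between two feature sets."""
    def mean_kernel(a: Sequence[Vector], b: Sequence[Vector]) -> float:
        total = sum(gaussian_kernel(x, y, sigma) for x in a for y in b)
        return total / (len(a) * len(b))

    return (mean_kernel(source, source) + mean_kernel(target, target)
            - 2 * mean_kernel(source, target))


# Toy feature sets for one shared category (hypothetical values).
source_features = [[0.0, 0.0], [0.1, 0.0]]
target_features = [[1.0, 1.0], [1.1, 1.0]]
discrepancy = mmd_squared(source_features, target_features)
```

Identical feature sets give a value of zero, and the value grows as the two sets move apart, which matches the reading of Fig. 11 that values closer to zero indicate more similar domains.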

2) DOMAIN ADAPTATION ON THE TARGET DATASETS
a: ARCHITECTURE SETUP
In comparison with AlexNet, the Xception and ResNet-50 architectures have only one FC layer, making it the only layer suitable for the CORAL loss. The weights of this FC layer were initialized with $\mathcal{N}(0, 0.005)$ and its learning rate was set ten times larger than that of the other layers, as stated in [12]. The rest of the layers were initialized using pre-trained weights on ImageNet [90]. We initially kept frozen all the layers except the classification layer, but it had a negative impact on the performance in the target domain. Hence, the layers from the last ResNet block of the ResNet-50 architecture and the exit flow block of the Xception architecture were unfrozen. We used SGD as an optimization method for both networks.

b: LEARNING RATE α TUNING
We experimentally found that an adequate learning rate α had to be high enough to produce a significant CORAL distance between the source and the target domain, but not so high that the training failed to converge. In order to find it, we first varied the learning rates while maintaining the other parameters constant and setting λ = 0. In other words, the training was performed without penalizing the discrepancy between domains, but measuring their distance. For instance, Fig. 9 illustrates significantly different CORAL distances for two different learning rates on both target datasets. In both cases, the highest learning rates were used, since training with them converged. Additionally, in our experiments, the lower learning rate did not produce higher accuracy scores for the training split of the target domain.
The final training parameters for ResNet-50 were a learning rate of $\alpha = 5 \times 10^{-3}$, a batch size of 60, a momentum of $\mu = 0.9$, and a weight decay of $5 \times 10^{-4}$. The training parameters for the Xception network were a learning rate of $\alpha = 5 \times 10^{-2}$, a batch size of 40, a momentum of $\mu = 0.9$, and a weight decay of $5 \times 10^{-4}$.

c: CORAL LOSS WEIGHT λ TUNING
After finding an adequate learning rate, we trained the ResNet-50 and Xception networks for λ = 0, 0.1, . . . , 1. The best value of λ was selected considering only the highest validation accuracy on the source domain, as the target labels are assumed to be unknown. The best values of λ for ResNet-50 were 0.3 and 0.5 on the NTCIR-12 and Castro's datasets, respectively, whereas the best values of λ for Xception were 0.5 and 0.2 on the NTCIR-12 and Castro's datasets, respectively.
... [15], [59] and NTCIR-12 [13], [14] datasets for training without domain adaptation. Best result per measure is shown in bold. Note that not all categories appeared on the unseen users test set.
FIGURE 11. Maximum mean discrepancy (MMD) between the categories from the source and target datasets. The closer the value to zero the more similar the domains for that category.
The validation accuracy plots for the ResNet-50 and Xception networks on both datasets are shown in Fig. 10. Two observations can be made from these plots. First, the areas between the minimal and maximal values of the accuracy obtained using the different values of λ suggest that the training of the Xception network is more unstable than that of the ResNet-50 network. Consequently, no further experiments were performed using the Xception network. Second, the difference between the target accuracy of both datasets (≈ 73.22% for Castro and ≈ 47.92% for NTCIR-12) shows that a good performance is not always achieved using the CORAL loss alone. Therefore, labeled data from the target domain is needed during training.
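The selection rule for λ can be summarized in a short sketch: the grid λ = 0, 0.1, ..., 1 is scanned and the value with the highest source-domain validation accuracy is kept, without consulting any target accuracy. The accuracy values below are invented for illustration only and are not results from the experiments; `select_lambda` is a hypothetical helper name.

```python
from typing import Dict


def select_lambda(source_val_accuracy: Dict[float, float]) -> float:
    """Pick the CORAL weight with the highest source validation accuracy."""
    if not source_val_accuracy:
        raise ValueError("no candidate values were evaluated")
    return max(source_val_accuracy, key=source_val_accuracy.get)


# Grid of candidate weights: 0, 0.1, ..., 1.
lambda_grid = [round(0.1 * i, 1) for i in range(11)]

# Hypothetical source validation accuracies, invented for this example.
toy_source_val_accuracy = {lam: 0.70 - (lam - 0.3) ** 2 for lam in lambda_grid}

best_lambda = select_lambda(toy_source_val_accuracy)
```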

d: ADDITION OF TARGET LABELED DATA TO THE SOURCE DOMAIN
After fine-tuning the hyperparameters, we separately trained the ResNet-50 network adding different percentages of random target labeled data to the source domain. The considered percentages of target data were 0, 10%, . . . , 50% and were selected as described in Section V-B.

D. GENERALIZATION EXPERIMENTS EVALUATION
1) TRAINING WITHOUT DOMAIN ADAPTATION
The objective of this experiment was to (i) measure the activity classification performance when mixing the source and target datasets during training without a DA method and (ii) estimate how different the source and target domains were. Given the class imbalance present in the dataset and for comparative purposes, we used the same performance metrics as the experiments presented in Section IV-C. The discrepancy between shared categories of the source and target domains was calculated using the MMD as described in Section V-C.
The classification results of separately adding Castro's and NTCIR-12 datasets for training are presented in Table 4. It shows that the addition of labeled data from the target domains diminished all the evaluated performance metrics; in particular, the accuracy was lower by 13.71% on average. The overall classification performance of adding Castro's dataset was better than when adding the NTCIR-12 dataset. This is also reflected in their calculated discrepancy with respect to the source domain. Fig. 11 shows the MMD for each shared category between each target and source domains, and a horizontal line representing its mean. The MMD mean for Castro's dataset is lower than for the NTCIR-12 dataset, meaning that it is more similar to the ADLEgoDataset. This difference in discrepancy is also reflected in the performance of domain adaptation, as described below. Supplementary results containing the recall scores for each class are reported in Appendix B.

2) DOMAIN ADAPTATION ON THE TARGET DATASETS
The objective of this experiment was to use transfer learning in a practical setting and to determine the required amount of labeled data from the target domain to obtain a good classification performance. As in previous works [12], [65], [70], [73], [91], [92], we use the prediction accuracy over five different training runs as the evaluation metric. Our results only consider the ResNet-50 architecture, since the training of the Xception network was unstable as discussed above. The summarized results are shown in Table 5 and plotted in Fig. 12.
The results in Table 5 show that ResNet-50 was also susceptible to instability during training, producing a high variance in some training runs. This instability only affected Castro's dataset and can be visually seen in the plot of Fig. 12. Therefore, the accuracy median was also considered to measure performance improvement.
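The reason for reporting the median in addition to the mean can be seen in a small sketch: one unstable run out of five lowers the mean noticeably while leaving the median almost unchanged. The accuracies below are invented for illustration and are not values from Table 5; `summarize_runs` is a hypothetical helper name.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(accuracies: Sequence[float]) -> Dict[str, float]:
    """Mean, median and sample standard deviation over repeated training runs."""
    return {
        "mean": statistics.mean(accuracies),
        "median": statistics.median(accuracies),
        "stdev": statistics.stdev(accuracies),
    }


# Hypothetical accuracies (%) of five runs, one of which was unstable.
toy_runs = [72.0, 73.0, 71.0, 72.5, 40.0]
summary = summarize_runs(toy_runs)
```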
The results confirm that performing domain adaptation without using labeled target data does not necessarily achieve a good performance on all target datasets. Specifically, the median accuracy for the NTCIR-12 dataset was 45.14%, whereas for Castro's dataset it was 72.58%. This low performance was improved by adding a small subset of labeled target data to the training. The largest increment in median accuracy was obtained by adding 10-20% of labeled data, i.e. for the NTCIR-12 dataset it improved by 33.32% when adding 10%, and for Castro's dataset it improved by 6.57% after adding 20%. The dataset that benefited the most was the NTCIR-12, since its initial discrepancy was higher, as shown by the previous experiment. The mean and median accuracy curves in Fig. 12 show diminishing gains that level off at around 40% of added data. Although a direct comparison with previous works cannot be made [14], [15], the mean accuracy values at 40% of added data are competitive. In the original works, the accuracies obtained for Castro's and NTCIR-12 datasets were 83.07% and 94.08%, respectively.

VI. CONCLUSION
We introduced the ADLEgoDataset, the largest egocentric lifelog dataset of activities of daily living to date, consisting of 105,529 annotated images. It was recorded by 15 different participants wearing a Narrative Clip camera while performing 35 activities of daily life in a naturalistic setting during a total of 191 days. With respect to other available lifelog datasets, it contains many more categories, annotated images, users and types of activities, hence allowing generalization tests on unseen users.
We presented a strong classification baseline on our dataset that provides a more realistic comparison by testing not only on unseen days but also on unseen users. This baseline was built using existing state-of-the-art algorithms, which also served as a ranking of their generalization capabilities. The best algorithm was the CNN+LSTM trained in a sliding window fashion, which achieved an accuracy of 80.12%.
... [15], [59] and NTCIR-12 [13], [14] datasets without domain adaptation. Best result per measure is shown in bold. Note that not all categories appeared on the unseen users test set.
Moreover, we presented experiments of generalization in different domains. We first showed that the evaluated source and target datasets have a large discrepancy that diminished the classification performance by 13.71% on average. Finally, we used the CORAL loss function as a DA technique and showed that a good performance is not always achieved on different target datasets. Specifically, we obtained a median accuracy value of 72.47% and 45.14% on Castro's and the NTCIR-12 datasets, respectively. We also showed that the performance can improve by incorporating a small percentage of labeled target data into the training. In the case of the NTCIR-12 dataset, the performance improved to 78.46% by randomly adding 10% of target data.
We consider that further research lines using this dataset are twofold. First, taking into account the ambiguity of the context, the activity recognition problem from lifelogs could be posed more naturally as a multi-label classification problem. For instance, a person might be reading a book while being on a train. Second, we only considered full day sequences on the temporal classification algorithms, but splitting them into sub-sequences with higher temporal coherence could improve the classification accuracy. Moreover, we consider that activity recognition from wearable photo-cameras, in conjunction with information coming from more sensors, is mature enough to be tested in real-world applications. These applications could come from different domains, for instance, the assessment of several activities of daily living for the elderly or the monitoring of the wellbeing of young people.
Although the people in our dataset have different lifestyles and hobbies, their activities reflect the life of computer science graduate students. We consider that a dataset captured by users having different jobs would help to cope better with real-world scenarios. For instance, a construction worker would have a routine with different activities and settings.

APPENDIXES
APPENDIX A CLASSIFICATION RECALL FROM DATASET BASELINE
Tables 6 and 7 show the classification recall scores for the ADLEgoDataset baseline from Section IV. These tables reflect the results obtained measuring the macro metrics, i.e. the best performance for the seen users test split is obtained by the CNN+BLSTM method, whereas for the unseen users test split it is obtained by the CNN+LSTM method. Additionally, both tables show that the best training strategy is the sliding window.

APPENDIX B CLASSIFICATION RECALL FROM GENERALIZATION WITHOUT DOMAIN ADAPTATION EXPERIMENT
Table 8 shows the classification recall scores for the generalization experiments without domain adaptation from Section V. Although the overall metrics diminished, some categories benefited. The improved categories for the NTCIR-12 dataset were cooking, drinking, informal meeting, and shopping. In the case of Castro's dataset, only two categories improved their accuracy: dishwashing and going to a bar. The latter category was not present in any of the target images. This table also shows that the overall performance of adding Castro's dataset was better than that of adding the NTCIR-12 dataset.