Cardiorespiratory Parameters Monitoring Through a Single Digital Camera in Real Scenarios: ROI Tracking and Motion Influence

Monitoring of heart rate (HR) and respiratory rate ($f_R$) is fundamental to assessing the health status of an individual. To this end, technologies that frame the upper body and the face of a subject without any physical contact can be used. Motion artifacts can affect both the applicability of non-contact methods to the continuous monitoring of these parameters and the computational burden. This article focuses on a technique based on images captured with a single digital camera for the continuous estimation of HR and $f_R$. The main goal is to analyze how the velocity of facial movements and the region of interest (ROI) tracking duration influence the performance of the method. Tests were performed on healthy volunteers during motionless trials (i.e., at rest and after exercise), head and torso movements, and physical exercise. Results demonstrated that a continuous estimation of HR and $f_R$ can be performed with acceptable errors under changing ROI tracking duration and velocity in motionless trials (mean absolute error (MAE) below 5 bpm and 3.42 breaths·min⁻¹ for HR and $f_R$, respectively), whereas during movements (mimicking head and torso movements, and during exercise) the error increases (MAE up to 5.42 bpm and up to 5.03 breaths·min⁻¹ for HR and $f_R$, respectively). The proposed investigation can provide a framework for the continuous estimation of HR and $f_R$ during both static and dynamic activities by optimizing the ROI tracking duration under different velocities of facial movements.

I. INTRODUCTION
Monitoring of heart rate (HR) and respiratory rate ($f_R$) is fundamental to assessing the health status of an individual. These physiological parameters are included in the Early Warning Scores (EWS), which are indicators used to determine the severity of a patient's physiological deterioration in the hospital [1]. HR and $f_R$ are the primary physiological signs to be monitored, since their abnormal values can be an index of important diseases (e.g., cardiac arrest and cardiovascular diseases) [2], [3]. Moreover, the interaction between $f_R$ and HR, also known as cardiorespiratory coupling (CRC), plays an important role in the assessment of sleep-related disorders [4].
Commonly, contact-based systems, which require direct contact of the sensor with the subject's skin, are employed to measure HR and $f_R$ [5]. In this context, wearable technologies based on different kinds of sensors have been widely used [6], [7]. However, contact-based technologies present some drawbacks related to discomfort for the subject, ease of loss of contact, skin irritation in patients with fragile skin, and the need for dedicated instrumentation [8]. Thus, in recent years, non-contact technologies (e.g., radar and depth cameras) have gained considerable interest due to their ease of use and the absence of discomfort for the subject during monitoring [9], [10], [11], [12], [13]. Among them, digital cameras (e.g., smartphone and laptop built-in cameras) present several advantageous features such as low cost, availability, portability, and ease of use [14], [15], [16]. Commonly, RGB sensors can be employed to detect the remote photoplethysmographic (r-PPG) signal associated with volumetric changes of blood in facial capillaries [17]. Both HR and $f_R$ can be estimated from the r-PPG signal, as the respiratory activity modulates the cardiac activity [18]. When the r-PPG technique is used to retrieve information about the pulsatile activity through a video of the subject's face, different factors must be considered, such as the lighting source [19], the user-camera distance [20], the resolution of the device used to acquire the video [21], the region of interest (ROI) [22], [23], and motion artifacts [24], [25], [26].
Head rotations and whole-body movements must be detected and removed to improve the accuracy of vital-sign monitoring and, possibly, to extend the applicability of video-based systems outside constrained environments (e.g., during sports activities and in real scenarios). In recent years, some studies have proposed different techniques to detect movements and make contactless systems robust, mainly for HR estimation [27], [28], [29], [30]. Only a few studies have investigated the influence of respiratory-unrelated movements on the estimation of $f_R$ [31], [32]. To the best of our knowledge, most of the studies available in the literature detect body movements and try to filter, compensate, or remove them by implementing complex signal processing techniques (e.g., adaptive filtering) or deep learning methods [28], [33], [34]. Another approach is to use the motion information obtained from body or face tracking to filter or compensate for motion [35]. For example, Guo et al. [36] used a near-infrared time-of-flight camera to detect cardiac- and respiratory-unrelated movements and compensated for the intensity variation caused by motion artifacts using depth information. Another study applied adaptive noise cancellation in combination with a modified HSI (hue, saturation, intensity) model to remove motion artifacts and reduce the effects of irregular intensity induced by head movements [33]. However, to the best of our knowledge, none of the studies available in the literature have attempted to quantify the amplitude of body movements and compensate for them accordingly, except for the recent work of Wu et al. [30], in which a motion level was computed to identify the magnitude of simulated movements (i.e., stationary, small, and full motion). Strictly related to the detection of motion artifacts is the tracking of the ROIs identified to extract the pulsatile and respiratory patterns [25], [37]. However, an optimization analysis investigating how often ROI tracking should be carried out has not been performed. Additionally, the simultaneous and reliable estimation of HR and $f_R$ is uncommon, mainly because the r-PPG signal alone does not provide a robust measure of $f_R$. Moreover, most studies focus on the average estimation of HR and $f_R$ [38], [39], which can discourage the application of non-contact technologies when abrupt increases or decreases in HR and $f_R$ are expected (e.g., in the sports field) or when more data are needed for short-term recordings.
To tackle these drawbacks, in this article we analyze how facial movements and ROI tracking duration influence the performance of a method based on images captured with a single digital camera for the estimation of HR and $f_R$. We aim to provide a framework for the continuous estimation of these two parameters during both static and dynamic activities by optimizing the ROI tracking duration under different velocities of facial movements. The contributions of this article are the following: 1) a ground-truth-independent algorithm for the continuous estimation of both HR and $f_R$ with an update time of 1 s, even in the presence of body motion disturbances; 2) a quantitative analysis of cardiac- and respiratory-unrelated movements based on the velocity of facial movements; and 3) an analysis of the ROI tracking duration aimed at optimizing it under different velocities of facial movements. The article is structured as follows. Section II describes the submodules of the proposed framework. Section III describes the experiments performed on healthy volunteers during motionless trials (i.e., at rest and after exercise), head and torso movements, and physical exercise to assess the performance of the proposed framework, together with the evaluation metrics. Section IV reports the obtained results, which are discussed in Section V, followed by the conclusion in Section VI.

II. METHODS
The architecture of the proposed pipeline for the continuous estimation of $f_R$ and HR under different motions consists of the following submodules: 1) video pre-processing; 2) signal extraction; 3) signal analysis; and 4) estimation of $f_R$ and HR. All the steps to estimate $f_R$ and HR are described in Sections II-A to II-C.

A. Video Pre-Processing
A video of the face and torso regions was recorded with a digital camera. The facial region was detected in the video with the pre-trained cascade classifier from the OpenCV library [40] in a Python environment. Sixty-eight facial landmarks, which map the facial points on the subject's face, were detected and tracked through the dlib detector [41].
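For illustration, a minimal Python sketch of this detection step could look as follows. It assumes OpenCV's bundled frontal-face Haar cascade and dlib's separately distributed 68-point model file (shape_predictor_68_face_landmarks.dat); the helper name detect_landmarks is ours:

```python
import cv2
import dlib

# Haar cascade shipped with OpenCV for frontal-face detection
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# dlib 68-landmark shape predictor (model file downloaded separately)
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(frame_bgr):
    """Return the 68 facial landmarks of the first detected face, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    shape = predictor(gray, dlib.rectangle(x, y, x + w, y + h))
    return [(p.x, p.y) for p in shape.parts()]
```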
In this work, three rectangular ROIs on the face (i.e., right cheek, forehead, and left cheek, since these facial regions are well vascularized [42]) and one on the torso at the level of the jugular fossa were identified to extract the raw r-PPG and the respiratory signals, respectively (see Fig. 1). Specific landmarks automatically identified on the face were used to define geometric rules to construct the ROIs. The tracking of all the identified ROIs was carried out under changing ROI tracking duration to evaluate its influence on the performance of the proposed framework in the estimation of cardiorespiratory parameters and on the computational burden during video processing in terms of CPU running time. Five modalities of video analysis were implemented, as sketched below: 1) each frame; 2) every 30 frames (1 s); 3) every 150 frames (5 s); 4) every 300 frames (10 s); and 5) every 600 frames (20 s). For $f_R$ estimation, an additional analysis was performed by using a fixed ROI identified in the first frame of the video [16], [43], [44]. The raw RGB signals extracted from the ROIs on the face were used to retrieve the pulsatile-related signal, whereas the respiratory signal was extracted through a method based on the computation of the optical flow (OF) [44]. Contextually, the displacement of the nose landmark was retrieved from the video, and the velocity of this landmark was then calculated. This value was used to define the threshold on head movements.
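A minimal sketch of how these tracking modalities could be scheduled is given below. It assumes a 30-fps video, the hypothetical detect_landmarks helper from the previous sketch, and a user-supplied build_rois function implementing the geometric rules, none of which are specified in this form in the article:

```python
# Re-detect and rebuild the ROIs only every `interval` frames, reusing the
# last ROIs otherwise. interval = 1 (each frame), 30 (1 s), 150 (5 s),
# 300 (10 s), or 600 (20 s), matching the five modalities of video analysis.
def track_rois(frames, interval, build_rois):
    rois_per_frame, rois = [], None
    for i, frame in enumerate(frames):
        if rois is None or i % interval == 0:
            landmarks = detect_landmarks(frame)   # hypothetical helper above
            if landmarks is not None:
                rois = build_rois(landmarks)      # geometric rules -> ROI boxes
        rois_per_frame.append(rois)
    return rois_per_frame
```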

B. Signal Extraction
The raw r-PPG signals were obtained from each ROI on the face by spatially averaging the intensity of the pixels in the three color channels R, G, and B, corresponding to the red, green, and blue channels, according to the following:

$$\bar{I}_c = \frac{1}{x_{ROI} \cdot y_{ROI}} \sum_{x=1}^{x_{ROI}} \sum_{y=1}^{y_{ROI}} I_c(x, y) \quad (1)$$

where $x_{ROI}$ and $y_{ROI}$ represent the number of pixels in the selected ROI along the x-axis and the y-axis (see Fig. 1), c is the color channel, and $I_c(x, y)$ is the intensity component of each channel. The respiratory signal was extracted through a method based on the computation of OF, which allows the estimation of the displacement between two consecutive images by tracking image features on a pixel-by-pixel basis [44], using the Horn and Schunck (HS) algorithm [45]. A velocity vector for each pixel in the image was obtained, and the component along the y-axis was assumed to be the most related to movements of the ribcage caused by breathing [44].
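The following sketch illustrates both extraction steps under stated assumptions: the spatial averaging of (1) on a BGR frame as returned by OpenCV, and the mean vertical OF component over the torso ROI. Since OpenCV does not implement the HS algorithm used in this work, Farneback dense OF is used here as a stand-in:

```python
import cv2
import numpy as np

def raw_rppg_sample(frame_bgr, roi):
    """Spatially average pixel intensities per color channel inside the ROI,
    as in (1). The ROI is given as (x0, y0, x1, y1) in pixel coordinates."""
    x0, y0, x1, y1 = roi
    patch = frame_bgr[y0:y1, x0:x1].astype(np.float64)
    # OpenCV stores channels in B, G, R order; return them as R, G, B
    return patch[..., 2].mean(), patch[..., 1].mean(), patch[..., 0].mean()

def respiratory_sample(prev_gray, curr_gray, torso_roi):
    """Mean vertical optical-flow component in the torso ROI between two
    consecutive grayscale frames (Farneback stand-in for Horn and Schunck)."""
    x0, y0, x1, y1 = torso_roi
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow[y0:y1, x0:x1, 1].mean()  # y-component ~ ribcage motion
```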

C. Signal Analysis
The analysis of the raw r-PPG and respiratory signals was carried out in a MATLAB environment through a windowing analysis. A window length of 20 s, moving every 1 s, was chosen as it represents a good compromise between resolution and noise robustness [16], [46]. From the same analysis, the $f_R$ and HR traces were obtained. Fig. 2 reports the flowchart illustrating the steps of the respiratory and r-PPG signal analysis when considering the threshold on head movement, which is computed from the velocity of the nose landmark.
1) Respiratory Signal Analysis: In each identified window, the raw respiratory signals computed with OF were first filtered with a Butterworth bandpass filter between 0.1 and 0.8 Hz (equivalent to the physiological range of $f_R$, 6-48 breaths·min⁻¹). Then, an analysis in the frequency domain was carried out by computing the power spectral density (PSD) of the signals. In each window, $f_R$ was estimated as the frequency at which the PSD reaches its maximum, multiplied by 60 to obtain a value in breaths·min⁻¹. These steps were carried out for each modality of video analysis and when considering the threshold on the velocity.
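A minimal sketch of this step is given below. It assumes a 30-fps sampling rate, a fourth-order Butterworth filter (the order is not specified in this work), and Welch's method as the PSD estimator:

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch

FS = 30.0  # video sampling rate (fps)

def estimate_fr(resp_window):
    """Estimate f_R (breaths/min) from one 20-s respiratory window."""
    # Bandpass 0.1-0.8 Hz, i.e., 6-48 breaths/min (filter order assumed)
    b, a = butter(4, [0.1, 0.8], btype="bandpass", fs=FS)
    filtered = filtfilt(b, a, resp_window)
    freqs, psd = welch(filtered, fs=FS, nperseg=len(filtered))
    return 60.0 * freqs[np.argmax(psd)]  # dominant frequency -> breaths/min
```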
2) r-PPG Signal Analysis: In each window, all the raw RGB signals were filtered through a Butterworth bandpass filter between 0.7 and 3 Hz (equivalent to the physiological range of HR, 42-180 bpm). We implemented five well-established r-PPG algorithms that can be used when performing tests with digital cameras [47]: 1) green channel (GC); 2) independent component analysis (ICA); 3) principal component analysis (PCA); 4) chrominance-based signal processing (CHROM); and 5) plane-orthogonal-to-skin (POS) [48]. This analysis was carried out for all the raw r-PPG signals extracted from the three identified ROIs, with ROI tracking in each frame of the video. After identifying the best algorithm to retrieve the r-PPG signal, this algorithm was used for HR estimation for each implemented modality of analysis (i.e., each frame, every 1, 5, 10, and 20 s) and when considering the threshold on the velocity of head movements, according to the flowchart in Fig. 2. HR estimation was performed through an analysis in the frequency domain by computing the PSD of the signals. After normalizing the PSD against its maximum peak, HR was calculated as the average of the frequencies at which the peaks of the PSD above a threshold value of 0.8 occur, weighted by the values of those peaks, according to the following:

$$\hat{f}_{HR} = \frac{\sum_{i=1}^{N} P_i f_i}{\sum_{i=1}^{N} P_i} \quad (2)$$

where N is the number of peaks in the PSD above 0.8, $P_i$ is the value of the i-th peak of the normalized PSD (in the range (0.8, 1]), and $f_i$ is the frequency at which $P_i$ occurs. The estimated values were then multiplied by 60 to obtain HR in bpm. The HR values obtained for each identified ROI (i.e., left cheek, right cheek, and forehead) were numerically averaged, and the mean value of HR was used for further statistical analysis [16].
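The following sketch illustrates the HR estimation of (2) under the same assumptions as the previous sketch (fourth-order Butterworth filter, Welch PSD), with an added fallback to the dominant frequency when no interior peak exceeds the 0.8 threshold:

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch, find_peaks

def estimate_hr(rppg_window, fs=30.0, threshold=0.8):
    """Estimate HR (bpm) from one 20-s r-PPG window via Eq. (2)."""
    # Bandpass 0.7-3 Hz, i.e., 42-180 bpm (filter order assumed)
    b, a = butter(4, [0.7, 3.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, rppg_window)
    freqs, psd = welch(filtered, fs=fs, nperseg=len(filtered))
    psd = psd / psd.max()                   # normalize against the maximum peak
    peaks, _ = find_peaks(psd)
    strong = peaks[psd[peaks] > threshold]  # peaks with normalized PSD > 0.8
    if len(strong) == 0:                    # fallback: global maximum (assumed)
        return 60.0 * freqs[np.argmax(psd)]
    # Frequencies weighted by their peak values, then converted to bpm
    return 60.0 * np.average(freqs[strong], weights=psd[strong])
```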
3) Motion Threshold: To evaluate the performance of the proposed framework in the estimation of cardiorespiratory parameters under different head and torso movements, motion thresholds were identified by considering the velocity of the nose landmark (see Section II-C). As reported in Fig. 2, the nose displacement was obtained in all the frames of the video and the velocity (v) was computed as v = s/t, where s is the nose displacement in pixels and t is the time. An analysis under gradually changing velocity thresholds, from 5 to 400 pixels/s with a step of 5 pixels/s, was carried out (the corresponding results are not reported). To summarize the results, we considered seven threshold velocities: 50, 60, 100, 150, 200, 300, and 400 pixels/s, equivalent to approximately 2, 2.4, 4, 6, 8, 12, and 16 cm/s, respectively. These values were obtained considering the region covered by the face in the whole image, which is about 10% in all the recorded videos. The maximum velocity value was calculated in each window and compared with the threshold to decide whether to execute the next steps, as summarized in Fig. 2.
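A minimal sketch of the window-level check could look as follows, assuming the nose landmark positions are available as an N × 2 array of pixel coordinates (one row per frame) and that the displacement s is measured as the Euclidean distance between consecutive positions:

```python
import numpy as np

FS = 30.0  # video sampling rate (fps)

def window_exceeds_threshold(nose_xy, v_thresh=400.0):
    """Check whether the peak nose-landmark speed in one window exceeds
    v_thresh (pixels/s). nose_xy: (N, 2) pixel coordinates, one row/frame."""
    displacement = np.linalg.norm(np.diff(nose_xy, axis=0), axis=1)  # px/frame
    velocity = displacement * FS                                     # px/s
    return velocity.max() > v_thresh
```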

III. EXPERIMENTAL TRIALS
To test the performance of the proposed framework in the continuous estimation of $f_R$ and HR, simulating real conditions with motion artifacts (e.g., head rotations and torso movements) and during physical activity, we carried out experiments on healthy volunteers. The study was carried out according to the Declaration of Helsinki and in compliance with the Ethical Approval received from Università Campus Bio-Medico di Roma (ST-UCBM 14/22 OSS).

A. Dataset in Laboratory Environment
This article aims at exploring the use of digital cameras in the presence of motion, focusing on the analysis of how facial movements and ROI tracking duration influence the performance of the proposed framework in the continuous estimation of cardiorespiratory parameters. Thus, we first created a dataset using a digital camera setup in a laboratory environment consisting of the following. 1) Camera: A CANON EOS 1100D camera (Canon, USA) was used to record two videos of the face and torso of the subjects, with a resolution of 1280 × 720 pixels and an acquisition frequency of 30 frames per second (fps). The camera was placed at about 0.5 m from the subject. The face of the subject covers ∼10% of the whole image, with a size of 257 × 361 pixels. 2) Reference Instrument: The multiparametric wearable device Zephyr BioHarness v3 (Medtronic) was used to record the electrocardiographic (ECG) and respiratory waveforms. The device consists of a thoracic belt and an electronic module. It acquires the user's respiratory pattern by detecting the volumetric changes of the thorax through a strain gauge, and the ECG waveform via dry electrodes. The reference respiratory and ECG signals were collected at 25 and 250 Hz, respectively. 3) Subjects: A total of 20 healthy subjects (ten males and ten females, aged between 22 and 31, Fitzpatrick skin phototype between II and III) were enrolled for the recordings. Each volunteer took part in two video recordings. During the first video, the subject was asked to perform: 1) ∼120 s of quiet breathing (QB); 2) ∼90 s of tachypnea; 3) ∼90 s of small left and right head rotations along the craniocaudal axis during QB; and 4) ∼90 s of small left and right body oscillations. The total video recording lasted around 8 min. During the second test, the subjects were required to perform ∼110 s of QB after physical exercise (i.e., running with high knees), for approximately 2 min of video recording.
An apnea stage of about 15 s was performed between consecutive tasks to allow their later separation. The experimental trials were guided via a graphical interface running on a tablet located in front of the subject. In addition, subjects were asked not to wear glasses, and female subjects not to wear make-up.

B. Dataset in Gym Environment
A dataset was collected with the CANON EOS 1100D camera in a gym environment to investigate the use of digital cameras for the continuous estimation of cardiorespiratory parameters during physical activity. The digital camera (Canon, USA) was used to record one video of the face and torso of the subjects during physical activity (resolution of 1280 × 720 pixels, at 30 fps). It was placed on a tripod at about 0.5 m from the subject, and the zoom was adjusted so that the region covered by the face was about 10% of the whole image. The multiparametric wearable device Zephyr BioHarness v3 was used as the reference system. A total of nine healthy volunteers (six females and three males, aged between 25 and 33, Fitzpatrick skin phototype between II and III) were enrolled for the experiments. During the tests, each subject was required to sit on a cycle ergometer (Electronic bike RHC-100, Air Machine, Cesena, Italy) and perform the following protocol: 1) initial apnea stage of ∼15 s; 2) ∼60 s of QB at rest; and 3) ∼300 s of cycling at a power of about 100 W. Each video recording lasted around 6 min.

C. Data Analysis
The videos collected both in the laboratory and in the gym environment were post-processed in a MATLAB environment according to the steps reported in Fig. 2. All the r-PPG and breathing signals were synchronized with the ECG and respiratory waveforms from the reference system, starting from the first minimum point after the apnea stage. An analysis in the frequency domain was carried out to estimate the values of $f_R$ and HR both for the reference and the video signals, as reported in Section II-C. HR values were estimated from signals retrieved with the CHROM algorithm. Occasionally, some estimated values of $f_R$ and HR were inconsistent with the neighboring estimates; these outliers were removed using the Hampel filter, which identifies and replaces outliers through a moving median window [49]. Fig. 3 reports the temporal trends of the estimated $f_R$ and HR values against the reference after outlier removal in the five trials performed in the laboratory environment for one subject. The first row reports the nose velocities (v) computed in each trial, whereas the second row shows the torso displacements along the x-axis (s) extracted through OF in each trial.
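For illustration, a minimal Python sketch of a Hampel filter is given below; the window half-width and the threshold of three scaled median absolute deviations (MADs) are common defaults, not values specified in this work:

```python
import numpy as np

def hampel(x, half_window=3, n_sigmas=3.0):
    """Replace outliers with the local median (moving-window Hampel filter)."""
    x = np.asarray(x, dtype=float).copy()
    k = 1.4826  # scale factor relating the MAD to the standard deviation
    for i in range(len(x)):
        lo, hi = max(0, i - half_window), min(len(x), i + half_window + 1)
        med = np.median(x[lo:hi])
        mad = k * np.median(np.abs(x[lo:hi] - med))
        if np.abs(x[i] - med) > n_sigmas * mad:
            x[i] = med  # sample deviates too much from the local median
    return x
```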

D. Evaluation Metrics
To evaluate the performance of $f_R$ and HR extraction, the mean absolute error (MAE) and the mean absolute percentage error (MAPE) were computed. In the first analysis of the cardiac signals, the MAE was used to evaluate the performance of each implemented r-PPG algorithm and to identify the most promising one. In addition, a Bland-Altman analysis was carried out to investigate the agreement between the implemented approaches and the reference values, both for $f_R$ and HR estimation, and to quantify the discrepancies.
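A minimal sketch of these metrics, with the limits of agreement computed as MOD ± 1.96 SD of the differences (the conventional Bland-Altman definition, assumed here), could look as follows:

```python
import numpy as np

def mae(est, ref):
    """Mean absolute error between estimated and reference values."""
    return np.mean(np.abs(np.asarray(est) - np.asarray(ref)))

def mape(est, ref):
    """Mean absolute percentage error (in %)."""
    est, ref = np.asarray(est), np.asarray(ref)
    return 100.0 * np.mean(np.abs((est - ref) / ref))

def bland_altman(est, ref):
    """Mean of differences (MOD) and 95% limits of agreement."""
    diff = np.asarray(est) - np.asarray(ref)
    mod, sd = diff.mean(), diff.std(ddof=1)
    return mod, mod - 1.96 * sd, mod + 1.96 * sd
```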

IV. RESULTS
This section presents the results obtained in the continuous estimation of $f_R$ and HR. We first report the results obtained in the laboratory environment for $f_R$ and HR, both when considering the threshold on head movement and the different modalities of video analysis and when neglecting them. Then, the results obtained in the gym environment are reported.
A. Dataset in Laboratory Environment

1) Respiratory Rate Estimation: When the threshold on the velocity was not considered and the modality of video analysis with fixed ROI was used, MAE values below 1 breath·min⁻¹ were obtained for the motionless trials (i.e., QB, tachypnea, and post-exercise), whereas MAEs of 2.56 and 5.71 breaths·min⁻¹ were obtained during head and torso movements, respectively. Fig. 4 shows the Bland-Altman plots for the $f_R$ values estimated through OF in the five trials using the fixed ROI. The dashed line represents the mean of differences (MOD), and the red lines represent the upper and lower limits of agreement (LOAs). A good agreement was achieved during QB, tachypnea, and post-exercise, with MOD ± LOAs of 0.04 ± 2.16 breaths·min⁻¹ (mean error of ∼11%), 0.24 ± 6.11 breaths·min⁻¹ (∼15%), and 0.05 ± 4.83 breaths·min⁻¹ (∼19%), respectively. The values of the LOAs increase when $f_R$ was estimated during head and torso movements (±8.92 and ±14.67 breaths·min⁻¹, with MODs of 0.34 and 2.50 breaths·min⁻¹, respectively). During head and torso movements, the values of MAE are generally slightly higher. However, an MAE < 5 breaths·min⁻¹ with the highest percentage of monitored windows (i.e., 88% during head movements and 63% during torso movements) was achieved at 400 pixels/s for all the implemented modalities of video analysis (see Table S1). Comparing the different modalities of video analysis, the fixed-ROI mode proved to be the best in the motionless trials, whereas during head and torso movements Mode 10 s was the most suitable, with an MAE below 4.09 breaths·min⁻¹ at all the considered velocity thresholds (see Table S2 for the Bland-Altman analysis).
2) HR Estimation: The performance of each implemented r-PPG algorithm in the estimation of HR in the different trials was investigated by analyzing the MAE values. According to the values reported in Table S3, the CHROM algorithm proved to be the best method to retrieve the r-PPG signal for the estimation of HR in all the performed trials (i.e., QB, tachypnea, head movement, torso movement, and post-exercise), with an MAE < 5 bpm. As a result, the CHROM algorithm was used for the further analyses, including the estimation of HR values considering the threshold on the velocity of the nose landmark and the analysis of the ROI tracking on the subject's face.
To evaluate the performance of the proposed framework under changing velocity thresholds and modalities of video analysis, we computed the MAE and the MAPE for all five trials (Table II). MAE values below 4.26 bpm and MAPE < 3.89% were achieved in the motionless trials (i.e., QB, tachypnea, and post-exercise) for all the tested modalities of analysis and all the velocity thresholds. As for $f_R$ estimation, only the windows respecting the thresholds were used to calculate the MAE. The percentages of windows used for HR estimation in each condition are reported in Table S4 for all the implemented modalities of video analysis. When considering Mode 1, an MAE of 3.17 bpm was obtained at 100 pixels/s during head movement, but only 6% of all the windows were used for the estimation. During torso movement, an MAE of 1.22 bpm was achieved at 400 pixels/s for Mode 1, with 62% of the total windows used to estimate HR.
Comparing the different modalities of video analysis (i.e., increasing the ROI tracking duration), there is a slight increase of the MAE values for all the velocity thresholds in the motionless trials (e.g., during QB, MAE values range from 0.94 to 1.10 bpm at 50 pixels/s), whereas during head and torso movements the MAE increases considerably under changing modality of video analysis (e.g., during torso movement, the MAE ranges from 1.22 to 14.59 bpm at 400 pixels/s). Mode 1 proved to be the best modality of video analysis both for the motionless trials and for the trials with head and torso movements. However, when analyzing the videos with Mode 1, the running time required to perform the ROI tracking is 65 min. Increasing the ROI tracking duration, the running time decreases to 20 min (i.e., Mode 5 s). Fig. 5 shows the Bland-Altman plots for the HR values estimated in the five trials when using Mode 1 for the video analysis with a velocity threshold of 400 pixels/s; each color corresponds to a single subject. Good agreement was achieved during QB, tachypnea, and torso movement, with MOD ± LOAs of -0.48 ± 4.33 bpm (mean error of ∼6%), 0.16 ± 7.17 bpm (∼8%), and -0.51 ± 3.92 bpm (∼5%), respectively. Higher values of the LOAs were achieved during head movement and post-exercise (±15.45 bpm, ∼20%, and ±10.49 bpm, ∼10%, respectively). In accordance with the MAE values, the LOAs increase when changing the modality of video analysis used to track the identified ROIs (see Table S5).

B. Dataset in Gym Environment
Based on the promising results obtained in the laboratory environment, experiments were carried out in a gym during physical activity. Assuming that no considerable motions occur during physical activity on a cycle ergometer, here we report the results obtained in the estimation of $f_R$ with the fixed-ROI modality and of HR with Mode 1 under changing velocity thresholds. MAE values were computed for each velocity threshold during QB and physical activity for the two parameters, as reported in Table III. Regarding $f_R$ estimation, an MAE below 3 breaths·min⁻¹ was achieved with a threshold of 50 pixels/s, whereas for HR estimation an MAE below 4 bpm was obtained at 50 pixels/s. However, during physical activity, only 42% and 38% of the windows were used for the computation of $f_R$ and HR, respectively. Increasing the velocity threshold up to 400 pixels/s, the number of windows used for the estimation of the vital signs increases (i.e., 98% and 97% for $f_R$ and HR, respectively), as well as the MAE. A Bland-Altman analysis was performed considering the values of $f_R$ and HR estimated with a threshold velocity of 400 pixels/s (Fig. 6). For $f_R$ estimation, a good agreement was achieved during QB at rest before the physical activity, with LOAs of ±5.99 breaths·min⁻¹. During the physical activity, a MOD ± LOAs of -4.10 ± 13.80 breaths·min⁻¹ was obtained. Regarding HR estimation, LOAs of ±11.45 and ±17.16 bpm were obtained during QB at rest before the physical activity and during physical activity, respectively, with an error of approximately 15% in both cases considering the whole HR range.

V. DISCUSSION
In this article, we focused on the analysis of how facial movements and ROI tracking duration influence the performance of a method based on images captured with a single digital camera for the estimation of HR and $f_R$. The proposed framework allows the continuous estimation of these two parameters both in simulated and real scenarios (i.e., during physical exercise in a gym).

A. Dataset in Laboratory Environment
The results obtained on the dataset in the laboratory environment show that a velocity threshold of 50 pixels/s allows achieving low MAE values both for $f_R$ and HR estimation in all the motionless trials, and fully continuous monitoring can be performed, since all the windows were used to estimate the two vital signs.
For head and torso movements, it is difficult to find the right compromise between the velocity threshold to be set, the number of windows used to perform $f_R$ and HR estimation, and low MAE values. Setting a velocity threshold of 400 pixels/s allows continuous monitoring of both $f_R$ and HR during head and torso movements, using a high number of windows for the estimation of the vital signs.
Focusing on $f_R$ estimation, we obtained comparable results in the motionless trials both when considering the velocity thresholds and when neglecting them. An MAE < 2 breaths·min⁻¹ was achieved, and values of the LOAs comparable with those obtained in [32] (LOAs of ±1.04 breaths·min⁻¹) were obtained in the motionless trials when considering Mode 10 s for the video analysis. During head and torso movements, the results in terms of MAE and LOAs change with the velocity threshold and with the number of windows used to estimate $f_R$. The best results in terms of low MAE and LOAs were achieved at 50 pixels/s. The MOD ± LOAs of 0.25 ± 2.18 breaths·min⁻¹ obtained at 50 pixels/s (Mode 10 s) during head movement is comparable with that reported in [32] (MOD ± LOAs of 0.18 ± 2.45 breaths·min⁻¹) during tests in a non-stationary scenario.
Regarding HR estimation, we obtained comparable MAE values in all the tested modalities of video analysis, both when considering the velocity thresholds and when neglecting them. The MAE is between 0.94 and 4.30 bpm for the motionless trials for each modality of video analysis when the velocity threshold was not considered. These values are comparable with those obtained in [29] (MAE of 6.13 bpm). When head and torso movements were performed, the MAE increased, ranging from 4.26 to 10.88 bpm for head movement and from 3.74 to 15.22 bpm for torso movement with no threshold on the velocity (see Table S6). Comparing the different modalities of video analysis under changing velocity thresholds, Mode 10 s and Mode 1 proved to be the most suitable for $f_R$ and HR estimation in all five trials, respectively. However, when performing HR monitoring in the absence of motion or after physical exercise, the video can be analyzed with all the tested modes of analysis with an MAE < 4.26 bpm (see Table II), which is in accordance with the threshold defined in the ANSI/AAMI EC13:2002 standard. This result suggests that the ROI tracking can be performed every 20 s, obtaining a continuous estimation of HR for each velocity threshold in the motionless trials with low errors while reducing the computational burden. However, in the case of HR monitoring in the presence of head and/or torso movements, ROI tracking should be executed in every frame, as also reported in other studies [37], [50]. The results obtained under changing velocity thresholds during head and torso movements in terms of MAE and LOAs are comparable with other studies, in which an average value of HR was estimated during lateral movements, head movements, and/or natural movements such as head rotation, blinking, and speaking (LOAs of ±3.07 bpm in [51], MAE of 5.45 bpm in [31], LOAs of ±2.45 bpm in [32]). However, unlike the reported studies, our work provides continuous monitoring of HR even in dynamic trials (LOAs of ±3.92 bpm during torso movements and ±15.45 bpm during head movements) when setting a velocity threshold of 400 pixels/s.

B. Dataset in Gym Environment
To the best of our knowledge, few studies concern the estimation of $f_R$ during physical activity; thus, it is difficult to compare with other research. Regarding HR estimation, however, various research articles explore the possibility of estimating HR during physical activity [52], [53], [54]. In terms of MAE, our results in the estimation of HR are in line with those obtained in [54], in which an MAE of ∼2 bpm was achieved in the estimation of the average HR during 3 min of physical activity. However, to the best of our knowledge, none of the studies available in the literature has performed a quantitative analysis of cardiac- and respiratory-unrelated movements based on the velocity of the nose landmark during physical exercise to obtain continuous monitoring of cardiorespiratory parameters.

VI. CONCLUSION
In this article, promising results are obtained for the monitoring of cardiorespiratory parameters in both motionless and dynamic conditions and during physical activity. A single digital camera was used to record videos, and a novel approach that considers the velocity of the performed movements is presented, providing a framework for the continuous estimation of $f_R$ and HR during both static and dynamic activities by optimizing the ROI tracking duration under different velocities of facial movements. Results demonstrated that a continuous estimation of $f_R$ and HR can be performed by setting a velocity threshold of 400 pixels/s both in the laboratory and gym environments, with an MAE below 3.42 breaths·min⁻¹ for $f_R$ when performing ROI tracking between Mode 1 s and Mode 20 s, and an MAE < 5 bpm for HR estimation in all the tested modalities of video analysis for all the motionless trials. In the dynamic trials (i.e., head and torso movements), a continuous estimation of $f_R$ can be performed with an MAE < 5.03 breaths·min⁻¹ in all the tested modalities of video analysis, whereas during cycling on a cycle ergometer $f_R$ can be estimated continuously with an MAE below 4.57 breaths·min⁻¹ when using the fixed-ROI modality, under the assumption that there are no considerable motions. Continuous estimation of HR can be performed with an MAE below 6.10 bpm when using Mode 1 in all the dynamic trials, with a processing time of about 65 min. The other modalities of video analysis can be used to reduce the processing time to 20 min, but higher MAE values are achieved. One limitation of the study is the non-inclusion of subjects with different skin colors in the experimental trials, which will be further investigated. In addition, future investigations will be devoted to the improvement of the proposed motion-robust framework, performing tests on a larger population of different ages, including the elderly and children, and in other scenarios (e.g., during sports activities, in clinical scenarios, or in home monitoring).
ACKNOWLEDGMENT

This work was carried out in the framework of the project titled "Sviluppo di sistemi di misura e modelli per la stima senza contatto di parametri fisiologici," POR Lazio FSE 2014/2020 (CUP code n. F87C21000190009, Project ID 23562).