Face2PPG: An Unsupervised Pipeline for Blood Volume Pulse Extraction From Faces

Photoplethysmography (PPG) signals have become a key technology in many fields, such as medicine, well-being, and sports. Our work proposes a set of pipelines to extract remote PPG signals (rPPG) from the face in a robust, reliable, and configurable manner. We identify and evaluate the possible choices in the critical steps of unsupervised rPPG methodologies. We assess a state-of-the-art processing pipeline on six different datasets, incorporating important corrections in the methodology that ensure reproducible and fair comparisons. In addition, we extend the pipeline with three novel ideas: 1) a new method to stabilize the detected face based on a rigid mesh normalization; 2) a new method to dynamically select the facial regions that provide the best raw signals; and 3) a new RGB-to-rPPG transformation method, called Orthogonal Matrix Image Transformation (OMIT), based on QR decomposition, which increases robustness against compression artifacts. We show that all three changes introduce noticeable improvements in retrieving rPPG signals from faces, obtaining state-of-the-art results compared with unsupervised, non-learning-based methodologies and, on some databases, results very close to those of supervised, learning-based methods. We perform a comparative study to quantify the contribution of each proposed idea. In addition, we report a series of observations that could help in future implementations.


Introduction
Photoplethysmography (PPG) signals have become a key technology in many fields, such as medicine, well-being, and sports. The technology utilizes a light source and a photodetector to measure the blood volume pulse (BVP) as light variations in skin tissues [1]. In medicine, PPG analysis is a basic and common tool in healthcare services to monitor vital signs such as heart rate (HR) or oxygen saturation (SpO2) [2]. In well-being, it has become increasingly important thanks to the success of wearable devices that analyze sleep disorders [3], cardiovascular diseases [4], or the detection of stress and meditation [5]. In sports, PPG analysis has become an important tool to improve athletes' intrinsic and extrinsic performance [6].
Remote PPG (rPPG) imaging is a contactless version of this technology that uses video cameras, usually consumer-grade RGB or near-infrared cameras, and ambient light sources. It works by recording a subject's face or body parts with visible skin areas and analyzing the subtle color variations or motion changes in skin regions [7][8]. The remote PPG technique allows for noninvasive evaluation and monitoring of users in services such as healthcare. Hence, the technology could offer significant advantages compared to contact-based devices if it becomes reliable [9].
Current approaches for recovering physiological signals from videos are mainly unsupervised non-learning-based methods and supervised deep learning approaches [8]. Deep learning-based methods propose end-to-end solutions utilizing training datasets. In contrast, unsupervised non-learning-based (sometimes named "Traditional") methods employ computer vision and signal processing in a structured pipeline, as illustrated in Figure 1.
Most of the unsupervised rPPG methods proposed in the literature focus on recovering PPG signals from mostly static faces, disregarding the challenges of real-world applications, such as fast face and head movements, extreme light conditions, facial expressions, illumination changes, occlusions, or the distance from the camera to the subject.
This article focuses on improving the performance of state-of-the-art unsupervised, non-learning-based rPPG methods, emphasizing all system components. We tackle the improvement of the process by proposing a set of changes across the whole pipeline that result in a noticeable global improvement, and we perform an extensive evaluation with a framework to recover physiological signals from faces.

Contributions
This article describes different performance problems and challenges in recovering PPG signals from faces reliably. We improve several components of the rPPG pipeline with novel ideas. The main contributions can be summarized as follows:
• We provide a new method to stabilize the face under movement and facial expressions, based on a rigid mesh normalization, ensuring that the raw RGB signals are measured from the same facial location regardless of pose and movement.
• We provide a new method based on statistical and fractal analysis to dynamically select only the facial regions that supply the best raw signals, discarding those with higher noise or prone to artifacts.
• We propose a novel rPPG method to transform the RGB signal into a PPG signal based on QR decomposition, named Orthogonal Matrix Image Transformation (OMIT), which proves to be robust to video compression artifacts.
To prove the usefulness of our approach, we extensively evaluate a set of rPPG methods with four different pipelines across several datasets. Our experiments include modifications to the original evaluation pipeline to increase the fairness and reproducibility of the comparative results.

Related work
In the last few years, rPPG research has progressed from the filtering and simple processing of facial skin color variations to sophisticated multi-step processing pipelines and end-to-end supervised learning methods with dedicated architectures.

Unsupervised methods
Unsupervised non-learning-based methods focus on recovering physiological signals by applying computer vision and signal processing techniques as a system with several steps. These methods obtain the BVP signal by finding skin areas suitable for extracting the raw RGB signals, using face detection, tracking, and segmentation techniques. After that, they carefully process these raw RGB signals to separate the physiological signals contained in the subtle variations of the skin color from the rest of the information (motion, illumination changes, or facial expressions, among others) by applying filtering and different ways of combining the RGB signals into an rPPG signal. Most of the studies focus mainly on this transformation component, using similar approaches and components for the rest of the process [8].

[10] proposed the first study on extracting remote PPG signals using an inexpensive consumer-grade RGB camera. The study showed that the green channel of the camera contains rich enough information to recover signals such as the heart pulse. [11] proposed recovering physiological signals by applying the blind source separation (BSS) technique to remove noise. Concretely, they used Independent Component Analysis (ICA) to uncover the independent source signals. Similarly, [12] proposed Principal Component Analysis (PCA) to reduce the computational complexity compared with ICA at a similar accuracy. [13] proposed a chrominance-based method (CHROM), grounded in the dichromatic reflection model, to separate the specular reflection component from the diffuse reflection component, which contains the pulsatile physiological signals reflected from the skin. [14] defined a Blood-Volume Pulse (PBV) vector that contains the signature of the specific blood volume variations in the skin, removing noise and motion artifacts. In the same year, [15] focused on removing human motion and artifacts from the RGB signals by applying a Normalized Least Mean Square (NLMS) adaptive filter. They performed this rectification step by assuming that both the face ROI and the background follow Lambertian models sharing the same light source. [16] proposed a data-driven algorithm that creates a subspace of skin pixels and computes the temporal rotation angle of this subspace between subsequent frames to extract the heart rate pulse. [17] proposed the CIELab color space (LAB) transformation as a more robust color space to extract pulse rate signals, owing to the high separation between the intensity and chromaticity components, which is less sensitive to body movements. The study also demonstrated that the a channel has a better signal-to-noise ratio (SNR) than the green channel of the RGB color space. [18] proposed a plane-orthogonal-to-skin (POS) algorithm that finds pulsatile signals in an RGB normalized space orthogonal to the skin tone. [19] proposed the Local Group Invariance (LGI) method, a stochastic representation of the pulse signal based on a model that leverages the local invariance of the heart rate as a quasi-periodical process, obtained by recursive inference to remove extrinsic factors such as head motion and lighting variations. [20] proposed an unsupervised method with an emphasis on motion suppression and novel filtering based on head orientations (FaceRPPG). Recently, deep learning methods that do not rely on reference signals have been developed, which could be considered unsupervised [21] [22]. However, they remain specifically tailored to each dataset and its unique characteristics.
Unsupervised methods offer a significant advantage in that they do not necessitate specific training data, allowing for better generalization across different datasets and measurement setups. These methods focus on measuring the BVP signal as it manifests in various facial regions, ensuring adaptability to diverse scenarios. However, given the absence of a learning component, the performance of unsupervised methods can be influenced by several factors: sensitivity to noise and artifacts; dependency on user-defined parameters, which makes them less adaptable to diverse scenarios and subjects; and limited adaptability to varying skin types and lighting conditions, which can result in suboptimal rPPG signal extraction.
Efforts in unsupervised non-learning-based methods have focused on finding suitable ways of transforming noisy RGB signals into reliable PPG signals. In contrast, the impact of other system components, such as face detection and tracking, has been largely disregarded. Our contribution addresses the unsupervised rPPG process as a system with multiple components that can be improved separately.

Supervised methods
Deep learning (DL) and especially Convolutional Neural Network (CNN) approaches have gained attention and become popular tools in computer vision and signal processing tasks, including healthcare-related tasks. Before the advent of deep learning-based methods, there were two primary machine learning approaches to estimating heart rate: support-vector regression [23] and adaptive hidden Markov models [24]. Since 2018, supervised deep learning-based methods to compute HR or other vital signs have arisen increasingly in the literature. Among the most relevant learning-based remote PPG methods, [25] proposed HR-CNN, a two-step convolutional neural network to estimate a heart rate value from a sequence of facial images; it is an end-to-end network composed of two components, an Extractor and an HR Estimator. The same year, [26] proposed DeepPhys, another end-to-end solution based on a deep convolutional network that estimates HR and breathing rate (BR). The approach performs a motion analysis based on attention mechanisms and a skin reflection model, using appearance information to extract the physiological signals. [27] proposed RhythmNet, an end-to-end solution based on spatial-temporal mapping to represent the HR signals in videos. The approach also exploits the temporal relationships of adjacent HR estimations to perform continuous heart rate measurements. The same year, [28] proposed a two-stage end-to-end solution. The first part of the network, named STVEN, is a spatio-temporal video enhancement network that improves the quality of highly compressed videos. The second part, called rPPGNet, is a 3D-CNN network that recovers the rPPG signals from the enhanced videos. The authors claim that rPPGNet produces rich rPPG signals with faithful curve shapes and peak locations. [29] further proposed another end-to-end approach for remote HR measurement based on Neural Architecture Search (NAS). AutoHR comprises three ideas: a first stage that discovers the best network topology to extract the physiological signals, based on Temporal Difference Convolution (TDC); a hybrid loss function based on temporal and frequency constraints; and spatio-temporal data augmentation strategies to improve the learning stage. The same year, [30] proposed a transductive meta-learner based on an LSTM estimator and a Synthetic Gradient Generator that adjusts network weights in a self-supervised manner. Recently, [31] proposed a two-stage hybrid method called PulseGAN. It starts with the unsupervised extraction of noisy PPG signals using CHROM, followed by a generative adversarial network (GAN) that generates realistic rPPG pulse signals from the signals recovered in the first stage. In 2021, [32] proposed a novel denoising rPPG method called AND-rPPG, based on Action Units (AUs) and Temporal Convolutional Networks (TCNs) for denoising temporal signals, effectively mitigating facial-expression noise.
Supervised rPPG methods, primarily leveraging deep learning, exhibit exceptional accuracy by employing end-to-end solutions that aim to extract physiological signals or heart rate values directly from video data. Apart from essential preprocessing, these methods necessitate minimal intermediate steps [33]. Through observing facial noise patterns, supervised methods attempt to learn reference contact-based PPG ground-truth signals from the finger, culminating in a black-box model that recovers physiological signals from video frames without a clear understanding of the underlying mechanisms. This lack of transparency presents significant challenges to the practical application of these signals in critical areas such as healthcare [34], as well as in scenarios where the available training data is limited and prone to anomalies (e.g., for cardiovascular conditions). Furthermore, the dependency of supervised methods on labeled training data raises issues of data acquisition and cost, especially in medical contexts where privacy and ethical concerns are of utmost importance. Supervised methods may also face difficulties in generalizing to new scenarios, subjects, or recording conditions if the training data does not encompass a wide array of variations. Moreover, supervised methods may need substantial computational resources.

Unsupervised blood volume pulse extraction methodology
Our work uses a standard methodology from unsupervised rPPG approaches to extract the blood volume changes from facial videos and derive essential parameters. We follow a modular pipeline with several components, as depicted in Figure 1. The pipeline is roughly divided into three big blocks: the selection of measuring regions on the face, the extraction of rPPG biosignals from natural variations in color or texture, and the computation of the heart rate or other parameters from the extracted signals.
The pipeline comprises eight main modules connected sequentially, as depicted in Figure 1.

Baseline pipeline
In our work, we start from an open-source framework for the evaluation of remote PPG methods, implemented in Python and called pyVHR (short for Python tool for Virtual Heart Rate) [35]. This framework includes an extensible interface to integrate several datasets, multiple methods, choices for each processing step, and extensive assessment and visualization tools. We use version 0.0.4 of the pyVHR framework, which we refer to as the Baseline pipeline from here onward.

Face2PPG pipelines
The Baseline pipeline presents a few shortcomings that might result in inaccurate or unfair assessments. We incrementally improve it by introducing several changes in multiple steps, and propose three new versions that we name the Improved, Normalized, and Multi-region pipelines. These pipelines focus on handling the face in unconstrained conditions, since this is one of the critical parts of extracting remote photoplethysmograms. It has to be noted that most unsupervised approaches have focused mainly on developing RGB-to-PPG conversion methods (sometimes simply called rPPG methods) but have not paid much attention to the other steps of the pipeline. This article emphasizes the importance of every step when extracting and evaluating remote PPG signals from faces. Moreover, we have devised a novel method, Orthogonal Matrix Image Transformation (OMIT), which employs QR decomposition to convert raw RGB signals into BVP signals.

Improved pipeline
To mitigate its shortcomings, we modified the Baseline pipeline to incorporate a few minor changes that increase reproducibility and enable a fairer comparison of different methods. We name this modified version the Improved pipeline. We enumerate and describe these changes as follows:

Face detection: The Baseline pipeline includes two well-known face detectors: one based on convolutional neural networks, known as MTCNN [36], and a Dlib implementation based on Histogram of Oriented Gradients (HOG) features [37]. We instead use a new deep learning-based face detection method based on a Single Shot Multibox Detection (SSD) network [38], implemented in the OpenCV library. This face detector outperforms the Baseline pipeline detectors in terms of accuracy, model size, and computational speed [39].
Face alignment: The Baseline pipeline includes two well-known facial landmark detectors: the MTCNN detector [36], which computes 5 landmark points on the face (eyes, nose, and mouth corners), and the Dlib implementation of the ERT method [40][37]. We instead use a deep learning approach named DAN (Deep Alignment Network) [41], which gives exceptional accuracy even in challenging conditions [42], as shown in Figure 2. For faster, real-time face alignment, we have added a more accurate, faster, and smoother model for the ERT Dlib facial landmark detector [42]. Both models infer the 68 landmark points defined by the Multi-PIE landmark scheme.
Filtering: The Baseline pipeline only considers a pre-filtering scheme before the RGB-to-PPG transformation. The pipeline offers three types of filters: detrending (SciPy or Tarvainen methods), bandpass filtering (an FIR filter with a Hamming window and a Butterworth IIR filter), and a moving average (MA) filter that removes various baseline noises and motion artifacts from the signals. We have added the possibility of using Kaiser windows when applying FIR filtering. A Kaiser-Bessel window maximizes the energy concentration in the main lobe and is highly recommended for filtering biosignals [43]. In addition, we have introduced the possibility of also applying post-filtering, performed after the RGB-to-PPG conversion, since the literature suggests that some conversion methods perform better this way [44].
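As an illustration of this filtering choice, the sketch below applies a Kaiser-window FIR bandpass with SciPy. The 0.75-4 Hz band and β = 25 match the configuration used in our experiments; the filter length and sampling rate are illustrative assumptions.

```python
# Sketch of the Kaiser-window FIR bandpass (assumed SciPy implementation).
import numpy as np
from scipy import signal

def bandpass_kaiser(x, fs=30.0, low=0.75, high=4.0, numtaps=81, beta=25.0):
    """Zero-phase FIR bandpass filtering of a raw rPPG trace."""
    taps = signal.firwin(numtaps, [low, high], pass_zero=False,
                         window=("kaiser", beta), fs=fs)
    # filtfilt runs the filter forward and backward, cancelling group delay;
    # the input must be a few times longer than the filter (e.g., a 10 s window).
    return signal.filtfilt(taps, [1.0], x)
```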
RGB-to-PPG transformation (rPPG): The Baseline pipeline includes several reference methods, such as POS [18], CHROM [13], GREEN [10], PCA [12], ICA [11], SSR [16], LGI [19], and PBV [14]. We have added one method based on selecting the chroma channel a after applying a CIELab color space transformation. CIELab separates the lightness information (channel L) from the chroma information (channels a and b). The chrominance components have a more significant dynamic range than the red, green, and blue channels of the RGB color space [17]. In addition, channel a correlates with skin color and related parameters and better describes the subtle changes occurring in them [45].
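For reference, a minimal sketch of this CIELab variant could look as follows, here using scikit-image for the color conversion (an implementation choice for illustration, not necessarily the one used in the framework):

```python
# Sketch: per-frame mean ROI colors -> chroma channel a of CIELab.
import numpy as np
from skimage.color import rgb2lab

def lab_a_signal(rgb_means):
    """rgb_means: (num_frames, 3) array of mean ROI colors scaled to [0, 1]."""
    lab = rgb2lab(rgb_means.reshape(-1, 1, 3)).reshape(-1, 3)
    return lab[:, 1]  # channel a (green-red chroma axis) as the pulse trace
```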
Spectral analysis: In the original framework, the ground-truth pulse rate is estimated using the Short-Time Fourier Transform (STFT) when the ground-truth signal is a BVP signal, and R-peak detection with RR-interval analysis when the ground-truth signal is an ECG signal. In the Baseline pipeline, the recovered PPG signals are instead processed using Welch's spectral density estimation. This mismatch introduces the possibility of unfair evaluation. We modified the pipeline so that the ground-truth BVP signal and the rPPG signal are processed using the same spectral analysis algorithm and similar parameters, such as the overlap or the FFT length.
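The harmonized spectral analysis can be summarized by the sketch below: the same Welch estimator is applied to both the reference BVP and the extracted rPPG signal, and the HR is read from the strongest in-band peak (the segment length is an illustrative assumption).

```python
# Sketch: HR estimation via Welch's power spectral density.
import numpy as np
from scipy import signal

def hr_from_welch(x, fs, fmin=0.75, fmax=4.0):
    f, pxx = signal.welch(x, fs=fs, nperseg=min(len(x), 256))
    band = (f >= fmin) & (f <= fmax)
    return 60.0 * f[band][np.argmax(pxx[band])]  # heart rate in bpm
```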
Evaluation: It can be expected that the reference BVP (PPG) signals taken at the finger and the rPPG signals extracted from the face are not perfectly synchronized and do not show the same dynamic range. We show an example in Figure 3. We can also observe asynchrony in the heart rate estimation, as shown in Figure 4. Technical and physiological factors can produce time shifts and morphological differences in the signals. These effects can be attributed to the distance between the optical sensors, the contact force of the finger PPG oximeter [46], different filtering parameters [47], individual variations among subjects [48], variability in the measurement site [49], and even blood perfusion differences between body regions [50], among others. We mitigate some of the effects caused by comparing fundamentally different signals by adding a new parameter to the dataset interface that aligns both signals in time and dynamic range, resulting in a fairer estimation of the error between the ground-truth HR estimation and the rPPG HR estimation. This parameter has been adjusted globally for each dataset using a mostly empirical approach, due to the lack of detailed information on the measurement setups. This approach ensures the consistency of signal alignment while maintaining the overall integrity of the data.
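One hypothetical way to implement such an alignment parameter is sketched below: a global lag is estimated by cross-correlating the standardized HR series and is then used to shift the rPPG estimates before computing the errors. This illustrates the idea rather than the exact per-dataset adjustment used in our experiments.

```python
# Sketch: estimate a global lag between reference and rPPG HR series.
import numpy as np

def alignment_lag(ref_hr, rppg_hr):
    ref = (ref_hr - ref_hr.mean()) / ref_hr.std()   # normalize dynamic range
    est = (rppg_hr - rppg_hr.mean()) / rppg_hr.std()
    lags = np.arange(-len(est) + 1, len(ref))
    return lags[np.argmax(np.correlate(ref, est, mode="full"))]
```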

Normalized pipeline
Our Normalized pipeline introduces two significant changes compared to the previous ones: a segmentation approach based on geometric normalization, and a novel RGB-to-PPG transformation method that is robust to compression artifacts.

Geometric segmentation and normalization
One of the critical steps in non-contact PPG extraction is skin segmentation, since the skin is the source from which the desired physiological signals are recovered. Most unsupervised methods and pipelines rely on simple thresholding of different color spaces for skin color segmentation [35], from inefficient fixed RGB segmentation to adaptive HSV segmentation. These pixel-level techniques generalize poorly due to the variability of skin tones, makeup, illumination changes, and complex backgrounds. It is not easy to define clear boundaries between skin and non-skin pixels, mainly due to the variability of the facial regions measured across the frames of a single video. Frame-wise skin segmentation based on neural networks (e.g., U-Net) suffers from similar problems, sometimes caused by the small number of annotated facial skin masks, resulting in underfitted models [51].
We propose a geometric segmentation scheme that uses fiducial landmark points detected on the face. Although some interframe jittering due to landmark variability remains and produces changes in the skin mask across video frames, it is noticeably lower than with skin color segmentation. To perform this segmentation, we extended the set of landmark points from 68 to 85 by interpolation and created a fixed facial mesh composed of 131 triangles, fixing their coordinates to a typical frontal face, as shown in Figure 5. Our segmentation approach normalizes the face in each frame by mapping every triangle of the currently detected face to the corresponding triangle of the normalized shape. This generates a spatio-temporal matrix of normalized faces, ensuring that we measure the signals in the same facial regions consistently across frames, regardless of pose and movement.
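A minimal sketch of this triangle-to-triangle mapping is shown below, assuming an OpenCV-based implementation (the helper name and shapes are hypothetical):

```python
# Sketch: warp one detected facial triangle onto its fixed normalized position.
import cv2
import numpy as np

def warp_triangle(src_img, dst_img, src_tri, dst_tri):
    """src_tri, dst_tri: (3, 2) float32 arrays of triangle vertices."""
    m = cv2.getAffineTransform(src_tri, dst_tri)
    warped = cv2.warpAffine(src_img, m,
                            (dst_img.shape[1], dst_img.shape[0]),
                            flags=cv2.INTER_LINEAR)
    # Copy only the pixels inside the destination triangle.
    mask = np.zeros(dst_img.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, dst_tri.astype(np.int32), 255)
    dst_img[mask > 0] = warped[mask > 0]
    return dst_img
```

Repeating this mapping for all 131 triangles of every frame yields the normalized face stack from which the signals are read.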

Multi-region pipeline
Extracting only one signal from the whole face or skin mask can result in a very noisy signal. However, the previously described skin segmentation produces a set of facial regions that can be analyzed separately. Due to partial occlusions, extreme head poses, illumination variations, or shadows, among others, some of these regions might present very noisy signals with low dynamic ranges. Previous methods have proposed selecting fixed patches on the face where, a priori, the blood perfusion should be more observable (e.g., the forehead and cheeks). This approach works relatively well when the videos show quasi-static individuals in fixed environments but fails in the presence of fast movements or strong face rotations. To mitigate the impact of these challenges, we modify the pipeline by introducing a dedicated block that automatically and dynamically selects the regions containing the highest-quality raw signals. We name the resulting framework the Multi-region pipeline.

Dynamic multi-region selection
We propose a novel Dynamic Multi-Region Selection (DMRS) method to dynamically select the best facial regions. This approach extracts signals from a fixed set of facial regions and statistically analyzes their quality to decide whether each of them should contribute to the final rPPG signal or be discarded.
The DMRS process starts right after a segmented and normalized face is obtained in the previous block of the pipeline, by dividing the normalized face into a matrix of n×n rectangles (regions of interest) that contains a spatio-temporal representation of the face, as depicted in Figure 6. Each area in the grid represents a signal over a sequence of frames. Next, the process computes several statistical parameters for each candidate region and for the global face. These parameters are computed over windows of t seconds, both in the time and frequency domains. We extract the mean, standard deviation, variance, signal-to-noise ratio (SNR), Katz fractal dimension (KFD), number of zero-crossings (Z_c), sample entropy, detrended fluctuation analysis (DFA), and the energy in terms of local power spectral density (PSD). The dynamic selection is loosely based on fractal analysis, which portrays the scale of randomness, or how unpredictable a stochastic process is [52].
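As a sketch, the grid extraction can be written as follows (array shapes are illustrative assumptions):

```python
# Sketch: cut the normalized face stack into an n x n grid of region signals.
import numpy as np

def grid_signals(norm_faces, n=9):
    """norm_faces: (T, H, W, 3) stack of normalized faces.
    Returns (n*n, T, 3): one raw mean-RGB trace per candidate region."""
    t, h, w, _ = norm_faces.shape
    hs, ws = h // n, w // n
    return np.stack([norm_faces[:, i*hs:(i+1)*hs, j*ws:(j+1)*ws, :].mean(axis=(1, 2))
                     for i in range(n) for j in range(n)])
```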
The first step is an initial pruning that removes the regions that do not contain valuable information. We check the variance of every candidate rectangle along the time window t and discard those with zero variance.
Then, regions are discarded based on thresholding Katz's Fractal Dimension (KFD). KFD computes the fractal dimension (FD) of a signal based on its morphology, measuring the degree of irregularity and the sharpness of the waveform [52]. KFD gives an index D_KFD that characterizes the complexity of the signal, computed as

D_KFD = log10(n) / (log10(n) + log10(d/L)),    (1)

where L is the total length of the PPG time series, a is the average of the Euclidean distances between successive points of the sample, d is the Euclidean distance between the first point in the series and the point at maximum distance from it, and n = L/a. We then calculate the relative D_KFD value by dividing the KFD index of a specific facial region by the global KFD index derived from the entire face:

D_KFD^rel = D_KFD(region) / D_KFD(face).    (2)

This relative value plays an important role in identifying the regions that contribute meaningful information to the final rPPG signal while addressing the inherent challenges posed by facial regions, such as occlusions, low-light conditions, blur, and other factors that can adversely affect both signal quality and signal complexity. By selecting regions with a relative D_KFD value of 0.85 or greater, we ensure the inclusion of regions that exhibit complexity levels comparable to the global PPG signal. This approach is based on the assumption that the primary source of complexity in the global rPPG signal stems from the heart, and regions with similar complexity levels are more likely to provide valuable information for rPPG analysis, despite the presence of factors that could potentially compromise signal quality. Through this selection process, we effectively discard regions that introduce noise and artifacts into the final rPPG signal. This method provides a good balance between discarding low-quality regions and retaining meaningful information.
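Equations (1) and (2) translate directly into code; the sketch below is a straightforward NumPy transcription (unit spacing between samples is assumed):

```python
# Sketch: Katz fractal dimension of a 1-D signal, Eq. (1), and its
# region-to-face ratio, Eq. (2).
import numpy as np

def katz_fd(x):
    dists = np.sqrt(1.0 + np.diff(x) ** 2)  # distances between successive points
    L = dists.sum()                          # total curve length
    a = dists.mean()                         # mean distance between points
    # d: max distance from the first point to any other point of the waveform
    d = np.max(np.sqrt(np.arange(1, len(x)) ** 2 + (x[1:] - x[0]) ** 2))
    n = L / a
    return np.log10(n) / (np.log10(n) + np.log10(d / L))

def relative_kfd(region_sig, face_sig):
    return katz_fd(region_sig) / katz_fd(face_sig)  # keep region if >= 0.85
```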
The analysis continues with Detrended Fluctuation Analysis (DFA), a statistical method widely used to detect intrinsic self-similarity in non-stationary time series, especially in fractal signals. DFA is a modified root-mean-square analysis of a random walk, designed to compute long- and short-range non-uniform correlations in stochastic processes [53]. The method tells us whether each region's rPPG signal shows the expected correlation with the global face signal, and whether it is very noisy or contains artifacts from extrinsic trends [54]. The DFA exponent α is interpreted as an estimation of the Hurst parameter and is calculated as the slope of a straight line fitted to the log-log graph of the fluctuation function. If α = 0.5, the time series is uncorrelated. If 0.5 < α < 1, there are positive correlations in the time series. If α < 0.5, the time series is anti-correlated. We discard regions whose signals are uncorrelated or anti-correlated.
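A compact DFA sketch is given below; the fitted slope estimates α, which is then thresholded as described above (the set of window sizes is an illustrative assumption):

```python
# Sketch: DFA exponent (alpha) of a 1-D signal with linear detrending.
import numpy as np

def dfa_alpha(x, scales=(4, 8, 16, 32, 64)):
    y = np.cumsum(x - np.mean(x))            # integrated profile of the signal
    flucts = []
    for s in scales:
        n_seg = len(y) // s
        segs = y[:n_seg * s].reshape(n_seg, s)
        t = np.arange(s)
        # RMS residual of a linear fit inside each segment
        rms = [np.sqrt(np.mean((seg - np.polyval(np.polyfit(t, seg, 1), t)) ** 2))
               for seg in segs]
        flucts.append(np.mean(rms))
    # alpha = slope of the log-log fluctuation curve
    return np.polyfit(np.log(scales), np.log(flucts), 1)[0]
```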
Upon completing the analysis of each facial region within a sequence of frames (a window of t seconds), we obtain a set of valid regions, with a minimum of 2 and a maximum of 32 regions, for the final selection step, which is based on energy content. In cases where more than r_max regions pass the previous analysis, our method selects the top r_max regions with the highest energy. These regions are expected to exhibit less noisy spectral responses within the relevant frequency range, potentially leading to an improved signal-to-noise ratio [55][56]. If the number of regions passing the previous analysis is less than r_max but greater than one, all these regions are selected. In contrast, when only one region or none satisfies the prior criteria, the method reverts to selecting the best r_max regions by energy content from the initial set of candidate regions. This strategy ensures that the resulting rPPG signal is derived from an optimal number of regions, maximizing signal quality across various scenarios. In the last step, the rPPG signals from the chosen regions are combined by summing them in the time domain, generating the final rPPG signal used for heart rate computation in the subsequent spectral analysis phase.
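The final selection and fusion step can be sketched as follows (a simplified rendering of the logic above, omitting the fallback cases):

```python
# Sketch: keep up to r_max surviving regions, ranked by in-band spectral
# energy, and sum their rPPG signals in the time domain.
import numpy as np
from scipy import signal

def combine_regions(region_rppg, fs, r_max=32, fmin=0.75, fmax=4.0):
    """region_rppg: (num_valid_regions, T) rPPG signals of the valid regions."""
    f, pxx = signal.welch(region_rppg, fs=fs,
                          nperseg=min(region_rppg.shape[1], 256))
    band = (f >= fmin) & (f <= fmax)
    energy = pxx[:, band].sum(axis=1)        # in-band PSD energy per region
    best = np.argsort(energy)[::-1][:r_max]  # top r_max regions
    return region_rppg[best].sum(axis=0)     # final rPPG signal
```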
During the experiments, we have comparatively evaluated the Face2PPG pipeline both in single-region mode (Face2PPG-Normalized) and in multi-region mode (Face2PPG-Multi).

Orthogonal Matrix Image Transformation
The RGB-to-PPG transformation is a critical step in remote photoplethysmography that allows for the extraction of physiological signals from skin color variations. However, the process is challenging due to the presence of noise and artifacts in the raw RGB signal. To address these challenges, we introduce a novel, robust, and efficient method called Orthogonal Matrix Image Transformation (OMIT), which we integrated into the previously proposed pipelines.
OMIT is grounded in matrix decomposition techniques and aims to generate an orthogonal matrix with linearly uncorrelated components that represent orthonormal components in the RGB color basis. This allows for the accurate recovery of physiological signals. OMIT employs the reduced (or thin) QR factorization [57] in conjunction with Householder reflections [58][59] to find linear least-squares solutions in the RGB space. Thin QR factorization offers improved memory efficiency and computational speed compared to full QR factorization, particularly for tall and skinny matrices [57]. Furthermore, the Householder orthogonalization algorithm provides better numerical stability, computational efficiency, and conditioning than the Gram-Schmidt process [57][59]. These advantages make it particularly suitable for handling noisy or corrupted data matrices and for extracting rPPG signals from raw RGB signals with greater accuracy and efficiency [60][61].
The mathematical foundation of OMIT is the QR decomposition:

A = QR,    (3)

where A ∈ ℝ^(n×3) represents the input RGB matrix, Q ∈ ℝ^(n×3) denotes an orthonormal basis for the column space of A, and R ∈ ℝ^(3×3) is an upper triangular matrix containing the coefficients that express the columns of A as linear combinations of the basis vectors in Q. We then use the orthogonal matrix Q to compute a projection matrix that allows us to extract the BVP signal from the input matrix A.
The OMIT method comprises the following key steps, illustrated in Figure 7:

1. Reduced QR decomposition using Householder reflections: Compute the thin QR decomposition of the input RGB matrix [57][62]. For a given input matrix A of dimensions n×3, the Householder reflectors (H_i) are computed iteratively, transforming A into an upper triangular matrix R. In each iteration, H_i is an n×n matrix designed to eliminate the elements below the diagonal in the i-th column of A or its intermediate form. After k iterations (in our case, k = 3), the product of these H_i matrices yields the orthogonal matrix Q, while the transformed A becomes the upper triangular matrix R. Mathematically, Q = H_1 H_2 ··· H_k. The Q matrix is semi-orthogonal, meaning that its columns are orthonormal, i.e., orthogonal with unit norm. In our case, Q is an n×3 matrix, and its columns (q_1, q_2, q_3) form an orthonormal basis for the column space of the input RGB matrix A. The first column, q_1, represents the direction in the RGB space that captures the most significant variations in the input data. In the context of rPPG, these variations are typically associated with changes in skin color due to ambient lighting, camera sensor noise, facial movements, and other artifacts.

2. Subspace projection matrix calculation:
The first column of Q, denoted S, is used to compute the projection matrix P. This step creates a subspace orthogonal to the direction of S, computed as P = I_n − SSᵀ. The projection matrix P is an n×n matrix calculated as the difference between the identity matrix of dimension n (I_n) and the outer product of the vector S with itself. This matrix projects the input data onto a subspace orthogonal to S. In other words, P removes the contributions associated with the dominant variations in the input data (captured by q_1), which are typically unrelated to the BVP signal.

3. Orthogonal projection and BVP extraction:
In the orthogonal projection step, the input data (the RGB matrix) is projected onto a subspace orthogonal to q_1 using the calculated projection matrix P. This process is mathematically represented as Y = PA. The purpose of this step is to remove the contributions associated with the dominant variations in the input data, which typically correspond to factors such as lighting conditions and facial movements reflected in the three color channels. The orthogonal projection has significant implications in the OMIT method, as it effectively separates the BVP signal from the raw RGB data: the BVP-related information is preserved while the unrelated noise and artifacts are suppressed. The BVP signal is extracted from the second column of the Y matrix rather than the first, since the first column corresponds to the dominant variations that were removed during the orthogonal projection step. In this context, the second column of Y represents the processed signal containing the BVP information after removing the dominant variations.
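The three steps above condense into a few lines of linear algebra. The sketch below is a minimal NumPy rendering of OMIT in which the RGB traces are arranged as a 3×N matrix (channels as rows), so that q_1 is a direction in RGB space; the orientation is an implementation choice, and with this layout the "second column of Y" in the notation above corresponds to the second row.

```python
# Sketch: OMIT in NumPy (3 x N layout, channels as rows).
import numpy as np

def omit(rgb):
    """rgb: (3, N) raw mean-color traces. Returns a BVP estimate of length N."""
    q, _ = np.linalg.qr(rgb)         # thin QR (Householder-based in LAPACK)
    s = q[:, 0].reshape(-1, 1)       # q1: dominant direction in RGB space
    p = np.identity(3) - s @ s.T     # projector orthogonal to q1
    y = p @ rgb                      # orthogonal projection of the RGB data
    return y[1, :]                   # second component carries the BVP
```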
QR decomposition has been extensively applied across various domains such as communications, signal processing, image processing, and machine learning to address challenges associated with corrupted input data matrices due to noise or artifacts [60] [61].
QR decomposition offers several advantages over other decomposition methods such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) [62][63]. It is computationally efficient, mathematically stable, and robust to input data matrices corrupted by noise or artifacts [64][61][60]. Furthermore, it is widely used in fields such as communications [65], signal processing [66], image processing [67], and machine learning [68], and it is available in high-performance linear algebra libraries such as LAPACK [69] and Intel® MKL [70], making it suitable for real-time applications [71]. Leveraging these advantages, OMIT produces an orthogonal matrix with linearly uncorrelated components, effectively segregating the rPPG signal from the original RGB data. This allows OMIT to extract the concealed blood pulse signal from the raw RGB data more accurately and efficiently, paving the way for enhanced rPPG signal processing and analysis. By incorporating the robustness and stability of QR decomposition, OMIT outperforms eigenvalue decomposition (EVD) and other decomposition techniques, making it a compelling choice for rPPG signal extraction.

Fig. 7: Process steps to convert an RGB signal to a BVP signal using the OMIT method.

Benchmark datasets and Evaluation Metrics
To evaluate the proposed methodology, we follow the extensive evaluation protocol presented in the literature [35]. Our evaluation includes interfaces to work with six publicly available datasets:

PURE is a database that contains 10 subjects performing several controlled head motions [72]. The sessions were recorded using six different setups (steady, talking, slow translation, fast translation, slow rotation, and medium rotation), resulting in 60 sequences of 1 minute each. The videos were captured using an industrial-grade camera (eco274CVGE by SVS-Vistek) at a sampling rate of 30 Hz, with an uncompressed cropped resolution of 640x480 pixels and an approximate average distance of 1.1 meters. The reference pulse data was captured in parallel using a contact-based, FDA-approved fingertip pulse oximeter (pulox CMS50E) with a sampling rate of 60 Hz.
COHFACE is a remote photoplethysmography (rPPG) dataset that contains RGB videos of faces synchronized with the heart rate and breathing rate of the recorded subjects [73]. It contains videos of 40 subjects (12 female and 28 male). The video sequences were recorded using a Logitech HD C525 webcam at a sampling rate of 20 Hz and a resolution of 640x480 pixels. The database includes a total of 160 videos of approximately 1 minute each. Reference physiological data was recorded using medical-grade equipment.
The LGI-PPGI-Face-Video-Database contains 25 subjects, of which only 6 were officially released [19]. It was recorded using a Logitech HD C270 webcam at a sampling rate of 25 Hz and a resolution of 640x480 pixels, in uncompressed format with auto-exposure. Reference physiological measurements were recorded simultaneously using a contact-based, FDA-approved fingertip pulse oximeter (pulox CMS50E) with a sampling rate of 60 Hz. The database contains subjects in four different scenarios: resting, rotation, talking in the street, and gym. An image of the four scenarios is depicted in Figure 8.
The UBFC-RPPG video dataset is an rPPG database comprising two different subsets: UBFC1 and UBFC2 [74]. UBFC1 contains 8 videos in which the participants were asked to sit still in an office room under unconstrained conditions and natural light. UBFC2 contains 42 videos under constrained conditions, inducing changes in the BVP by asking participants to play mathematical games. The database presents a wide variety of ethnicities with different facial skin tones, as shown in Figure 8.
The database was recorded using a webcam (Logitech C920 HD Pro) at a sampling rate of 30 Hz and a resolution of 640x480 pixels in uncompressed 8-bit RGB format. Each video is approximately two minutes long. The reference physiological data was synchronized and recorded at the same time using a contact-based, FDA-approved fingertip pulse oximeter (pulox CMS50E) with a sampling rate of 60 Hz.

Fig. 8: In the first row, the LGI-PPGI database [19] contains four different scenario recordings, from left to right: 1) resting, 2) rotation or head motions, 3) talking, and 4) gym. In the second row, the UBFC-RPPG database [74] contains mostly two different scenarios: UBFC1 contains videos recorded in an office room under unconstrained conditions and natural light (first two images from the left), while UBFC2 contains videos under controlled conditions but performing a stress task (last two images).
MAHNOB-HCI is a multimodal database captured mainly for emotion recognition [75]. It contains 27 young, healthy participants: 16 female and 11 male. The database was recorded with several cameras; the frontal camera was an Allied Vision Stingray F-046C color camera with a resolution of 780x540 pixels at 60 frames per second. The videos in the MAHNOB-HCI database are highly compressed in H.264/MPEG-4, making them very challenging for the extraction of remote PPG signals. The reference signals were captured using an ECG sensor from the Biosemi active II system with active electrodes. The database includes 527 facial videos with corresponding physiological signals. In our evaluation of the different pipelines (Table 1), we used a smaller subset of 36 videos from the MAHNOB-HCI database, as suggested in [35], to ensure a direct and fair comparison with the results first presented in [35]. For the comparisons with the state of the art (Table 4), we employed the full dataset of 527 videos, providing a comprehensive assessment of our method's performance.

The evaluation on the datasets is done by comparing the heart rate estimations of both the extracted rPPG signal and the reference ECG or PPG signal. The evaluation includes both error and statistical analysis. We use three standard metrics that measure the discrepancy between our predicted heart rate ĥ(t) and the reference heart rate h(t): Mean Absolute Error (MAE), Root-Mean-Square Error (RMSE), and the Pearson Correlation Coefficient (PCC) of the heart-rate envelope. Our primary objective is to advance unsupervised rPPG extraction techniques in challenging conditions, rather than to pursue waveform similarity with ground-truth PPG signals, as is the case with deep learning-based methods. Given the anatomical differences in blood perfusion waveforms between the face and the finger [76], we employ the PCC between heart rate envelopes as a more suitable evaluation metric, instead of direct waveform comparisons.
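For completeness, the three metrics reduce to a few lines when computed between the per-window HR envelopes:

```python
# Sketch: MAE, RMSE, and Pearson correlation between HR envelopes.
import numpy as np

def mae(hr_ref, hr_est):
    return np.mean(np.abs(hr_ref - hr_est))

def rmse(hr_ref, hr_est):
    return np.sqrt(np.mean((hr_ref - hr_est) ** 2))

def pcc(hr_ref, hr_est):
    return np.corrcoef(hr_ref, hr_est)[0, 1]
```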

Reference data, ground-truth and evaluation protocol
The most common source of BVP reference data in the datasets is PPG data from contact-based pulse oximeters. In most cases, these data are already filtered and do not require further preprocessing. To extract the heart rate or other HRV parameters, the reference signals are processed using spectral analysis.
In the evaluation, we compute the error by comparing the heart rate and HRV parameters extracted from the reference (ground-truth) signals and from the recovered PPG signal. Although a direct comparison of signals (e.g., morphology) would also be possible, the fundamental differences between the extracted rPPG and the reference signals in terms of delay and scale, due to different body measurement points and diverse collection devices, make such a comparison not very meaningful [76].
We have observed that the reference data offered in the datasets is not completely free of problems. For example, Figure 9 shows an example reference signal with a gap of approximately 2 seconds. These issues, caused by small deficiencies in data collection, can lead to unfair disagreements in terms of error, especially for unsupervised methods.

Experimental results
We evaluate and analyze the proposed methodologies and pipelines to extract remote photoplethysmography signals from all benchmark databases. We compare the results across the different improved processing pipelines and against the state of the art for both supervised and unsupervised methods. The experiments are performed on a computer with an AMD® Ryzen™ 3700X 8-core processor at 3.6 GHz.

Hyperparameters and configuration
Our framework is based on separate configuration files, in a similar manner to other frameworks [35]. These files contain the parameters that govern the pipeline and its components. In our experiments, we set the values for each pipeline component as follows. Face detection uses the DNN OpenCV face detector with the default TensorFlow model. Face alignment uses the DAN algorithm with one of the default models provided by the authors (DAN-Menpo.npz) [41]; real-time configurations can use a modified ERT model [42]. DMRS uses a grid matrix of n×n = 9×9. For D_KFD, we select the signals with a threshold greater than 0.85, and for DFA we set an α threshold between 0.75 and 1.0. We set the maximum number of valid regions to r_max = 32. Filtering is performed using FIR filters with Kaiser windows, with the parameter β = 25. The filters use a bandpass configuration between 0.75 and 4 Hz (corresponding to 45-240 bpm). Signal windowing uses sliding windows of 10 seconds with 1-second steps (9 seconds of overlap).
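Gathered into one place, the configuration above could be expressed as the following illustrative dictionary (the framework's actual file format may differ):

```python
# Hypothetical configuration mirroring the reported hyperparameters.
CONFIG = {
    "face_detection": "opencv_dnn_ssd",          # default TensorFlow model
    "face_alignment": "DAN-Menpo.npz",           # modified ERT for real time
    "dmrs": {"grid": (9, 9), "kfd_threshold": 0.85,
             "dfa_alpha_range": (0.75, 1.0), "r_max": 32},
    "filtering": {"type": "fir", "window": ("kaiser", 25),
                  "band_hz": (0.75, 4.0)},       # i.e., 45-240 bpm
    "windowing": {"length_s": 10, "step_s": 1},  # 9 s overlap
}
```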

Quantitative results
We provide an extensive evaluation of our three proposed pipelines and compare them with the baseline [35] in its standard configuration. We obtain results on six datasets. All datasets comprise videos with VGA resolution, but two of them (MAHNOB and COHFACE) are heavily compressed. We measure the performance by computing the average of the MAE, the standard deviation of the MAE, and the median of the Pearson Correlation Coefficient of the heart rate envelope. We evaluate the pipelines using ten rPPG methods (RGB-to-PPG signal conversion methods), including our proposed OMIT conversion. The results detail the impact of the improvements in each pipeline, as shown in Table 1.
The Multi-region pipeline, with our proposed improvements, achieves the best results across all six datasets, improving MAE, error standard deviation, and Pearson Correlation Coefficient of the heart rate envelope.
Analyzing the different rPPG conversion methods, CHROM and POS perform best on uncompressed databases across all pipelines, with OMIT closely following. OMIT works well on highly compressed videos, obtaining the best results for the challenging MAHNOB dataset.
Comparing results across datasets, the mean average error varies significantly depending on the nature of the data. Good-quality, static datasets like UBFC or PURE show an error below 2 bpm, with minor differences across videos. For the LGI-PPGI dataset with natural movement, the best average error reaches nearly 4 bpm, with a reasonably high standard deviation. The worst results correspond to lower-quality datasets like MAHNOB and COHFACE, with average errors between 8 and 12 bpm. Heavy video compression and low illumination can cause low SNR and a loss of signal subtleties [77].
Within each dataset, the error varies across the different videos, as reflected in the standard deviation of the MAE. As a more detailed example, we computed the error and the Pearson Correlation Coefficient, with their mean, maximum, minimum, and standard deviation, for 9 RGB-to-PPG conversion methods on the COHFACE and UBFC1 datasets with the Multi-region pipeline. We depict this graphically in Figure 10. It can be seen that for the high-video-quality dataset there is some variability across rPPG conversion methods, while the variability among videos is relatively low. For the heavily compressed dataset, the error and its variability across videos are generally higher, but stable across different rPPG methods. Although some variations across methods, datasets, and pipelines exist, it is possible to conclude that the modifications introduced in the pipelines provide a consistent improvement, although the databases with faces mostly still in front of the camera (PURE and UBFC) show only modest improvements compared with those achieved on complicated datasets (LGI-PPGI, COHFACE, MAHNOB).
The improvements are most noticeable in videos with varied facial expressions, head movements, or illumination changes. To illustrate this, we disaggregate the results obtained on the LGI-PPGI dataset, which is divided into four different scenarios. The resting scenario is a reference scenario where the participants sit still in front of the camera with no head or facial movements and mostly static illumination. In the rotation scenario, the subjects are in the same setup but perform a series of head motions and rotations. In the talking scenario, the subjects are in the wild (mainly in the street), under unconstrained conditions, talking in video-conference mode with sudden face and head motions. The illumination is natural light, with strong backlight conditions in some of the videos, producing low dynamic range (LDR) images. The last session represents a sports scenario recorded in a gym, where the subjects freely perform physical exercise on a static bicycle. The illumination comes mainly from ceiling lights, and in some videos we can appreciate a flickering effect due to them. These last two represent typical scenarios where we would need to remotely extract physiological signals, such as during video conferences for remote healthcare or sports performance monitoring. We show the results in Table 2, comparing the Baseline and Multi-region pipelines using our proposed OMIT method for RGB-to-PPG conversion.

Qualitative results
We provide a visual representation of heart rate estimations from various pipelines and methods in Figure 11. It shows the performance of four rPPG methods (CHROM, OMIT, POS, and GREEN) using the Multi-region pipeline on a PURE dataset video. CHROM, OMIT, and POS show similar performance, while GREEN struggles to track the pulse rate during certain segments. The PCC metric reflects the similarity between the estimated (red) and ground-truth (blue) envelopes. Figure 12 shows the performance of the four rPPG methods using the Multi-region pipeline on a MAHNOB database video. Despite the subject being static, high compression causes a loss of detail in the raw RGB signals. CHROM and OMIT handle the compression challenges better than GREEN and POS.

Evaluation of the number of regions
To evaluate the impact of the DMRS module in the Multi-region pipeline, we designed a complementary experiment that measures how the results are affected by the number of facial regions used in the initial grid. We use CHROM as the RGB-to-rPPG conversion method, while all other parameters remain the same except the number of initially available regions to select from. In addition to region grids, the comparison also includes the typical fixed regions of the face, such as the forehead and cheeks, as depicted in Figure 13. The results of this experiment are shown in Table 3. They show that, in general, a moderately large number of regions results in smaller errors. The approach using fixed patches shows results comparable to configurations with a low number of regions and proved to be still useful in some cases.

Comparing with the state of the art (Table 4), it can be seen that our method, relying on the Multi-region pipeline, obtains better results than all unsupervised non-learning-based methods across all six benchmark datasets. Our results are also comparable to some recent supervised methods that require training on videos taken in similar conditions. Additionally, as shown in [79], our method is highly efficient, taking only 17 ms per frame, and under 33 ms with face detection and alignment. This outperforms deep learning methods, which generally require longer processing times per frame, underscoring our approach's computational advantage.

Conclusion
In this article, we proposed a new unsupervised pipeline for the extraction of BVP signals from facial videos (rPPG). To enable a fair comparative evaluation among methods, we solved a set of smaller technical challenges, such as problems with signal synchronization, the use of different spectral analysis methods for extracted and reference signals, and the inconsistent use of pipeline modules such as face detection and tracking or filtering. We proposed three novel contributions that improve the extraction of rPPG signals, especially in challenging conditions. First, we included a face normalization module, based on facial landmarks and a fixed triangle mesh, that allows the extraction of signals from exactly the same facial regions in a consistent manner. Second, we added a dynamic selection of facial regions that statistically discards the regions showing noise and artifacts. Finally, we proposed a novel RGB-to-PPG conversion method that increases the robustness of the extraction against compression artifacts. Our enhanced pipeline works in a purely unsupervised manner and is directly applicable to datasets collected in multiple conditions without any need for training data. The proposed pipeline achieves state-of-the-art results across multiple databases when compared with other unsupervised methods and shows results comparable to those of supervised, learning-based methods.

Fig. 1 :
Fig. 1: Typical unsupervised methodology for remote photoplethysmographic (PPG) imaging using an RGB camera. It comprises several steps: 1) face detection and alignment, 2) skin segmentation, 3) ROI selection, 4) extraction of the raw signals from the ROIs, 5) signal filtering, 6) RGB-to-PPG transformation, and 7) spectral analysis and post-processing.

Fig. 2 :
Fig. 2: Facial landmark detection using the DAN model under extreme head poses and on frontal faces.

Fig. 3 :
Fig. 3: Delay between reference BVP and remote PPG signals from proposed pipelines, induced by factors like filtering, blood perfusion differences, or camera distance.

Fig. 4 :
Fig. 4: Estimated HR from the reference BVP signal and the extracted facial rPPG signal using the Baseline pipeline. We can observe the asynchrony due to the different signal sources.

Fig. 5 :
Fig. 5: Face normalization process. Left to right: detected landmark points, fixed triangle face mesh, and normalized face after mapping triangles to fixed coordinates.

Fig. 6 :
Fig. 6: Comparison of candidate regions of interest for a frontal face under good lighting conditions (top row) and a non-frontal face under suboptimal lighting conditions (bottom row). The system selects 32 regions in the frontal face and 20 regions in the non-frontal face. Grey regions are discarded.

Fig. 9 :
Fig. 9: Section of a reference PPG signal recorded with a fingertip contact-based pulse oximeter, where we can observe an error of ≈ 2 seconds, probably due to movement of the finger and the consequent loss of signal tracking.

Fig. 10 :
Fig. 10: MAE on a logarithmic scale for the COHFACE (top) and UBFC1 (bottom) databases using the Multi-region pipeline with 9 different rPPG methods. The middle line in each box indicates the median for every method.

Fig. 11 :
Fig. 11: Comparison of HR estimation using four rPPG methods with the Multi-region pipeline. The estimated heart rate (red) is extracted from the face, and the contact-based reference PPG signal (blue) is computed on a single PURE dataset video.

Fig. 12 :
Fig. 12: Comparison of HR estimation using four rPPG methods with the Multi-region pipeline. The estimated heart rate (red) is extracted from the face, and the contact-based reference PPG signal (blue) is computed on a single MAHNOB video.

Fig. 13 :
Fig. 13: Region selection based on normalized fixed patches.From left to right: face landmark detection, forehead patch, cheek patches and both combined.

Table 1 :
Error comparison between the Baseline pipeline and our three proposed improved pipelines.

Table 2 :
Performance of the Baseline and Multi-region pipelines on the LGI-PPGI dataset for different human activities.

Table 3 :
Impact of the number of regions on rPPG extraction using the Face2PPG-Multi pipeline and CHROM.

Table 4 :
Comparison of the Face2PPG-Multi pipeline with state-of-the-art supervised learning-based (orange) and unsupervised non-learning-based (blue) methods.