Remote Blood Oxygen Estimation From Videos Using Neural Networks

Peripheral blood oxygen saturation (SpO2) is an essential indicator of respiratory function and received increasing attention during the COVID-19 pandemic. Clinical findings show that COVID-19 patients can have significantly low SpO2 before any obvious symptoms appear. Measuring an individual's SpO2 without physical contact lowers the risk of cross contamination and sidesteps the measurement difficulties caused by impaired peripheral circulation. The prevalence of smartphones has motivated researchers to investigate methods for monitoring SpO2 using smartphone cameras. Most prior schemes involving smartphones are contact-based: they require a fingertip to cover the phone's camera and the nearby light source to capture the light reemitted from the illuminated tissue. In this paper, we propose the first convolutional neural network based noncontact SpO2 estimation scheme using smartphone cameras. The scheme analyzes videos of an individual's hand for physiological sensing, which is convenient and comfortable for users, protects their privacy, and allows face masks to be kept on. We design explainable neural network architectures inspired by optophysiological models for SpO2 measurement and demonstrate the explainability by visualizing the weights for channel combination. Our proposed models outperform the state-of-the-art model designed for contact-based SpO2 measurement, showing the potential of the proposed method to contribute to public health. We also analyze the impact of skin type and the side of the hand on SpO2 estimation performance.


I. Introduction
Peripheral blood oxygen saturation (SpO2) is an important physiological parameter that represents the level of oxygenation in the blood and reflects the adequacy of respiratory function [1]. The estimation and monitoring of SpO2 are essential for the assessment of lung function and the treatment of chronic pulmonary diseases. It has been reported that COVID-19 patients can present with clinically significant hypoxemia, or a drop in blood oxygen saturation, yet many do not exhibit respiratory symptoms [2], [3]. This highlights the importance of early detection of changes in SpO2 to facilitate timely management of asymptomatic patients with clinical deterioration. Conventional SpO2 measurement methods rely on contact-based sensing, including fingertip pulse oximetry and its variants in smartwatches and smartphones [4], [5], [6], [7]. These contact-based methods present the risk of cross contamination between individuals using the same measurement device. An additional issue with contact-based methods is limb perfusion, especially in the digits. Circulation to the fingers and toes is often impaired in many cardiovascular and pulmonary diseases, which complicates the measurement of SpO2. Also, pulse oximeters may not be widely available in marginalized communities and some undeveloped countries [8].
To provide a more comfortable and unobtrusive way to monitor SpO2 that could be adopted in health screening and telehealth, a growing number of studies have investigated SpO2 measurement using videos [9], [10], [11], [12], [13], [14], [15], [16], [17], which allows for SpO2 estimation without contact. However, contactless SpO2 measurement using cameras, especially smartphone cameras, is challenging because 1) the physiological signal acquired in a contactless setting is much weaker than in the contact-based setting, and 2) the broadband absorption nature of smartphone camera sensors lowers the optical selectivity needed to derive SpO2. Most prior art resorts to explicit feature extraction as a variant of the principle of pulse oximeters. In contrast, in this paper, we introduce explainable convolutional neural network (CNN) models that extract features from all three color channels holistically for SpO2 measurement from contactless videos captured using consumer-grade smartphone cameras. To the best of our knowledge, there is no prior work that remotely monitors SpO2 with regular RGB cameras using neural networks.
An overview of our proposed system for contactless SpO2 monitoring using a CNN on smartphone videos is shown in Fig. 1. First, the region of interest (ROI), including the palm and the back of the hand, is extracted from the smartphone-captured videos. Second, the ROI is spatially averaged to produce R, G, and B time series. Next, the three time series are fed into an optophysiology-inspired CNN for SpO2 estimation, which is designed based on the light-tissue interaction principle applied to physiological sensing and aims for better explainability. We consider the hand region in this work as a proof of concept. Compared to using the face for SpO2 measurement as most prior art did [14], [16], recording hand videos raises fewer privacy concerns and is a safer way for health screening and data collection during the COVID-19 pandemic given the mask-wearing guidelines.
The contributions of our work are summarized as follows:

• To the best of our knowledge, this is the first work to use the synergy of all color channels and optophysiology-inspired neural networks to address the challenging problem of contactless SpO2 sensing using consumer-grade RGB cameras.
• Through a data-driven approach and visualization of the weights for the RGB channel combinations, we demonstrate the explainability of our model and show that the choice of color band learned by the neural network is consistent with the color bands suggested by optophysiological methods.
• We analyze the impact of the two sides of the hand and different skin tones on the quality of SpO2 estimation.
• We achieve more accurate SpO2 prediction with our optophysiologically inspired neural network structures when compared to the state-of-the-art neural network structure designed for this problem.

II. Background and Related Work
A. Blood Oxygen Saturation and the Ratio-of-Ratios Principle
The protein molecule hemoglobin (Hb) in the blood carries oxygen from the lungs to the tissues of the body. The level of blood oxygen saturation (SpO2) represents the ratio of oxygenated hemoglobin (HbO2) to total hemoglobin and indicates the adequacy of respiratory function [1]. The normal range of SpO2 is 95% to 100% [1]. SpO2 is an important indicator of the ability of the respiratory system to meet metabolic demands. An abnormal drop in SpO2 can serve as an early warning sign of inadequate oxygenation and clinical deterioration [1]. A convenient and noninvasive way to continuously measure SpO2 is pulse oximetry [4].
Pulse oximeters utilize the ratio-of-ratios principle first proposed by Aoyagi in the early 1970s [4] and are commonly used today in hospitals, clinics, and homes. The ratio-of-ratios method leverages the difference in optical absorbance of Hb and HbO2 at two wavelengths, conventionally red and infrared, as indicated in Fig. 2. In common pulse oximeters, light at the two wavelengths is emitted through the fingertip. The transmitted light, attenuated by its interaction with blood and tissue and received by an optical sensor, conveys information about the pulsatile blood volume. The pulsatile blood volume at the two wavelengths is further processed to produce an SpO2 estimate. Specifically, SpO2 has been shown to be approximately linear in the ratio of ratios, which is computed as the ratio between the quotients of the pulsatile/AC component and the relatively stationary/DC component of the transmitted pulse signals at the red and infrared wavelengths [4].
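The ratio-of-ratios computation described above can be sketched in a few lines. This is a simplified illustration: the peak-to-peak AC estimate and the linear calibration coefficients below are hypothetical placeholders, not the empirically calibrated curves used in real devices.

```python
import numpy as np

def ratio_of_ratios(red, ir):
    """Ratio of (AC/DC) of the red signal to (AC/DC) of the infrared signal."""
    def ac_over_dc(x):
        dc = np.mean(x)            # quasi-stationary component
        ac = np.ptp(x)             # peak-to-peak pulsatile component (simplified)
        return ac / dc
    return ac_over_dc(red) / ac_over_dc(ir)

# Synthetic 4-second pulse signals at 30 Hz with different modulation depths
t = np.arange(0, 4, 1 / 30)
red = 1.0 + 0.02 * np.sin(2 * np.pi * 1.2 * t)   # ~72 bpm pulse
ir = 1.0 + 0.04 * np.sin(2 * np.pi * 1.2 * t)
R = ratio_of_ratios(red, ir)
spo2 = 110 - 25 * R    # hypothetical linear calibration: SpO2 ~ a - b*R
```

With the synthetic signals above, R comes out near 0.5 and the (hypothetical) calibration maps it into the normal SpO2 range.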

B. Video Based SpO 2 Measurement
With the prevalence of smartphones, researchers have investigated methods of monitoring SpO2 using smartphones, most of which are contact-based and require the fingertip to be pressed against the illuminated light source and the built-in camera [5], [6], [7], [19], [20] so that the light diffusely reflected by the fingertip is captured by the camera. In this setup, an adapted ratio-of-ratios model is utilized with the red and blue (or green) channels of color videos in lieu of the traditional narrowband red and infrared wavelengths. These contact-based measurements can cause a burning sensation after several minutes of contact with the illuminated flashlight and are not suitable for sensitive skin or prolonged use.
Contactless SpO2 measurement from videos, with a weaker acquired pulse signal than contact sensing, has been investigated to mitigate the issues of contact-based measurement. Based on the setup of cameras and light sources, existing noncontact, video-based SpO2 estimation methods can be grouped into two main categories, both of which leverage the differences in the optical characteristics of Hb and HbO2. Methods in the first category utilize monochromatic sensing similar to conventional pulse oximetry. They use either high-end monochromatic cameras with selected optical filters or controlled monochromatic light sources [9], [10], [11], [12]. The monochromatic light sources and sensors are selected to provide accurate control of the absorption effect of hemoglobin for precise SpO2 measurement. The other category uses consumer-grade RGB cameras that are more accessible than the monochrome setup, such as digital webcams and smartphone cameras [14], [15], [16], [17]. SpO2 measurement using those cameras is more challenging because, compared to the narrowband photosensors in pulse oximeters, each R, G, and B channel of the regular photographic cameras used by smartphones today senses a much wider range of wavelengths; the aggregation of the broadband signals lowers the difference in optical sensing between oxygenated and deoxygenated hemoglobin, making it much less optically selective than narrowband sensing systems. To address this challenging scenario, most previous noncontact SpO2 measurement work stuck to explicit feature extraction and linear regression as variants of the two-color-channel ratio-of-ratios methods [14], [15], [16], [17]. In contrast, we address this issue by utilizing all three color channels strategically. More specifically, in this paper, we use neural networks to distill the SpO2 information from the color channels in a holistic way.
Neural networks have also been applied to related video-based physiological sensing tasks. An end-to-end convolutional attention network was proposed in [22] to estimate the blood volume pulse from face videos; frequency analysis is then conducted on the estimated pulse signal for heart rate and breathing rate tracking. The study in [21] demonstrates that the heart rate can be directly inferred using a convolutional network with spatial-temporal representations of the face videos as its input. Mobile applications have been developed that utilize CNNs to measure body temperature from facial images [24].

C. Deep Learning
Deep learning for SpO2 monitoring from videos is still in its early stage. Ding et al. [7] recently proposed a convolutional neural network architecture for contact-based SpO2 monitoring with smartphone cameras. Even though the work in [7] showed better performance than the conventional ratio-of-ratios method, their technique requires the users' fingertips to be in contact with the illuminated flashlight and camera, which not only may cause a burning sensation over a continuous period of time but also raises sanitation concerns, especially if the sensing device is shared by multiple participants during pandemics. Inspired by the optophysiological model for SpO2 measurement [5], [10], [25], we develop a deep learning architecture that monitors SpO2 in a contactless way with regular RGB cameras and has the potential to be adopted in health screening and telehealth.
Table I compares our proposed method with existing SpO2 measurement methods. We detail the uniqueness of our method as follows. Our contact-free method helps reduce germ transfer and the discomfort caused by heat from contact with the flashlight on. It uses the hand as the ROI for better privacy protection. Its smartphone-camera-based sensing is ubiquitous and convenient to use. Its neural network structure is inspired by light-tissue interaction principles, which makes it more explainable than black-box-style neural networks. Lastly, the proposed method adopts an implicit way of feature extraction and model fitting instead of strictly following the conventional ratio-of-ratios principle, which better copes with the more challenging scenario of contactless SpO2 sensing.

III. Proposed Method for SpO 2 From Videos
We aim to estimate SpO2 levels from a hand video based on the fact that the color of the skin changes subtly as red cells in the blood flow carry and release oxygen. Our proposed method, as illustrated at a high level in Fig. 1, extracts three color time series from the skin area of the hand video. The extracted time series are then fed to optophysiology-inspired neural networks designed to achieve more explainable and accurate SpO2 predictions.

A. Extraction of Skin Color Signals
The physiological information related to SpO2 is embedded in the color of the reflected/reemitted light from a person's skin. Hence, a preprocessing step that precisely extracts the color information from the skin area is crucial to the design of an effective SpO2 estimation method. For each participant's video, we aim to extract the R, G, and B time series, and we refer to these 1-D time series as skin color signals. We first need to locate the ROI of the skin pixels in the video. We have found that it is most effective to discriminate the skin pixels from the background along the Cr axis of the YCbCr color space [26].
We use Otsu's method [27] to determine a threshold that best separates the skin pixels from the background by minimizing the variance within the skin and non-skin classes. Once the ROI corresponding to the hand is located, the R, G, and B time series are generated by spatially averaging the values of the skin pixels in each frame of the video.
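The Otsu thresholding step can be illustrated with a numpy-only sketch. The bimodal Cr values below are synthetic stand-ins for the Cr channel of a real video frame.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Threshold maximizing between-class variance (equivalently,
    minimizing within-class variance), per Otsu's method."""
    hist, edges = np.histogram(values, bins=bins)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                      # cumulative class probability
    mu = np.cumsum(prob * np.arange(bins))       # cumulative mean (bin-index units)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1 - omega))
    return edges[np.nanargmax(sigma_b) + 1]

# Synthetic Cr values: skin pixels cluster high, background pixels low
rng = np.random.default_rng(0)
cr = np.concatenate([rng.normal(140, 5, 2000), rng.normal(110, 5, 2000)])
thresh = otsu_threshold(cr)
skin_mask = cr > thresh        # pixels above the threshold are labeled skin
```

In practice the mask would be computed per frame on the Cr plane of the YCbCr-converted image; here the threshold lands between the two clusters, separating skin from background.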
The skin color signals are split into 10-second segments using a sliding window with a step size/stride of 0.2 seconds to serve as the inputs to the neural networks. From an optophysiological perspective, the light reflected/reemitted from the skin over one heartbeat cycle, i.e., 0.5-1 seconds for a heart rate of 60-120 bpm, should contain almost all the information necessary to estimate the instantaneous SpO2 [4]. In our system design, we use longer segments to add resilience against sensing noise. Since the segment length is one order of magnitude longer than the minimum length required to contain the SpO2 information, we can use a fully connected or convolutional structure to adequately capture the temporal dependencies without resorting to a recurrent neural network structure.
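The windowing can be sketched as follows: at 30 fps, a 10-second window is 300 samples and a 0.2-second stride is 6 samples.

```python
import numpy as np

def sliding_segments(signal, fps=30, win_sec=10.0, stride_sec=0.2):
    """Split a (C, N) multi-channel time series into overlapping windows."""
    win = int(win_sec * fps)       # 300 samples per segment
    step = int(stride_sec * fps)   # 6-sample stride
    n = signal.shape[-1]
    starts = range(0, n - win + 1, step)
    return np.stack([signal[..., s:s + win] for s in starts])

rgb = np.random.rand(3, 30 * 60)   # one minute of R, G, B skin color signals
segs = sliding_segments(rgb)       # shape: (num_segments, 3, 300)
```

Each resulting (3, 300) segment is one neural network input.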

B. Neural Network Architectures
Previous neural network work for SpO2 prediction mainly explored prediction accuracy, but not model explainability [7]. Explainability/interpretability is highly desirable in many applications yet often not sufficiently addressed, partly due to the black-box nature of neural networks. From a healthcare standpoint, explainability is a key factor that should be taken into account from the beginning of a system's design. Explainability is important for physiological and healthcare applications to validate and provide strong justification to users/clinicians that the proposed approach has a sound rationale and is trustworthy.
We have learned this firsthand from discussions with collaborating medical professionals.
Confirming that the neural network model is learning clinically relevant information helps ensure that the estimation capability can be replicated by others; this can be a concern if the performance varies too significantly across different subjects. Explainability may also be necessary for legal, certification, and regulatory purposes. Explainability in publicly funded research is also widely sought by governmental funding agencies to cope with taxpayers' increasing expectations of transparency.
To extract features from the skin color signals and estimate SpO2, we propose three physiologically motivated neural network structures. These structures are inspired by domain-knowledge-driven physiological sensing methods and designed to be physically explainable. For heart rate sensing [21], [28] and respiratory rate sensing [29], the RGB skin color signals are often combined first, followed by temporal feature extraction, as is done in the plane-orthogonal-to-skin (POS) algorithm [30]. In contrast, for conventional SpO2 sensing methods such as the ratio-of-ratios [25], the temporal features are extracted first and the color components are combined at the end. Our proposed neural network structures explore different arrangements of channel combination and temporal feature extraction, and we systematically compare the performance of these explainable model structures.
Color Channel Mixing Followed by Feature Extraction: In Model 1, shown as the leftmost structure in Fig. 3, we combine the color channels first using several channel combination layers and then extract temporal features using temporal convolution and max pooling. A channel combination layer first linearly combines the C_in input channels/vectors into C_out activation vectors and then applies a rectified linear unit (ReLU) activation function to obtain the output channels/vectors. Mathematically, the channel combination layer is described as

V = σ(WU + b·1ᵀ),

where U ∈ ℝ^{C_in × L} is the input comprised of C_in time series/vectors of length L. The initial channel combination layer has an input of three channels with 300 points along the time axis. W ∈ ℝ^{C_out × C_in} is a weight matrix, where each of the C_out rows of the matrix is a different linear combination of the input channels. A bias vector b ∈ ℝ^{C_out} contains the bias terms for each of the C_out output channels, which ensures that each data point in the created segment of length L has the same intercept. 1ᵀ ∈ ℝ^{1 × L} is a row vector of all ones. The nonlinear ReLU function σ(x) = max(0, x) is applied elementwise to the activation map/matrix. The output of the channel combination layer V ∈ ℝ^{C_out × L} contains C_out channels of nonlinearly combined input channels.
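A minimal numpy sketch of one channel combination layer; the weights below are random stand-ins for learned parameters.

```python
import numpy as np

def channel_combination(U, W, b):
    """One channel combination layer: V = ReLU(W U + b 1^T)."""
    A = W @ U + b[:, None]        # bias broadcast across all L time points
    return np.maximum(A, 0.0)     # elementwise ReLU

rng = np.random.default_rng(0)
U = rng.standard_normal((3, 300))   # C_in = 3 color channels, L = 300 samples
W = rng.standard_normal((7, 3))     # C_out = 7 channel combinations
b = np.zeros(7)
V = channel_combination(U, W, b)    # (7, 300) output, all entries >= 0
```

Each row of W plays the role of one learned color-band combination; visualizing these rows after training is what underlies the explainability analysis.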
The channel mixing section concatenates multiple channel combination layers with decreasing channel counts to provide significant nonlinearity. The output of the last channel combination layer has seven channels. After the channel mixing, we apply multiple convolutional and max pooling layers with a downsampling factor of two to extract the temporal features of the channel-mixed signals. When there are multiple filters in a convolutional layer, there is also some additional channel combining, with each filter outputting a channel-mixed signal. Finally, a single node is used to represent the predicted SpO2 level. This model has three channel combination layers, three feature extraction layers, and a total of 34K trainable parameters.
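The temporal feature extraction stage can be sketched with a plain numpy convolution and max pooling step; the filter count and kernel size here are illustrative, not the tuned values.

```python
import numpy as np

def conv1d(x, w):
    """Multi-channel 1-D convolution: x is (C_in, L), w is (C_out, C_in, K)."""
    C_out, C_in, K = w.shape
    y = np.zeros((C_out, x.shape[1] - K + 1))
    for o in range(C_out):
        for c in range(C_in):
            # reversed kernel turns np.convolve into cross-correlation,
            # matching the usual deep learning convention
            y[o] += np.convolve(x[c], w[o, c][::-1], mode="valid")
    return y

def maxpool(x, factor=2):
    """Downsample along time by taking the max of non-overlapping pairs."""
    L = x.shape[1] - x.shape[1] % factor
    return x[:, :L].reshape(x.shape[0], -1, factor).max(axis=2)

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 300))    # 7 channel-mixed signals of length 300
w = rng.standard_normal((16, 7, 5))  # 16 filters, kernel size 5 (illustrative)
y = maxpool(conv1d(x, w))            # (16, 148): shorter, downsampled features
```

Stacking several such conv/pool pairs, each halving the temporal length, produces the compact features fed to the final prediction node.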
Feature Extraction Followed by Color Channel Mixing: In Model 2, the middle structure depicted in Fig. 3, we reverse the order of color channel mixing and temporal feature extraction from that in Model 1. The three color channels are fed separately into temporal feature extraction. The convolutional layers learn different features unique to each channel. At the output of the temporal feature extraction section, each color channel has been downsampled to retain only the important temporal information. The color channels are then mixed together in the same way as described for Model 1 before outputting the SpO2 value. This model has three channel combination layers, two feature extraction layers, and a total of 12K parameters.
Interleaving Feature Extraction and Channel Mixing: In our third model, we explore the possibility of interleaving the color channel mixing and temporal feature extraction steps. As illustrated by the rightmost structure depicted in Fig. 3, the input is first put through a convolutional layer with many filters and then passed to max pooling layers, performing feature extraction along the time axis as well as channel combination through each filter. The number of filters is reduced with each successive convolutional layer, gradually decreasing the number of combined channels while downsampling the signal in the time domain. This model has four layers and a total of 307K parameters.

Loss Function and Parameter Tuning:
We use the root-mean-squared error (RMSE) as the loss function for all models. During training, we save the model instance at the epoch with the lowest validation loss. The neural network inputs are scaled to have zero mean and unit variance to improve the numerical stability of the learning. The parameters and hyperparameters of each model structure were tuned using the HyperBand algorithm [31], which allows for a faster and more efficient search over a large parameter space than grid search or random search. It does this by running random parameter configurations on a specific schedule of iterations per configuration, using earlier results to select candidates for longer runs. The tuned parameters include the learning rate, the number of filters and kernel size of the convolutional layers, the number of nodes, the dropout probability, and whether to apply batch normalization after each convolutional layer.
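HyperBand builds on successive halving. The toy sketch below shows that core idea of allocating more budget to promising configurations; the objective function and configuration space are hypothetical stand-ins for actual model training.

```python
import random

def successive_halving(configs, train_eval, budget=1, eta=3):
    """One bracket of successive halving (the core of HyperBand):
    evaluate every configuration on a small budget, keep the best
    1/eta fraction, and grow the budget for the survivors."""
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: train_eval(c, budget))
        configs = scored[: max(1, len(configs) // eta)]
        budget *= eta                 # survivors get a larger training budget
    return configs[0]

# Toy objective: validation loss shrinks toward |lr - 0.01| as budget grows
def train_eval(config, budget):
    return abs(config["lr"] - 0.01) + 1.0 / budget

random.seed(0)
candidates = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(9)]
best = successive_halving(candidates, train_eval)
```

The full HyperBand algorithm runs several such brackets with different trade-offs between the number of configurations and the per-configuration budget.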

A. Dataset and Capturing Conditions
Our proposed models were evaluated on a self-collected dataset. The dataset consists of hand video recordings and SpO2 data from fourteen participants: six males and eight females between the ages of 21 and 30. Participants were asked to categorize their skin tone based on the Fitzpatrick skin types [32] shown in Fig. 4. The distribution of the participants' skin types is as follows: two participants of type II, eight participants of type III, one participant of type IV, and three participants of type V. This research was conducted under protocol #1376735 approved by the University of Maryland Institutional Review Board (IRB).
During data collection, each participant was asked to keep his/her hands still on a table to avoid hand motion. The palm of the right hand and the back of the left hand were facing the camera, as illustrated in Fig. 5. We refer to these two hand-video capturing positions as palm up (PU) and palm down (PD), respectively. Each participant was asked to follow the breathing protocol outlined in Fig. 6(a). In each breath-holding cycle, the participant breathes normally for 30-40 seconds, exhales all the way, and then holds his/her breath for another 30-40 seconds. This breath-holding protocol aims to induce a wide dynamic range of SpO2 levels. The normal SpO2 range for a healthy person is between 95% and 100%. By holding one's breath, SpO2 can drop below 90%, as shown in the distribution of SpO2 values in Fig. 6(b). This breath-holding cycle is repeated three times in each recording, and each participant has two recordings, with at least 15 minutes in between, for both the PU and PD hand-video capturing positions, resulting in a total of 56 recordings counting PU and PD separately. All videos were recorded using an iPhone 7 Plus. The participant's SpO2 was simultaneously measured using a CONTEC CMS-50E pulse oximeter clamped to the left index finger. We use this pulse oximeter as the reference measurement, as it has been validated to be within ±2% of the true SpO2 level for the range of SpO2 levels in our dataset. The video frame rate of the smartphone is 30 fps, and the sampling rate of the pulse oximeter for the reference SpO2 measurements is 1 Hz.
The reference SpO2 signal is interpolated to 5 sample points per second to match the segment sampling rate using a smooth spline approximation [33]. Each RGB segment and SpO2 value pair is fed into our models as a single data point, and the models output a single SpO2 estimate per segment. To evaluate a model on a video recording, the model is sequentially fed all RGB segments from the recording to generate a time series of preliminary SpO2 predictions. All predictions greater than 100% SpO2 are clipped to 100% based on physiological knowledge. A 10-second moving average filter is then applied to generate a refined time series of predicted SpO2 values.
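The interpolation, clipping, and smoothing steps can be sketched as follows; np.interp is a simple linear stand-in for the smoothing spline of [33], and the signals are synthetic.

```python
import numpy as np

def postprocess(preds, rate=5, win_sec=10):
    """Clip to the physiological ceiling, then apply a 10-second moving
    average (rate = number of predictions per second)."""
    preds = np.minimum(np.asarray(preds, dtype=float), 100.0)
    k = rate * win_sec                              # 50-point averaging window
    return np.convolve(preds, np.ones(k) / k, mode="same")

# Upsample a 1 Hz reference SpO2 series to 5 Hz (linear stand-in for spline)
t_1hz = np.arange(60.0)
spo2_ref = 96.0 + np.cos(t_1hz / 10.0)
t_5hz = np.arange(0.0, 59.0, 0.2)
spo2_5hz = np.interp(t_5hz, t_1hz, spo2_ref)

raw = 97.0 + 4.0 * np.sin(np.linspace(0, 6, 300))  # some values exceed 100%
smoothed = postprocess(raw)                        # clipped and smoothed
```

After postprocessing, no prediction exceeds 100% and short-term fluctuations are averaged out.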

B. Participant-Specific Results
To investigate how well the proposed models can learn to estimate a specific individual's SpO2 from his/her own data, we first conducted participant-specific experiments, that is, we learn an individualized model for each participant. The participant-specific experiment is important because 1) it sets a baseline understanding of how a model performs when the participant is known, in which case we can develop a specific model for him/her to improve personalized healthcare; and 2) it enables the analysis of the health status of an individual over time and contributes to the emerging digital-twins paradigm for personalized healthcare [34]. In each experiment, the model structure and hyperparameters are first tuned using the training and validation data. Once the model has been tuned, we train multiple instances of the model using the best tuned hyperparameters, varying between instances the random seed used for weight initialization and random oversampling. Each model instance is evaluated on the training/validation recording, and the instance that achieves the lowest validation RMSE is selected and evaluated on the test recording to obtain the final test results.

Results:
Table II shows the performance comparison of our proposed models with the prior-art model from Ding et al. [7] and the classic ratio-of-ratios method proposed by Scully et al. [5]. To the best of our knowledge, Ding et al.'s model is the only convolutional neural network structure that has been proposed for contact-based SpO2 estimation. Its structure is similar to our Model 3 but with fewer layers. We select [5] for comparison because the other noncontact SpO2 methods listed in Table I, such as [9], [11], [12], [14], [15], [16], [17], are variants of it with different light source/sensor setups, while their algorithmic core is still based on the ratio-of-ratios principle, which requires explicit AC/DC feature extraction and machine learning algorithms to model the relation between the ratio-of-ratios values and SpO2.
The performance is measured by Pearson's correlation, mean absolute error (MAE), and root mean squared error (RMSE), and the results of each condition are summarized by the median and interquartile range (IQR). The IQR quantifies the spread of an empirical distribution by computing the difference between the third quartile and the first quartile; we use it to describe the middle spread of the data as well as to distinguish outliers. Although the method of [5] achieves the best (lowest) RMSE, its correlations are the worst (lowest). This suggests that the classic ratio-of-ratios method cannot track the trend of SpO2 well using contactless smartphone measurements. All of our model configurations outperform Ding et al. [7]. For example, in the PU case for Model 3, the correlation is improved from 0.34 to 0.41 and the MAE is lowered from 3.40% to 1.81%.
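The three reported metrics and the IQR can be computed as follows; the reference/estimate arrays are toy values for illustration.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Pearson correlation, MAE, and RMSE between reference and estimate."""
    err = np.asarray(y_pred) - np.asarray(y_true)
    r = np.corrcoef(y_true, y_pred)[0, 1]
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return r, mae, rmse

def iqr(x):
    """Interquartile range: third quartile minus first quartile."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

ref = np.array([97.0, 96.0, 94.0, 92.0, 95.0, 98.0])   # toy reference SpO2
est = np.array([96.5, 96.0, 93.0, 93.0, 95.5, 97.0])   # toy predictions
r, mae, rmse = evaluate(ref, est)
```

Note that RMSE penalizes large errors more heavily than MAE, so RMSE ≥ MAE always holds.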
It is worth noting that the international standard for clinically acceptable pulse oximeters allows an error of 4% [36], and our estimation errors are all within this range.
Two factors, the skin type and the side of the hand, might influence the performance of SpO2 estimation. We therefore investigate the following two questions: (1) whether the skin type matters in the PU or PD case, and (2) whether the side of the hand matters for lighter skin (types II + III) or darker skin (types IV + V). The box plots in Fig. 9 summarize the distributions of the test correlations from all three proposed models in the PU and PD modes for (a) lighter-skin and darker-skin participants, and (b) all participants.

Bayesian statistical test:
We use Bayesian statistical tests to further analyze the results in Fig. 9 by providing a probabilistic assessment of whether the two groups being compared have the same mean [37], [38], [39], [40], [41]. We avoid using the popular t-test because it makes only a binary decision and lacks direct information about the probability of a difference between the group means of the given data [38]. In contrast, the Bayesian statistical test computes the posterior distribution of the difference between the two group means to quantify the certainty of its possible values [37]. The decision rule for the null hypothesis that the two groups have the same mean is stated in terms of the fraction of this posterior that falls in a region of practical equivalence (ROPE) around zero difference [40]. To answer question (1) about the impact of the skin type on prediction performance, we focus on the left panel of Fig. 9(a). For the PD case, only 14% of the posterior distribution of the difference between the means of the lighter and darker skin groups falls in the ROPE.
For the PU case, 23% of the posterior distribution falls in the ROPE. This suggests that it is highly credible that the skin type makes a difference for SpO2 prediction, and the difference is more certain to be observed when using the back of the hand as the ROI than when using the palm.
To answer question (2), we first focus on the left panel of Fig. 9(b), where participants of all skin types are considered together: 33% of the posterior distribution of the difference of means between the PU and PD cases falls in the ROPE. We then zoom into the darker skin group shown in the left panel of Fig. 9(a): only 15% of the posterior distribution of the difference of means between the PD and PU cases falls in the ROPE, whereas for the lighter skin group, 31% of the posterior distribution falls in the ROPE. This implies that it is highly credible that the side of the hand has some impact on SpO2 prediction, especially for the darker skin group.
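A simplified sketch of the ROPE calculation: unlike the full Bayesian model of [37], this uses a normal approximation to each group-mean posterior, and the ROPE bounds and group data below are illustrative assumptions.

```python
import numpy as np

def rope_fraction(group_a, group_b, rope=(-0.05, 0.05), draws=20000, seed=1):
    """Fraction of the posterior of mean(a) - mean(b) inside the ROPE,
    using a normal approximation to each group-mean posterior."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(group_a), np.asarray(group_b)
    post_a = rng.normal(a.mean(), a.std(ddof=1) / np.sqrt(len(a)), draws)
    post_b = rng.normal(b.mean(), b.std(ddof=1) / np.sqrt(len(b)), draws)
    diff = post_a - post_b
    return ((diff > rope[0]) & (diff < rope[1])).mean()

# Toy correlation scores for two groups of recordings
rng = np.random.default_rng(0)
near = rope_fraction(rng.normal(0.8, 0.02, 30), rng.normal(0.8, 0.02, 30))
far = rope_fraction(rng.normal(0.9, 0.02, 30), rng.normal(0.6, 0.02, 30))
```

Groups drawn from the same distribution put nearly all posterior mass inside the ROPE, while clearly separated groups put almost none there, mirroring the interpretation used above.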

C. Leave-One-Participant-Out Results
To investigate whether the features learned by the model from other participants generalize to new participants it has not seen before, we conduct leave-one-participant-out experiments. In each experiment, when testing on a certain participant, we train on all the other participants' data and leave the test participant's data out. The recordings from all the non-test participants are used for participant-wise cross-validation to select the best model structure and hyperparameters. The selected model is evaluated on the two recordings of the test participant, whose data was never seen by the model during training.
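The cross-validation protocol can be sketched as follows; the participant IDs are placeholders.

```python
def lopo_splits(participants):
    """Yield (train_ids, test_id) pairs for leave-one-participant-out CV."""
    for test_id in participants:
        train_ids = [p for p in participants if p != test_id]
        yield train_ids, test_id

participants = [f"P{i:02d}" for i in range(1, 15)]   # 14 participants
splits = list(lopo_splits(participants))             # 14 train/test splits
```

Each split trains on thirteen participants and tests on the held-out one, so the test participant's data never leaks into training or model selection.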
Table III shows the performance comparison of each model in the leave-one-participant-out experiments. Model 1 achieved the best performance in terms of correlation and achieved the best MAE and RMSE for the PU case. Similar to the participant-specific case, the classic ratio-of-ratios method proposed by Scully et al. [5] achieved better MAE and RMSE results for the PD case, but its correlation was low, suggesting that it achieved low error by simply predicting a nearly constant SpO2 near the middle of the SpO2 range. The strong performance of Model 1 in the leave-one-participant-out experiment may imply that features extracted after combining the color channels at the beginning of the pipeline generalize better to unseen participants than features extracted before channel combination or through interleaving, as in Models 2 and 3.
In the participant-specific case, the model is specifically tailored to the test individual, whereas the leave-one-participant-out case is more difficult because the model needs to accommodate the variation in the population. As expected, in Fig. 9, we observe that the overall results from the leave-one-participant-out experiments do not match those from the participant-specific experiments. Because of the modest size of the dataset, the model has not seen data as diverse as a larger and richer dataset would offer. The generalization capability to new participants can be improved when more data is available.
We now revisit the two research questions raised in Section IV-B under the leave-one-participant-out scenario. First, we analyze the impact of skin type given the same side of the hand. From the right panel of Fig. 9(a), in the PD case, only 0.04% of the posterior distribution of the difference of means between the lighter and darker skin groups lies within the ROPE, suggesting that the null hypothesis is rejected and the darker skin group outperforms the lighter skin group. In the PU case, 18% of the posterior distribution falls within the ROPE. This observation is consistent with the participant-specific experiments: when using the back of the hand as the ROI, skin color is more credibly a factor in the accuracy of SpO2 estimation than when using the palm.
Second, we analyze the impact of the side of the hand for the two skin color groups. For the darker skin group shown in the right panel of Fig. 9(a), only 9% of the posterior distribution of the difference of means between the PU and PD cases falls in the ROPE. This indicates high uncertainty in the estimate of zero difference, which is consistent with the results from the participant-specific experiments. However, unlike the participant-specific experiments, for the lighter skin group, only 0.2% of the posterior distribution of the difference of means between the PU and PD cases falls in the ROPE. This suggests that the null hypothesis is rejected and that PU outperforms PD for the lighter skin group. As for the mixed group illustrated in the right panel of Fig. 9(b), only 8% of the posterior distribution of the difference of means falls in the ROPE, suggesting high uncertainty in concluding that the PU and PD cases are comparable.
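As a rough sketch of how such a ROPE check can be computed, the snippet below approximates the posterior of the difference of group means with a simple bootstrap (an assumption for illustration; the paper's exact Bayesian model may differ) and reports the fraction of posterior mass inside the ROPE of [−0.03, 0.03].

```python
import numpy as np

def rope_coverage(group_a, group_b, rope=(-0.03, 0.03), n_draws=10000, seed=0):
    """Fraction of the (bootstrap-approximated) posterior of the difference
    of group means that falls inside the ROPE."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(group_a), np.asarray(group_b)
    diffs = np.empty(n_draws)
    for i in range(n_draws):
        # Resample each group with replacement and record the mean difference.
        diffs[i] = rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
    return float(np.mean((diffs >= rope[0]) & (diffs <= rope[1])))

# Example with synthetic per-participant correlation scores for two groups:
rng = np.random.default_rng(1)
lighter = rng.normal(0.40, 0.05, 12)
darker = rng.normal(0.55, 0.05, 12)
cov = rope_coverage(lighter, darker)   # near 0 -> null rejected; large -> undecided
```

A coverage near zero supports rejecting the null hypothesis of equal means, while a large coverage leaves the comparison undecided.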
The different generalization capability in the PU and PD cases may be attributed to the skin color difference between the palm and the back of the hand. The back of the hand tends to be darker than the palm and shows larger color variation among participants due to different degrees of sunlight exposure, whereas the color variation of the palms is much milder. Furthermore, in the participant-specific experiments, the individualized models learn the traits of the skin type and the side of the hand from each participant, whereas in the leave-one-participant-out experiments, the learned model must capture the general characteristics of the population.

D. Ablation Studies
To justify the use of nonlinear channel combinations and convolutional layers for temporal feature extraction in our proposed models, we conduct two ablation studies comparing these model components with generic alternatives. We focus on the PU case to avoid the uncontrolled impact of factors such as skin tone and hair. In the first ablation study, we compare nonlinear with linear channel combination: we create a variant of Model 1 with only a single linear channel combination layer and no activation function, and repeat the leave-one-participant-out experiments. In the second study, we compare convolutional layers for temporal feature extraction with fully connected dense layers, creating a second variant of Model 1 and again repeating the leave-one-participant-out experiments.
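The two ablated components can be illustrated with a minimal NumPy sketch (shapes and layer sizes are arbitrary assumptions, not the actual Model 1 configuration): a frame-wise channel combination with or without a nonlinearity, and temporal feature extraction via a 1-D convolution versus a dense layer over time.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 3))   # one RGB time series: 300 frames x 3 channels

# Channel combination, applied frame-wise (like a 1x1 convolution over channels).
W1 = rng.normal(size=(3, 8))
def combine_linear(x):          # ablation variant: no activation function
    return x @ W1
def combine_nonlinear(x):       # proposed flavor: add a ReLU nonlinearity
    return np.maximum(x @ W1, 0.0)

# Temporal feature extraction on the combined signal.
k = rng.normal(size=5)
def temporal_conv(s):           # proposed flavor: 1-D convolution per feature
    return np.stack([np.convolve(s[:, j], k, mode="valid")
                     for j in range(s.shape[1])], axis=1)
Wd = rng.normal(size=(300, 64))
def temporal_dense(s):          # ablation variant: dense layer over the time axis
    return s.T @ Wd             # (features x time) @ (time x 64)

feat = temporal_conv(combine_nonlinear(x))   # the Model 1-style path
```

Swapping `combine_nonlinear` for `combine_linear`, or `temporal_conv` for `temporal_dense`, reproduces the structure of the two ablation variants.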
Table IV presents the medians and IQRs for the numerical comparison in the ablation studies. First, comparing the first and third rows of Table IV for ablation study 1, our proposed Model 1 achieves a better correlation (median 0.46, IQR 0.36) and a better RMSE (median 2.32, IQR 0.87) than its linear channel combination variant. Model 1 also achieves a comparable MAE, with a better median of 1.97 but a wider IQR of 0.80. The overall better performance of Model 1 suggests the necessity of the nonlinear channel combination. Second, in ablation study 2, we compare the second and third rows of Table IV. Model 1 outperforms its second variant with fully connected layers for feature extraction, with better medians in terms of correlation (0.46 vs. 0.41), MAE (1.97 vs. 2.29), and RMSE (2.32 vs. 2.66), and a narrower IQR of correlation. This suggests that convolutional layers are better suited than fully connected layers for temporal feature extraction.

A. Contact-Based Dataset Testing
We also test our models on the publicly available dataset gathered by Nemcova et al. for their SpO2 estimation work [19]. This dataset consists of contact-based smartphone video recordings in which a participant placed a finger on the smartphone camera while illuminated by the camera flashlight. Participants were asked to breathe normally without following any sophisticated breathing protocol. Each recording lasts about 10 to 20 seconds. The subject of each recording is not identified, so subject-specific and leave-one-participant-out experiments cannot be conducted. A single reference SpO2 value is associated with each recording. We used 14 recordings for training and seven for testing, and compared our results with the modified ratio-of-ratios method proposed in their paper.
As shown in Table V, Models 1 and 2 outperform the method of Nemcova et al. on both the training and test recordings. Model 3 is not able to generalize well from the training set to the test set, which may be due to the small size of the dataset. It should be noted that because the participants were not asked to follow any sophisticated breathing protocol, the dynamic range of SpO2 values is narrow. These results show that our CNN Models 1 and 2 work well for contact-based video recordings in addition to contactless video recordings.

B. Ability to Track SpO 2 Change
By employing the standard machine learning methodology of a training-validation-test split in Section IV to learn neural networks that perform well on unseen data, we have already ensured the generalizability of our models [43, Chapter 11]. As further evidence that our models output meaningful predictions, we compare the SpO2 predictions from our learned models with randomly generated SpO2 values. For each reference signal, a random prediction signal was generated by randomly choosing SpO2 values between the minimum and maximum values of the reference signal and applying a moving average window in the same way as applied to the neural network predictions.
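A minimal sketch of this random baseline, assuming uniform sampling and a 5-frame moving-average window (both illustrative choices not specified in the text):

```python
import numpy as np

def random_baseline(reference, window=5, seed=0):
    """Random SpO2 prediction bounded by the reference range, then smoothed
    with the same moving-average window used for the network predictions."""
    rng = np.random.default_rng(seed)
    raw = rng.uniform(reference.min(), reference.max(), size=len(reference))
    # Edge-pad before averaging so the output stays within the reference range.
    pad = (window // 2, window - 1 - window // 2)
    padded = np.pad(raw, pad, mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

reference = 95 + 3 * np.sin(np.linspace(0, 4 * np.pi, 120))  # toy SpO2 trace
baseline = random_baseline(reference)
rho = np.corrcoef(reference, baseline)[0, 1]   # typically near zero
```

Because the random values carry no information about the reference trend, the correlation of such a baseline clusters around zero, which is the comparison point used in Fig. 11.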

C. Visualizations of RGB Combination Weights
To understand and explain what our physiologically inspired models have learned, we conduct a separate investigation to visualize the learned weights for the RGB channels. Our goal is to understand the best way to combine the RGB channels for SpO2 prediction. Having an explainable model is important for a physiological prediction task like this. Our neural network models can be considered nonlinear approximations of the hypothetically true function that extracts the physiological features related to SpO2 buried in the RGB videos. The ratio-of-ratios method [4], [45], for example, is another such extractor, one that combines the information from the different color channels at the end of the pipeline. For this experiment, we use the modified version of Model 1 from the ablation studies that has only a single linear channel combination at the beginning. Seeing that a single linear channel combination did not significantly reduce model performance in the ablation studies, and noting that the linear component may dominate the Taylor expansion of a nonlinear function, we use only linear combinations for this model to facilitate more interpretable visualizations.
We trained 100 different instances of the model on the first two cycles from all recordings and tested on the third cycle from all recordings. The instances differ only in their randomly initialized weights. The channel weights learned by each model instance are visualized as a point representing the head of the linear combination vector in RGB space, colored according to the average test correlation achieved by that instance. Figs. 12(a) and 12(b) show the projections of these points onto the RB and RG planes. The subfigures reveal that the majority of the channel weights lie along certain lines in the RGB space. For the weights on the line, the ratio of the blue channel weight to the red channel weight is 0.87, and the ratio of the green channel weight to the red channel weight is 0.18. It is clear that the red and blue channels are the dominating factors for SpO2 prediction.
To further verify this result, we repeat the experiment under the following setup: instead of using the data from all participants, for each model instance we randomly select seven participants and use their data for training and testing. In this case, the model instances differ not only in their initialized weights but also in the random subset of participants they were trained on. Fig. 12(d) reveals that most of the better-performing instances (with ρ ≥ 0.45) have little contribution from the green channel. In Fig. 12(c), we again see that most of the points lie on a line in the RB plane; the ratio of the blue channel weight to the red channel weight for these points is 0.80.
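The ratio analysis above can be reproduced in a few lines: each trained instance contributes one 3-D channel-weight vector, and the blue/red and green/red weight ratios reveal the line structure in the RB and RG planes. The weights below are synthetic stand-ins placed near the reported ratios, since the actual learned weights are not available here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic learned weight vectors (R, G, B) near the line reported in the
# text (green/red ~ 0.18, blue/red ~ 0.87), with small perturbations.
r = rng.uniform(0.5, 1.5, size=100)
weights = np.stack([r, 0.18 * r, 0.87 * r], axis=1)
weights += rng.normal(0.0, 0.02, size=weights.shape)

g_over_r = weights[:, 1] / weights[:, 0]   # RG-plane slope per instance
b_over_r = weights[:, 2] / weights[:, 0]   # RB-plane slope per instance
# The medians recover the slopes of the line the points lie along.
slopes = (float(np.median(b_over_r)), float(np.median(g_over_r)))
```

Scattering `(weights[:, 0], weights[:, 2])` and `(weights[:, 0], weights[:, 1])`, colored by each instance's test correlation, yields plots of the kind shown in Fig. 12.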
These results are in accordance with the biophysical understanding of how light is absorbed by hemoglobin in the blood. Recall that Fig. 2 reveals a large difference between the extinction coefficients, i.e., the amount of light absorbed, of deoxygenated and oxygenated hemoglobin at the red wavelength. There is a significantly smaller difference at the blue wavelength and almost no difference at green. The amount of light absorbed influences the amount of light reflected, which can be measured by the camera. A larger difference in extinction coefficients makes it easier to measure the ratio of light absorbed by oxygenated vs. deoxygenated hemoglobin over time, and this ratio indicates the level of blood oxygen saturation. Therefore, from a physiological perspective, it makes sense for the neural networks to give the largest weight to the red channel, then the blue channel, and little weight to the green channel. These visualizations indicate that the models are learning physically meaningful features.
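For contrast with the learned channel weights, a bare-bones version of the ratio-of-ratios computation referenced earlier might look as follows; the AC/DC estimates and the calibration constants are simplified illustrations, not the coefficients used in [4], [45].

```python
import numpy as np

def ratio_of_ratios(red, blue):
    """red, blue: 1-D photoplethysmographic intensity time series.
    R = (AC/DC at the first wavelength) / (AC/DC at the second)."""
    ac_r, dc_r = red.std(), red.mean()
    ac_b, dc_b = blue.std(), blue.mean()
    return (ac_r / dc_r) / (ac_b / dc_b)

def spo2_from_r(R, a=110.0, b=25.0):
    # Illustrative linear calibration; real coefficients are fit empirically.
    return a - b * R

t = np.linspace(0, 10, 300)
red = 100 + 2.0 * np.sin(2 * np.pi * 1.2 * t)   # toy pulse signal at 1.2 Hz
blue = 80 + 1.5 * np.sin(2 * np.pi * 1.2 * t)
R = ratio_of_ratios(red, blue)
```

The key point is that R depends on the relative pulsatile amplitudes at two wavelengths with different oxygenation sensitivity, which is precisely the information the red and blue camera channels carry.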

VI. Conclusion and Future Work
In this paper, we have proposed the first CNN-based approach to the challenging problem of video-based remote SpO2 estimation. We have designed three optophysiologically inspired neural network architectures. In both participant-specific and leave-one-participant-out experiments, our models achieve better results than the state-of-the-art method. We have also analyzed the effect of skin color and the side of the hand on SpO2 estimation and found that in the leave-one-participant-out experiments, the side of the hand plays an important role, with better SpO2 estimation results achieved in the palm-up case for the lighter-skin group. Finally, we have demonstrated the explainability of our designed architectures by visualizing the weights for the RGB channel combinations learned by the neural network, and have confirmed that the choice of color band learned by the network is consistent with established optophysiological methods.
In future work, we plan to investigate the impact of decreasing the measurement duration, since a shorter measurement duration is generally preferred in clinical applications. Another direction is to investigate the effectiveness and performance of the proposed SpO2 estimation method in clinical or physiological applications while the participants are in motion.

Fig. 1. The overall pipeline of our proposed method. The hand is first recorded, and then the hand and skin pixels are segmented from the background. The average pixel value of the hand for each color channel is calculated frame by frame. This results in an RGB time series that is fed into the neural network model for SpO2 estimation.

Fig. 2. Extinction coefficient curves of hemoglobin showing the absorption properties at different wavelengths. The curves were plotted based on [7], [18]. The difference between oxygenated hemoglobin (HbO2) and deoxygenated hemoglobin (Hb) at the red and blue/infrared wavelengths means that these color channels contain useful information for SpO2 prediction by means of optophysiological principles.

Fig. 10. Posterior distribution of the difference of group means. This shows an example of an undecided case of the Bayesian statistical test, given that the ROPE of zero difference is set to [−0.03, 0.03] and 33% of the posterior distribution falls within the ROPE. The percentage of coverage can be used to quantify the certainty that two groups have the same mean.
Setting: Since each participant has two recordings for each of the PU and PD cases, one recording is used for training and validation of the model and the other for testing. An example of the training and validation prediction curves is shown in Fig. 7(a). Each recording contains three breathing cycles; for each training/validation recording, the first two breathing cycles are taken for training and the third cycle is used for validation. Splitting the recordings into cycles instead of randomly sampling the 10-sec overlapping RGB segments ensures that there are no overlapping segments of data between the training and validation sets. Example test prediction curves with their Pearson's correlation and mean absolute error (MAE) are shown for reference in Fig. 7(b). The Pearson's correlation shows how well the predicted signal follows the rising and falling trend in the reference signal. It should be noted that if the correlation is low, e.g., for a constant temporal estimate, then the MAE and RMSE metrics are less meaningful, as shown in Fig. 8(a). It is also possible for the correlation to be high while the absolute difference between the predicted and reference signals is large, as shown in Fig. 8(b). For the participant-specific experiments, due to the small dataset size, we augment the training and validation data by sampling with replacement, an example of the bootstrapping data-reuse strategy [35, Chapter 5]. The oversampling also helps address the imbalance in SpO2 data values shown in Fig. 6(b).
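The cycle-wise split and bootstrap oversampling described above can be sketched as follows (the segment counts and the oversampling factor are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy recording: 90 overlapping 10-s segments, labeled with their breathing cycle.
segments = np.arange(90)
cycle = np.repeat([1, 2, 3], 30)

# Cycle-wise split: the first two cycles train the model, the third validates
# it, guaranteeing that no overlapping segment straddles the split boundary.
train = segments[cycle <= 2]
val = segments[cycle == 3]

# Bootstrap oversampling: resample training segments with replacement, which
# augments the small dataset and can rebalance rare SpO2 values.
boot = rng.choice(train, size=4 * len(train), replace=True)
```

Sampling whole cycles rather than random segments is what prevents near-duplicate 10-s windows from appearing on both sides of the train/validation split.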

Fig. 11(a) shows a histogram of the correlations between the reference SpO2 signals and the randomly generated predictions, and Fig. 11(b) shows a histogram of correlations between the reference SpO2 signals and the predictions generated by Model 2. The neural network, with a median correlation of 0.41, outperforms random guessing, with a median correlation of −0.02, confirming Model 2's capability to track SpO2.

Fig. 3 .
Fig. 3. Proposed network structures for predicting an SpO2 level from a fixed-length segment of skin color signals. We highlight the differences among the three model configurations instead of showing the exact model structures. Model 1 combines the RGB channels before temporal feature extraction. Model 2 extracts the temporal features from each channel separately and fuses them toward the end. Model 3 interleaves color channel mixing and temporal feature extraction.

Fig. 5 .
Fig. 5. Setup for capturing hand videos. At the top, a smartphone records a single video of both hands. The left and right hands are in the palm-down (PD) and palm-up (PU) positions, respectively. The CMS-50E pulse oximeter clamped to the left index finger records reference SpO2 signals.

Fig. 6 .
Fig. 6. (a) Breathing protocol that participants were asked to follow, including 3 cycles of normal breathing and breath holding. (b) Histogram of SpO2 values in the collected dataset.

Fig. 7 .
Fig. 7. (a) Training vs. validation predictions. (b) Test predictions of varying performance with reference SpO2. The higher the Pearson's correlation, the better the prediction captures the reference SpO2 trend. The lower the MAE, the better the prediction captures the dips in SpO2.

Fig. 8 .
Fig. 8. Cases where the proposed model fails to accurately predict SpO2. (a) The model predicts a near-constant SpO2, which results in a small MAE but also a low correlation. (b) The model prediction follows the general trend of the reference SpO2 signal but does not capture the magnitude of the dips, resulting in a high correlation but also a large MAE.

Fig. 9 .
Fig. 9. Box plots comparing distributions of correlations for (a) lighter vs. darker skin types, and (b) PD vs. PU for all skin types. The PD results are better for darker skin tones in both the participant-specific and leave-one-out cases.

Fig. 11 .
Fig. 11. Histograms of correlation values between reference SpO2 signals and (a) randomly generated SpO2 signals, or (b) SpO2 signals predicted by neural network Model 2. The correlation distribution for Model 2 is centered much higher than that of the random guess, confirming Model 2's capability to track SpO2.

Fig. 12 .
Fig. 12. Learned RGB channel weights. Plots (a) and (b) are the channel weights learned by different model instances trained on the data of all study participants together, projected onto the RB and RG planes in RGB space. Plots (c) and (d) are the RB and RG projections of the learned channel weights for model instances trained on random subsets of the participants' data. Each point is color-coded according to the correlation ρ achieved by the instance.

Table II reveals that Model 2 achieves the best correlation in both the PD and PU cases, whereas Model 3 achieves the best MAE and a comparable correlation with Model 2, suggesting that Models 2 and 3 are comparably the best in individualized learning. Even though the method proposed in Scully et al.
When the null hypothesis is neither accepted nor rejected, the percentage of the posterior distribution of the group-mean difference inside the ROPE can be used to quantify the certainty that two group means are the same. One example is shown in Fig. 10.

TABLE II
Performance Comparison of Each Model Structure for Participant-Specific Experiments. Results are given as the test median and IQR over all participants.
IEEE J Biomed Health Inform. Author manuscript; available in PMC 2023 September 01.

TABLE III
Performance Comparison of Each Model Structure in Leave-One-Participant-Out Experiments. Results are given as the test median and IQR over all participants.

TABLE IV
Numerical Results of the Ablation Studies for Model 1 (M1) in the Leave-One-Participant-Out Mode. Comparisons are listed among the proposed M1 (nonlinear channel combination + convolutional layers for feature extraction), a modified M1 with only linear channel combinations, and a modified M1 with fully connected dense layers instead of convolutional layers. The ablation studies confirm that the nonlinear channel combinations and convolutional layers improve model performance.