Towards a Comprehensive Solution for a Vision-Based Digitized Neurological Examination

The ability to use digitally recorded and quantified neurological exam information is important to help healthcare systems deliver better care, in-person and via telehealth, as they compensate for a growing shortage of neurologists. Current neurological digital biomarker pipelines, however, are narrowed down to a specific neurological exam component or applied for assessing specific conditions. In this paper, we propose an accessible vision-based exam and documentation solution called Digitized Neurological Examination (DNE) to expand exam biomarker recording options and clinical applications using a smartphone/tablet. Through our DNE software, healthcare providers in clinical settings and people at home are enabled to video capture an examination while performing instructed neurological tests, including finger tapping, finger to finger, forearm roll, and stand-up and walk. Our modular design of the DNE software supports integrations of additional tests. The DNE extracts from the recorded examinations the 2D/3D human-body pose and quantifies kinematic and spatio-temporal features. The features are clinically relevant and allow clinicians to document and observe the quantified movements and the changes of these metrics over time. A web server and a user interface for recordings viewing and feature visualizations are available. DNE was evaluated on a collected dataset of 21 subjects containing normal and simulated-impaired movements. The overall accuracy of DNE is demonstrated by classifying the recorded movements using various machine learning models. Our tests show an accuracy beyond 90% for upper-limb tests and 80% for the stand-up and walk tests.


I. INTRODUCTION
THE burden and prevalence of neurological disorders [1] and the national shortage of neurologists [2] continue to grow hand in hand. This increases disparity through unequal access to clinical care and drives worsening clinician burnout rates. Meanwhile, the COVID-19 pandemic has boosted the transition from in-person to virtual neurological examinations [3], [4] through teleneurology (TN) platforms. Raplidly developing TN has shown potential in making efficient assessments remotely [5]- [7] and helping in distributing scarce healthcare resources and enhancing accessibility to neurological care [8], [9]. In addition, digital biomarker exam solutions with quantification of physical evaluations that bypass clinician availability and subjectivity of assessments [10] are important to improve care and compensate for the shortage of neurologists.
Current digital biomarker exam systems are devoted to a single neurological test [11]- [13], require advanced setups/equipment [14], or lack automated assessments [15], [16]. Therefore, a digital biomarker solution, 1) suitable for use by neurologists and nonneurologists, 2) with wide applicability at clinics or home, 3) that is easy to deploy, 4) supports a wide range of neurological tests, and 5) enables automated objective quantitative evaluations, would significantly advance health care delivery.
For this purpose, in this work, we introduce an end-to-end vision-based exam and documentation platform named Digitized Neurological Examination (DNE). As part of DNE, we designed an easy-to-use smartphone/tablet software with predefined examination instructions. The DNE software allows the users to video record their performance on several neurological screening examinations, including finger tapping (FT), finger to finger (FTF), forearm roll (FR), and stand-up and walk (SAW). These recordings are uploaded to a secure cloud-based storage. In an offline step, for each recording, 2D/3D pose, estimating the location of major human body keypoints, is extracted using deep-learningbased solutions such as OpenPose [17], and VideoPose3D [18]. From the estimated pose, unified digital biomarkers, including spatio-temporal and kinematic features, are computed [19]. We showcase the performance of our system on a dataset collected from 21 healthy subjects taking different neurological tests (FT, FTF, FR, SAW) when their function is normal or with a simulated impairment. We incorporate our defined features in a variety of machine learning models to detect abnormal functioning in our dataset. Fig. 1 illustrates the capabilities our DNE system. We summarize the key contributions of this work as:

•
We develop a unified and modular software package for high-quality DNE recording collection. Our DNE software is easy-to-use, allows the integration of new tests, and runs on handheld iOS devices. We also implement a web-based dashboard for viewing the recordings and feature visualization.

•
We propose a vision-based approach to study various neurological tests (FT, FTF, FR, and SAW). For each test, we define clinically interpretable kinematic and spatiotemporal quantified features.

•
To the best of our knowledge, we are the first to construct a vision-based dataset consisting of multiple neurological tests and simulated-impaired video recordings per subject alongside the extracted 2D/3D pose. Analyzing this dataset allows us to have a normal self-baseline for each abnormal recording and test the power of the extracted features in distinguishing normal from abnormal performance. Our dataset (excluding RGB videos due to privacy restrictions) and code will be available at https://dneproject.web.illinois.edu/.
The organization of this paper is as follows. Section II summarizes recent studies on digital biomarker systems. Section III describes DNE's software platform used in our data collection. Section IV introduces our DNE dataset. We define our features in detail in Section V. Section VI contains our analysis results while Section VIII draws our main conclusions.

II. RELATED WORK
In this section, we review the related literature to different tests (FT, FTF, FR, SAW). For each test, we briefly discuss the existing sensor, web/smartphone and vision-based solutions.

Finger Tapping (FT):
Sensor-based FT assessments study spectral analysis of gyroscope data [20], opening finger tap velocity captured by accelerometers [21], standard deviation, range and entropy measured by a collection of sensors including synchronized wrist watches, pressure sensors and accelerometers [22]. Several smartphone based applications [23]- [26] are designed to quantitatively evaluate various symptoms and motor skills in patients with Parkinson's Disease (PD). While these approaches are proven effective and low cost, their measurements are not as informative as vision-based methods, relying on video data and simulating in-person clinical examinations. Among vision-based pipelines, [11], [27]- [29] extract a set of kinematic interpretable features from the tracked positions of the fingers given an RGB video. These features are easy to explain and associate with clinical symptoms. On the other hand, black box deep learning models operating on the estimated finger poses and their derivatives are proposed in [30]. While these solutions provide high accuracy, unlike our DNE, they lack explainability and require large training sets to generalize and avoid overfitting.

Finger to Finger (FTF):
A well-studied test in the literature that is similar to FTF in terms of measuring smoothness and upper extremity coordination is the finger to nose test. Among sensor-based methods, Rodrigues et al. in [31] investigates the coordination ability of patients with chronic stroke versus healthy control using a complex marker-based motion analysis system. Oubre et al.
[32] studied ataxia through wearable inertial sensors and a computer tablet version of finger to nose test. Furthermore, predicting severity levels of ataxia or PD via a rapid web-based computer mouse test is explored in [33]. Jaroensri et al. [12] is among the first to propose vision-based solutions that are on par with a specialist in terms of rating the severity scale of PD while using estimated joint positions from recorded videos.

Upper Limb Tests:
To the best of our knowledge, sensor-based or vision-based studies related to the forearm roll task are scarce. Thus, here we further overview the existing methods devoted to the study of upper limb movements. Using wearable sensors, Cruz et al. in [14] assessed the acceleration, velocity or smoothness of the upper limb motor function of patients after stroke. A low-cost Kinect based solution, tracking subjects' hand when asked to move a marker on a rectangular pattern is proposed in [34]. The range of motion is analyzed using an internet-based goniometer in [35]. In [36], the authors describe a vision-based system that captures upper limb motions via multiple cameras installed at different views. While this multi-camera system is less sensitive to occlusions and dynamic backgrounds, unlike our DNE system, it requires a special setup which is hard to install for home-use.

Stand-up and Walk (SAW):
In our review of gait analysis literature, we focus on the marker-less [37] vision-based solutions, mainly measured using general handheld cameras and mobile devices. In early efforts for marker-less gait analysis, silhouettes are extensively used to detect heel-strike and toe-off occurrences. These two events refer to the first and last ground contact of each foot, later on adopted to accurately estimate important gait parameters [38]- [41]. However, these methods are restricted to specific laboratory settings and are sensitive to the quality of foreground/background segmentation. The surge of research in the human pose estimation field [42]- [44] brought along popular deep learning frameworks which accurately estimate the 2D/3D location of body joints from different inputs including RGB image, video and depth maps [17], [18], [45], [46]. Depth-map based gait assessment solutions relying on the estimated pose from either depth or RGBD [47], [48], have studied the rotational angle and angle velocity of certain body keypoints [49] and evaluated the spatio-temporal gait metrics such as step length and time [13], [50].
Wei et al. [16] introduced an automated smart-phone based video capturing system with hand/body pose estimation. While neurological exams such as gait are considered in [16], feature extraction and analysis is not studied and the main focus is on the quality control of the video acquisition process. Using the estimated pose from OpenPose [17], Xue et al. [13] studied the remote monitoring of gait parameters for senior care. Furthermore, [51] reports timings of different segments of the timed-up-and-go (TUG) test by performing frame-based activity classification based on 2D pose data. To assess the freezing of gait (FoG) symptom in Parkinson patients, [52] proposed the use of frequency analysis methods while [53] adopted graph convolutional neural networks to attain the probability of FoG from pose data. Kidziński in [54] employed black-box deep learning models to estimate the level of movement disorder in children suffering from cerebral palsy. Despite their promising results, deep learning based solutions are less interpretable and require large training supervised datasets for better generalization.

III. SYSTEM DESIGN
As part of DNE, we developed three software packages to maintain data acquisition, analysis and results report.

DNE Recorder:
This module accommodates easy-to-use self or assisted video recording on a set of predefined neurological tests. DNE Recorder is an iOS mobile application. It includes detailed instructions on how to perform each test alongside automated video capturing functions. Our software facilitates recording of high quality depth maps on devices equipped with LiDAR. We collect 1080 × 720 high-quality RGB, depth videos (upon applicable hardware) and camera calibration parameters at 60 frames per second (FPS). All recordings are synchronized into a secure cloud storage for offline processing. The user interface of this module is shown in Fig. 2(a).

DNE Analyzer:
We analyze the RGB recordings offline in a separate module. The main components of DNE Analyzer include 1) vision-based pose estimation, 2) feature extraction, 3) abnormality detection.

DNE Viewer:
We provide a secure web application for clinicians, neurologists and researchers to monitor raw recordings and view the analysis results from all subjects remotely. Fig. 2 (b) displays a screenshot of the DNE Viewer user interface.

IV. DATASET COLLECTION
Our dataset collection protocol is IRB approved (#IRB.1452500) on 02/27/2020 by the University of Illinois College of Medicine at Peoria Institute Review Board 1. In this study, 21 healthy volunteers (18 females/3 males) were recruited by sampling of convenience at the OSF HealthCare Illinois Neurological Institute Outpatient Neurology Clinic (Peoria, IL). Neurological examinations examine fine motor and mobility abilities. We study the FT, FTF, FR for fine motor tasks, and evaluate the mobility by the SAW test. Below we describe in detail how these tasks are performed.
• FT: Participants are instructed to put their hands within the camera view when their index fingers and thumbs were touched. Then they would start tapping them as big open and close, and fast as they could for 15 seconds.
• FR: Participants are asked to gently clench their hands, hold their forearms horizontally, and roll their hands around each other as fast as possible for 15 seconds.
• FTF: Participants repetitively first point their index fingers towards the ceiling and then touch their fingers together in front of their chests for a duration of 15 seconds.
• SAW: Participants stand from a sitting pose in a chair, move the chair out of the way, walk back and forth 15 feet. The designated time for SAW test is 45 seconds.
Each subject took two sets of neurological examinations supervised by a neurologist. In the first set of examinations, the subjects performed the tasks normally. However, for the second set, the subjects were asked to simulate motor dysfunction, i.e. perform the test abnormally. For this purpose, the subjects wore devices to deliberately add disruption to their performance and mimic impairments. For FT, a rubber band is used to restrict movements of the index and thumb fingers. For the FR and SAW tests the subjects put on a left wrist and a knee brace, respectively. On the other hand, for the FTF test, the subjects were asked to deliberately mimic a tremor pattern in moving their fingers and hands. Snapshots of recordings and subjects wearing the devices are exhibited in Fig. 3.
Both set of recordings are acquired by our DNE Recorder on iPad 11 Pro and iPhone 11 devices. For upper body tests, we have a close-up frontal view of the subjects with visible pelvis. Moreover, to assess the invariance of our analysis under small deviations from the frontal camera view, the view of the recordings taken on iPhone is slightly to the left compared to the iPad recordings. In addition, for the SAW, we record both saggital and frontal views, using iPad and iPhone, respectively. In total, including all four tests (FR, FT, FTF, SAW), we collect 375 videos. Table I provides a summary of our dataset.
While there is hardly any similar publicly available upper-body neurological related dataset, there are several datasets studying gait impairments specifically in [13], [38], [39], [52], [54]. The closest to our dataset is KIMORE [55] focusing on rehabilitation exercises rather than neurological tests. The KIMORE provides RGB, depth, and pose data for each recording, collected by Kinect v2 which is not as ubiquitous as handheld devices adopted in DNE. In Table II, we compare our dataset versus state of the art public gait impairment datasets in various aspects. For this comparison, we only focus on studies using a single-view, portable camera for data collection, similar to our setting. Accordingly, we list the contributions introduced by our dataset as: 1) This is the first public dataset studying multiple neurological test segments. 2) Our dataset includes normal and abnormal performance of the same task for each particular subject. 3) Our dataset contains multiple data modalities, including depth videos, camera parameters, and 2D/3D pose estimation.

V. DNE VISION-BASED ANALYSIS
In our DNE analysis pipeline, given an RGB video, we first compute the human pose in each frame. Next, from the pose time series, we extract a set of features that quantify the subject's performance in various aspects. We structure our analysis pipeline into three layers, namely 1) pose estimation, 2) feature extraction, and 3) application layer, as illustrated in Fig. 4. The pose estimation layer provides frame-level high-quality 2D/3D joint locations (Section V-A). We pre-process the estimated pose to prepare it for feature computation. In the feature extraction layer, we calculate a set of features that describe subject's performance on various tests. We carefully design these features for each test separately to accurately reflect the subjects' performance and dedicated abnormalities. Lastly, the application layer contains several downstream tasks consuming the features, including abnormality detection and visualization for a qualitative comparison among recordings.

A. Pose Estimation
For upper body tests (FT, FTF, and FR), we use OpenPose (OP) to estimate the 2D hand [56] and body [17] pose. On the other hand, for SAW tests, we compute the 3D pose using the VideoPose3D (VP3D) package [18]. Given an RGB image, OP first detects all visible body parts and associates them to each individual by solving a graph matching problem. Meanwhile, VP3D adopts dilated temporal convolution to estimate 3D pose from sequence of 2D keypoints extracted from the video.
For upper body tests, if the subject and the moving limb is located parallel to the camera plane, then the motion is well approximated in a plane, i.e. in two dimensions. That is why 2D pose is chosen for upper body tests. However, this might not hold for the SAW test (especially depending on the camera view), hence urging us to use 3D pose for this analysis.

B. Pre-processing
We truncate a recording to only include the sequence of frames that are related to the subject performing the test. To account for variable distance of the subjects from the camera, we normalize the estimated pose by a reference length. For FT, FTF and FR tests, the reference is the length of the forearm. For SAW, the reference is the distance between the pelvis and neck joints. We compute the reference lengths as the median of the value across all the frames. In addition, as the estimated pose can be erroneous at some frames we use median and Savitzky-Golay filtering [58]. In our dataset, we have excluded 27 recordings due to unreliable and noisy estimated pose. Therefore, we only analyzed 348 videos in total.

C. Notations
Given the pose sequence estimated from the RGB video, we extract a set of quantified features. Below, we first express our notations and then introduce the features we defined for each test. Let υ = [υ 1 ,...,υ N ] denote the set of N frames ordered chronologically in video υ. There is a one-to-one correspondence between the time associated with each frame and the frame index, where t = [t 1 ,...,t N ] and t i = i/fps, fps denoting the frame per second rate of the video. Given υ and the pose estimation module (such as OP or VP3D), we extract the location of K keypoints in each frame. For convenience, we use the same indexing of the body joints for both 2D and 3D pose. However, to differentiate between the 2D and 3D pose, we denote each by B 2 and B 3 , respectively. Furthermore, we use H 2 to represent the 2D hand keypoints. An illustration of the hand and body skeleton trees alongside our indexing notations are provided in Fig. 5. Note that, for the sake of brevity, we have only indexed a subset of the keypoints that we are using in our analysis.
We reserve s k,* [i] for the location of the k-th keypoint at frame i, corresponding to skeleton tree * ∈ {H 2 , B 2 , B 3 }. For * ∈ {H 2 , B 2 }, s k* [i] ∈ ℝ 2 and for * = B 3 , s k,* [i] ∈ ℝ 3 . Furthermore, we add superscript r and l to point to right and left (R/L) body parts, respectively. For example, s 3, H 2 r i locates the tip of the right thumb at frame i.
To extract kinematic features that quantify the performance of a subject in a test, we track the location of various major keypoints and define a set of features accordingly. Major keypoints vary based on the test. For instance, the major keypoints in FT include the tip of the index and thumb fingers of two hands while in FR, we closely track the wrist joints.
In different tests, the subjects are asked to move certain limbs repeatedly. Thus, it is natural to compute features such as frequency, and amplitude for periodic pose patterns and report the mean and standard deviation (STD) across different cycles. In addition, for a test performed normally, the features corresponding to the R/L body parts should be close. Thus, to quantify the difference between the right f r and left f l features, we define an asymmetry metric as: Another useful metric in our analysis is Pearson correlation coefficient denoted by CC. For two 1D discrete time series x 1 and x 2 , we define CC as: where . and . T denote the mean and transpose operators. For highly correlated series, |CC| is close to one.

D. Feature Definition
We list the features defined for various tests in Table III and describe them in detail below.

Finger Tapping (FT):
For this test, the major keypoints are the tip of the R/L thumb and index fingers alongside R/L wrist and elbow joints. To extract properties of the periodic motion, we examine the distance between the tip of the index and thumb fingers across time defined as: Examples of d ft r and d ft l for normal and abnormal executions of the FT test are provided in Fig. 6. In our dataset, to simulate abnormality in FT the subjects are wearing a rubber band around index and thumb fingers of one hand. As also revealed in Fig. 6, this limits the tapping amplitude of the hand wearing the band and slows down the tapping rate. Given d ft * , we compute the period for the * hand, T ft * , as the time (in seconds) between two consecutive local minima (or maxima) of d ft * . Frequency F ft * is the reciprocal of T ft * . We also report the magnitude of finger-tapping A ft * as the difference in consecutive minima and maxima of d ft * .
We also report the asymmetry of the periods (Asym(T ft r , T ft l )), frequencies (Asym(F ft r , F ft l )) and amplitudes (Asym(A ft r , A ft l )) of R/L hands following (1).
Furthermore, we define the instant tapping speed and acceleration for R/L hands as the first and second order derivatives of d ft r and d ft l with respect to time. We adopt mean and maximum of instant speed and acceleration across tapping cycles as features. We also introduce average tapping rate as the average number of finger taps per second.
Finally, to evaluate the stability of the hands and arms during the FT recording, we examine the wrist and elbow joints. For this purpose, we introduce the relative height between (s 7 r , B 2 , where s 5, H 2 * † = s 5, H 2 * i † i = 1 N , † ∈ x, y and * ∈ {r, l}, is the x or y coordinates of the pose series. We also compute the period and average speed. We derive the average speed by dividing the traversed distance of R/L finger within half a cycle's period by half the cycle's period. Patients with neurological impairments tend to have tremors while moving their fingers during FTF test [59]. This leads to a deviation of the fingers' trajectory from a smooth curve. To characterize this deviation, we first fit a smooth curve to the fingers' trajectory, in the form of a second order polynomial in terms of the x and y coordinates. We observe that fitting a second order function to the trajectories, well matches the FTF trajectories of normal subjects. We consider the length of this smooth curve as a reference to compare against the length of the original fingers' trajectory. We then define the ratio of the length of the actual fingers' trajectory during each FTF cycle by the length of the fitted smooth curve as path smoothness metric (PS). We report PS for R/L hands. Examples of normal and abnormal finger trajectories alongside the smooth fitted curves are plotted in Fig. 7(b).
Another feature we found helpful in detecting abnormal function in FTF is instant velocity. We derive the instant velocity vector by the first derivative of the horizontal and vertical pose with respect to time. We then examine the angle between the vertical and horizontal components of this vector on the R/L hands. At time instant t, the velocity angle θ is: , * ∈ r, l . (8) Next, for each hand, we compare θ across different cycles using CC in (2). Given N C number of cycles, we have N C 2 CC values assessing the symmetry of the R/L velocity angles across different cycles, which we summarize by reporting the mean and STD. Examples of normal and abnormal aligned velocity angles across different cycles are provided in Fig. 7(c). Note that for abnormal FTF, large magnitude fluctuations, caused by tremors in moving the hands, visibly appear in θ.

Forearm Rolling (FR):
We include the wrist and elbow joints as the major keypoints for this test. We specifically attend to the vertical coordinate of the wrist joints to compute period T ft * and amplitudes A fr * for * ∈ {r, l}. Fig. 8 illustrates the vertical position of the R/L wrists for a normal and abnormal example. Note that, due to wearing the device in the abnormal recording, the period of the forearm roll cycles for both R/L hands are larger compared to its normal counterpart. In addition, similar to FT, we include the asymmetry of the aforementioned metrics in the FR features.
We also include the maximum instant speed and acceleration derived from vertical coordinates of the wrist joints. Similar to FT, we define rolling speed and rate. Rolling speed is computed as the difference between the minimum and maximum of y coordinate of the R/L hands divided by half the rolling period. Also, rolling rate is defined as the number of rolling cycles per second. Finally, we report the stability of the elbows C fr elbow and define it analogous to (5).

Stand-up and Walk (SAW):
We use the side-view SAW recordings in our analysis of SAW test. For SAW pose estimation, we use VP3D [18]. In VP3D, the joint locations are defined relative to the pelvis joint. As a result, estimated pose by VP3D misses the global position of subjects within a frame which is essential to detect different segments of the SAW test, i.e. stand-up (SU), walk (W), and turn (TU). This urged us to track the 2D position of the pelvis s 0,B2 extracted by OP as a notion of subject's global position in a video frame. Analyzing this position through time enables us to split a SAW recording into multiple non-overlapping SU, W, and TU segments. Supplementary Fig. S3 visualizes these segments.
For the SU segment of SAW, we focus on the time to stand [60], measured by the total time taken from the first SU effort to a full standing on feet state. We derive time to stand by thresholding the magnitude of the pelvis joint's velocity. Note that, since our subjects are asked to walk back and forth a designated room multiple times, at some points, they have to change direction and turn around. We report time to turn around as another indicative feature for SAW test.
The first set of features derived for the walking segment are obtained based on the distance between the two feet stated as: Note that, the periodic nature of a normal gait also reflects in d saw (see Fig. 9(a)). Given d saw , we highlight different W and TU segments in Fig. 9(a). For a gait pattern derived based on d saw , step time is the time to complete one step and computed as the time difference between two consecutive local maxima of d saw . Meanwhile, step length defined as linear distance between two successive placements of the same foot [61] manifests as the local maxima of d saw . The step width, on the other hand, is interpreted as the local minima of d saw . The calculations of these features in turning segments are excluded.
As two global features for gait, we report mean and STD of cadence and average speed across all W segments. We compute cadence as the number of steps divided by the duration of a walking segment. Average speed is determined by the total traveled distance of the pelvis joint divided by the duration of a walking segment.
To evaluate the symmetry of the R/L gait, we introduce the cross correlation between the knee angle series of R/L legs, denoted by S saw knee angle . We find this feature a good descriptive of gait abnormality, as in our recordings, gait abnormality is introduced through wearing a knee band which limits the knee motion (Fig. 3). For each frame, we define the knee angle as the angle between Examples of aligned normal and abnormal knee angles for R/L legs are shown in Fig. 9(b). For normal gait, the R/L knee angles are highly correlated after alignment ( Fig. 9(b) top row), while this does not hold for abnormal gait ( Fig. 9(b) bottom row).
In addition, we define step symmetry between the R/L feet movements by comparing the horizontal position of R/L feet at different gait cycles. We represent this metric by S saw feet-x . To compute S saw feet-x , similar to S saw knee angle , we first align the R/L horizontal positions within each gait stride and report the CC of the aligned series. We report mean and STD for both S saw feet-x and S saw knee angle across different cycles.

A. Subject-based Normal Vs. Abnormal Comparison
In this section, we compare the normal and simulated-impaired performances of the same subject and show that this analysis is insensitive to the choice of recording device and robust to the viewpoint or distance from the camera. Note that in our dataset, for each subject, we have four sets of recordings. Two of these recordings capture the normal performance of the test, while in the other two, the subject is asked to perform abnormally. In addition, two pairs of normal/abnormal recordings are captured by an iPhone (P) and an iPad (T). Let N P /N T and A P /A T denote the normal and abnormal recordings captured by iPhone/iPad.
For each feature and subject, we define A-A/N-N as the intra-class distance between the features derived from the abnormal/normal recordings of the subject captured on iPhone and iPad devices. In other words, A-A is the distance between features computed for A T and A P recordings, while N-N marks the difference between the features of N T and N P videos. For N-A, we consider the distance between A T -N P and N T -A P pairs and report the average. We normalize the A-A, N-N, and N-A distances by the maximum of N-A distances.  Table III and normalize them before passing to PCA. Fig. 11 showcases the results for different tests. It is observed that the normal and abnormal recordings are separated in dimension reduced feature space. This implies that our defined features are descriptive and well differentiate normal from abnormal. Fig. 12 we compare the distribution of normal versus abnormal features for FT, FTF, FR, and SAW tests. These plots clearly indicate the difference in distribution between two classes. Normal features are more concentrated in a specific range, however the abnormal features are often less regular and have a higher STD.

Abnormal Class Distribution: In
Abnormality Detection: We assess the normal and abnormal classification performance using our features. Therefore, we utilize several machine learning (ML) models that are grouped into: 1) tree-based methods such as Random Forest (RF), Gradient-Boosting Machine (GBM) [62], XGBoost [63] and 2) parametric models trained using gradientdescent updates, including Logistic Regression (LR), Support Vector Machine with radial basis function (RBF) kernel (RSVM) and Multi-layer Perceptron (MLP) with rectified linear unit (ReLU) activation.
We also benchmarked our ML classification performance against two deep learning (DL) baselines. Both DL models predict normal versus abnormal based on major keypoint pose sequence, unlike the ML based models which perform classification on the extracted spatiotemporal/kinematic features. In the first DL baseline, we adopt a long-short term memory (LSTM) [64]  We perform 5-fold cross validation and summarize the average classification performance of all ML and DL models in Table IV. While all models perform well for various tests, among ML models RSVM and GBM/XGBoost tend to perform better on most metrics. However, the gap between the performance of all ML models is not significant. This suggests that the extracted set of features well-distinguish normal from abnormal samples.
Furthermore, comparing ML and DL models, we notice that: 1) While DL models perform well on FT, FR and FTF tasks (especially for video-based splitting), they are lagging behind ML models for SAW. We attribute this to the fact that SAW involves more complex motion patterns. Therefore, DL models require larger datasets to be able to learn the classification task from the pose data. 2) DL features extracted from the pose data lack clinical interpretability. 3) For subject-based splitting, ML models operating on the spatio-temporal/kinematic features outperform DL models on most metrics. This indicates better generalization capability of our features on unseen subjects compared to DL models operating on pose data.

C. Feature Importance Analysis
One benefit of tree-based models is in the tractable decision-making process. Therefore, we investigate the importance of each feature, contributing to the decision process by analyzing our RF models. This analysis gives us the weight of all features, sorted in descending order in Supplementary Fig. S4.
We notice that symmetry between specific R/L features for FT, FTF, and SAW tests is considered the most important, i.e., with the largest weight. For the SAW test, the most important feature is the similarity between the knee angle time series across different cycles (S saw knee angle ) while for FT ( Supplementary Fig. S4(a)) and FTF ( Supplementary Fig. S4(c)), the features with the largest weights are frequency asymmetry and horizontal (S ftf finger-x ) symmetry, respectively. Although this can be attributed to the nature of the simulated impairments in our dataset, it is consistent with the clinical practice, where the left and right asymmetry is a common biomarker [66]- [68] of different neurological disorders.
Furthermore, temporal and spatial features that characterize the periodic behavior of the movement are important metrics that the decision tree classification models rely on. Examples of these features are amplitude and period for FT, FTF, and FR tests, step length, width, and step time for SAW. We also notice that for a subset of features, having large variations (i.e. STD) across different cycles is another indicator of abnormal performance in our dataset. This is captured in the large weight associated with STD values of some features for various tests. This result also affirms our observations in Fig. 12.

VII. DISCUSSION & CHALLENGES
In this section, we discuss various aspects of DNE including feature design, robustness, clinical relevance and application as well as the current challenges and our proposed solutions.

A. Discussion
Feature Design: The main goal of our DNE system is to provide an objective tool for quantifying and documentation of recordings of neurological tests. Thus, it is critical to design a set of clinical interpretable features that explain the performance of a subject on various motor tasks. In addition, having powerful digital biomarkers reduces the workload of normal versus abnormal classification models and improves their generalization, especially when large training datasets are not available. Furthermore, unlike black-box DL models, the explainability of our diverse set of features allows clinicians to better understand and track patients' status over time.
Robustness: DNE is resilient to changes in slight deviations from the camera view, distance to the camera, subject clothing, and mild pixel intensity changes due to intermediate data standardization and robust pose estimation steps (Section V-A and V-B). This is experimentally shown by the low intra-class feature distances in Fig. 10. Data normalization and filtering in the pre-processing step also helps in eliminating noise and propagated errors from the pose estimation module.
In FT, FR and SAW tests, the abnormality in the motion is imposed by wearing equipment which are visible in the recordings. The pose estimation models we have used (OP and VP3D) are robust to the appearance of the equipment and can accurately predict the joint locations regardless of the presence of the equipment. The features incorporated in the classification tasks are derived from the pose data. Therefore, the quantified features and the classification performance is not affected by the visual cues from the equipment.
Clinical Relevance: In our dataset, the abnormalities in the movements of the subjects were simulated. The simulated impairment in the FR test is the closest to what is observed in clinics for patients with neurological disorders. In the simulated impairment for FR, the arm with no moulage satellites around the weighted wrist, causing a decrease in the orbit frequency (Fig.8). This is coherent with the clinical observations of patients with neurological impairments.
In the FTF test, the simulated abnormality would be more realistic, if the tremor or inaccuracy of movement increased as the finger got closer to its target (i.e. when the two fingers approach). In our current dataset, the subjects often simulated the tremor throughout their movements which is only seen in severe cases. In addition, for the FT test, often the abnormality is a combination of decreased amplitude and rate (Fig. 6) and in Parkinson's decrements of both. In our DNE dataset, some subjects simulated more of one or the other.
In SAW, the abnormality in real patients appear as a combination of slow time-to-stand, decreased step length, increased step-time, and asymmetry of gait features. In our dataset, the abnormality was imposed by wearing a knee brace. Alongside asymmetry between the R/L knee angles, we observed decreased step-length for the subjects wearing the knee brace ( Fig. 12-SAW). These are in-line with clinical observations from real patients.
Overall, features that clinicians observe were disrupted from normal findings to various degrees, although the pattern of disruption of features may have not been exact for a specific condition. We showed that DNE was able to define clinically interpretable features and detect differences between normal and simulated impaired recordings. As future work, to expand its clinical impact, we will focus our analysis on real patients with various neurological impairment severity levels, and with other neurological tests, such as eye movement [69], facial activation [70], [71], or phonation [72].

Clinical Application:
The initial clinical application of DNE is measuring and documenting features of various neurological exams. This would allow for improved communication of objective exam quantification and the ability to assess for changes over time. As future work, with clinicians' supervision, we will examine and report the performance of DNE on real patients. A longer term goal is to assist clinicians with classification of recordings and provide a platform for longitudinal monitoring of patients.

B. Challenges
Depth Ambiguity: Analyzing human motion from 2D RGB data requires dealing with uncertainties associated with lacking depth information. Furthermore, depth ambiguity becomes a more prominent challenge for the SAW test with frontal view recordings rather than sagittal view. It also avoids defining the spatial features in their absolute units. Currently, to mitigate the issues corresponding to these depth uncertainties, for upper limb tests, the subjects are asked to perform the tests while facing the camera and (roughly) in parallel to its image plane. In our processing steps, we also perform pose normalizations to compensate for scale variations due to variable distance from the camera. To further address this issue, we believe incorporating LiDAR depth maps captured by recent iOS devices, in the pose estimation step can prove helpful.
Self-baselining: Natural motion properties differ across various subjects. For example, one subject can be inherently slower or have less strength in performing some tests. In our dataset, we witnessed while some subjects had a slower inherent speed in their normal performance, they were mistakenly classified as abnormal. This highlights the importance of taking into account the history of a subject and self-baselining. In our experiments, we show cased an example of self-comparisons of normal and abnormal performance of the same subject (Fig. 10). The purpose of this study was to show the ability of our designed features to discriminate between the varying status of the subject at different test times. This result validates the potential of our DNE pipeline as a personalized medical assessment system.
Real-time DNE: Our current DNE system and the extracted kinematic/spatio temporal features rely on tracking the human pose from the video recordings in an offline step using off-the-shelf pose estimation modules. Currently, the pose estimation step is the most computationally expensive step, hindering real time processing and feature extraction. To address this challenge, on-device lighter pose estimation models (with small sacrifice on the accuracy), that focus on extracting major keypoints rather than the whole body pose are necessary.

VIII. CONCLUSION
In this paper, we proposed a comprehensive vision-based digital biomarker exam solution named Digitized Neurological Examination (DNE). Using DNE software, users video record their performance on various motor tasks, including finger tapping, finger to finger, forearm roll, and stand-up and walk. We introduced the DNE dataset, a total of 375 videos consisting of normal and impaired functions of 21 subjects, performing different tests. For each recording, 2D/3D pose is estimated and used to quantify kinematic and spatiotemporal features. These features form a set of digital biomarkers that can be 1) accurately obtained from common RGB videos with minimal calibration, 2) used to track the clinical changes across recordings at different time points. On our DNE dataset, we analyzed the effectiveness of the defined features in differentiating normal versus impaired simulated videos per and across subjects. Our results demonstrate high classification accuracy and F1 scores using a variety of machine learning models. Future work will extend the setting of this study to a larger set of subjects with a diverse range of abnormalities. Illustration of our digitized neurological exam system.    Examples of DNE dataset recordings. Impairments are induced by wearing a wrist brace for FR, a rubber band for FT and a knee brace for SAW tests.  Examples of SAW features.  The inter-class and intra-class distances between some features of normal (N) and abnormal (A) FTF recordings. ♦ denotes the mean value. PCA analysis of FT, FTF, FR, and SAW tests. Green crosses and red circles stand for normal and abnormal recordings. All subplots share the same axis. Distribution of normal/abnormal features for FT, FTF, FR, SAW tests plotted in first, second, third and last rows. We used kernel density estimation to fit distributions to the data.    The CC of the aligned R/L knee angle series within a walking segment (a full pass of the room length) Step symmetry Mean, STD, Median The cycle-wise CC of the aligned spatial trajectory of the R/L foot in the horizontal axis Step length Mean, STD, Median The furthest distance between two feet within each step Step width Mean, STD, Median The shortest distance between two feet within each step Step