Finger Joint Angle Estimation With Visual Attention for Rehabilitation Support: A Case Study of the Chopsticks Manipulation Test

Most East Asian rehabilitation centers offer chopsticks manipulation tests (CMT). In addition to impaired hand function, approximately two-thirds of stroke survivors have visual impairment related to eye movement. This article investigates the significance of combining finger joint angle estimation and a visual attention measurement in CMT. We present a multiscopic framework that consists of microscopic, mesoscopic, and macroscopic levels. We develop a feature extraction technique to build the finger kinematic model at the microscopic level. At the mesoscopic level, we propose an active perception ability to detect the position and geometry of the finger on the chopsticks. The proposed framework estimates the proximal interphalangeal (PIP) joint angle on the index finger during CMT using fully connected cascade neural networks (FCC-NN). At the macroscopic level, we implement a cognitive ability by measuring visual attention during CMT. We further evaluate the proposed framework with a conventional test that counts the number of peanuts (NP) which are moved from one bowl to another using chopsticks within a particular time frame. We introduce three evaluation indices, namely joint angle estimation movement (JAEM), chopstick attention movement (CAM), and chopstick tip movement (CTM), by detecting the local minima and maxima of the time series data. According to the experiment results, the velocity of these three evaluation indices could indicate improvement in hand and eye function during CMT. We expect this study to benefit therapists and researchers by providing valuable information that is not accessible in the clinic. Code and datasets are available online at https://github.com/anom-tmu/cmt-attention.

(CMT) to assess the skill components of the therapeutic chopsticks task [4] and their relationship with the hand function tests. The CMT has been widely used to evaluate hand dexterity when conducting tabletop activities.
Adults potentially face a decline in hand function after 65 years of age. This decline is primarily due to age-related degenerative changes in the musculoskeletal, vascular, and nervous systems [5]. It is often accompanied by pathological conditions such as stroke [6] and other neurological diseases. Researchers have examined whether chopsticks manipulation among stroke patients can increase levels of eating independence [7]. On the other hand, about two-thirds of stroke survivors have a visual impairment [8] related to visual field loss, double vision, and perceptual problems. This impairment contributes to the difficulty some people experience in gripping chopsticks in everyday life [9]. As a result, CMT measurements become less valid when visual attention is not considered [10]. Combining hand function and eye-tracking measurements is therefore necessary. Upper limb movement and eye movement cannot be separated because they support each other in hand-eye coordination [11]. Recent studies show that gaze features [12] can be helpful in the identification of hand-eye coordination issues. It is necessary to define human visual attention and hand movement measurement during CMT, especially for a stroke patient with hand-eye movement problems. Table 1 shows the research on hand measurement systems using chopsticks.
Research on egocentric videos [13] has the potential to help this study by incorporating hand image data from first-person vision. Egocentric, first-person vision records the user's point of view [14] from a wearable camera. An egocentric video provides detailed information on hand posture, movements, and the objects or environment with which the hand interacts. The advantages of egocentric vision are reduced user privacy concerns [15], the ability to monitor people when they are mobile [16], and the ability to find attention during activities [17]. Egocentric vision research has recently begun to be developed for hand rehabilitation [6]. Unlike egocentric vision, visual attention refers to maintaining awareness of specific locations, objects, or attributes of a visual scene. Visual attention is usually obtained from sensors that detect the direction of the eyes and map a point on an image plane. Egocentric vision and visual attention shape human perception when performing activities requiring hand-eye coordination. However, their application has not been found in the case of CMT, which considers hand-eye coordination.
Therefore, our research aims to integrate finger measurement and visual attention for CMT assessment using a multiscopic framework. The main contributions of this paper are as follows. First, we apply a CMT measurement using egocentric vision-based feature extraction with marker detection. This framework is adopted to find the model parameters that efficiently describe a physical embodiment of CMT at the microscopic level. Second, we develop the active perception ability of hand-object interaction based on a task-dependent rule set. We use fully connected cascade neural networks (FCC-NN) to estimate the angle features for the proximal interphalangeal (PIP) joint angle on the index finger based on the chopsticks' positions. Third, we propose a system with a cognitive ability at the macroscopic level that can measure visual attention in relation to the chopstick tip. We use eye tracker glasses to detect where the eye is looking, then compare it with chopstick tip movements. We evaluate hand-eye coordination through a symbolic representation of hand-object interaction based on task domain knowledge at the macroscopic level. Figure 1 shows the design of a multiscopic framework for the CMT with visual attention.
This paper is structured as follows. Section 2 reviews the related works on the use of technology for CMT. Section 3 proposes our method for developing the multiscopic framework. Section 4 examines the results and discusses the effectiveness of the proposed method. Finally, section 5 presents the conclusions and suggestions for future works in relation to this research.

II. RELATED WORKS
A few studies related to CMT measurement have been conducted to examine the complexity of finger measurement analysis. Sensors as contact methods have been developed to support CMT measurement. Chia and Saakes [18] installed movement sensors on chopsticks to interact with children with a tablet. Nakamura et al. used an inertia measurement unit to obtain the chopsticks' state on their Senstick [19]. For specific CMT research, Sawamura et al. [20] utilized near-infrared spectroscopy (NIRS) to obtain measurements of brain activity in chopsticks-operation skills with the nondominant hand. Yokubo et al. [21] employed force sensors to determine the relationship between the chopstick tips and cross-sectional shape in developmental stages from infancy to early school years. Shimomura et al. [22] applied electromyography to measure how the motor controls the hand muscles, clarifying the effect of chopstick manipulation.
The development of cameras as non-contact methods has begun to be used to detect the chopsticks' movement. Okamoto and Yanai [23] use a smartphone camera to monitor chopstick actions during mealtime. Nakaoka et al. [24] used a camera attached to chopsticks for coloring paintings to encourage healthier eating habits. Namiki and Makino [25] implemented a camera and a contactless motion sensor with augmented reality to help beginners learn how to hold chopsticks. Wang and Zhang [26] designed smart chopsticks based on HSI color space recognition for dining reminders.
In other fields, psychology researchers believe that good eye movement also supports human hand skills [27]. Inspired by the nature of human vision, they found that humans learn from seeing things, and hand skills are influenced by their perception ability [28]. In other words, human dexterity in using tools [29] depends on hand-eye coordination [30], [31]. The measurement method is therefore more effective when it considers visual attention. These two capabilities must mutually support each other to achieve the goal: the eye functions as a sensor and coordinates with the hand, which functions as an actuator. Thus, a suitable measurement method for detecting a hand object manipulation function must consider eye performance. One measurable function is tracking eye movements [32], which implies visual attention.
Currently, it is challenging for therapists to practice CMT following the COVID-19 pandemic. Tele-rehabilitation using CMT is the right choice for monitoring hand function decline and measuring outcomes in patients independently. Several studies have applied cognitive-based measurement systems [33] to monitor daily activity, employing contact methods such as sensors on chopsticks [34]. However, the application of tele-rehabilitation is limited due to privacy issues at the social level, and the complexity of hand measurement analysis causes lower accuracy than the contact method [35] at the physical level.
Many activities involving hand-eye coordination still need to be investigated. Researchers have discussed the relationship between eye and hand movements. Sports science research [36] found that visual attention can indicate the performance of hands in basketball shooting. Another study compares eye movements with finger and hand movements in a mouse-click activity [37]. This study was in line with research on human vision, which shows that eye movement always precedes finger movement, especially the index finger [38]. Changes in eye movement were represented by actions of the point of visual attention, and hand movements were defined by changes in the PIP joint angle of the index finger.
Bosch et al. [39] showed that gaze features could be helpful in the identification of behavioral performance and visual strategies during chopstick skill acquisition. However, this study does not estimate the effect of visual attention on finger movement during chopstick manipulation. Thus, it is necessary to define human visual attention and hand movement during CMT, especially for a person with hand-eye movement problems [11]. CMT with visual attention is a new measurement framework that considers eye and finger movement when using chopsticks. Combining finger joint angle measurements and eye movements determines which factors are more dominant in influencing CMT results [38].
Several examples show that the current CMT uses several elements of technology, especially movement sensors, body sensors, and cameras. However, researchers rarely discuss the application of CMT that considers human visual attention as an additional parameter. Designing the CMT framework with visual attention is necessary to support hand function performance. The technology to combine hand measurements and visual attention can be easily captured on smart glasses with eye-tracking features [40]. A physiology study showed that task-oriented tabletop activity training effectively restored upper limb function and visual perception in acute stroke patients [41]. However, machine learning algorithms to process multivariate data incur a computational cost in cyberspace. Combining different methods to deal with the above problems becomes a significant challenge. One appropriate way to address this issue is multi-scope processing [42], which is performed in stages.
Our previous work used a multiscopic approach to address dynamic locomotion in a legged robot [43]. Based on this experience, we propose a multiscopic framework for developing a CMT finger measurement system with visual attention. Figure 2 shows the design of the chopsticks and the visual attention measurement system. This proposed model integrates intelligent functions from a multiscopic point of view. This approach includes a) modeling finger kinematics and the chopsticks' state for feature extraction at the microscopic level, b) estimating the PIP finger feature [44] using an active perception ability at the mesoscopic level, and c) discovering the human visual attention using cognition ability at the macroscopic level.

III. PROPOSED FRAMEWORK
This section discusses the multiscopic approach that is used in CMT with attention. We explain the microscopic, mesoscopic, and macroscopic levels in three subsections. Together they describe our method: setting up the experiments, extracting finger features, detecting the chopsticks' state, estimating the joint angle, and measuring visual attention during the CMT.

A. MICROSCOPIC LEVEL: FEATURE EXTRACTION
Feature extraction is closely related to the data acquisition process and retrieves the attribute. Figure 3 shows the microscopic level: the feature extraction process in CMT, including finger kinematic and the chopsticks' state models. We propose a finger kinematic model and chopsticks state model to represent finger action in CMT to obtain the physical embodiment. In this section, we describe both models in detail.

1) FINGER KINEMATIC MODEL
Finger kinematic models [45] have been widely studied, especially in the industrial sector, for applications such as hand modeling and simulation, hand motion coordination, articulated human hands, prosthetic hands, and robotics. Our previous work [46] used a stereo infrared (SIR) camera from Leap Motion to capture the position of the hand and fingers. This sensor tracks fingers in a 3D zone extending from 10 cm to 75 cm from the device in a typical 170° × 170° field of view. The infrared stereo vision sensor takes every joint position and transforms it into a skeleton at 100 fps. This sensor is connected to a single application for hand visualization in the Cartesian space. After setting up the desktop view, the images are processed to remove noise and produce finger models [47]. This sensor is able to reduce occlusion in the finger position with reasonable accuracy. We describe these stages to classify the grasp pose taxonomy from an egocentric view.
A finger has three joints, the first joint A connecting the metacarpal and proximal, the second joint B connecting the proximal and intermediate phalanges, and the third joint C connecting the intermediate phalanges and distal phalanges. The thumb does not have metacarpals. We transform the data into a straightforward representation using finger kinematic models. Figure 3 (a) shows the finger kinematic model. This model is used as a joint rotation configuration [48], which is included in this framework to validate individual-specific finger motion precisely.
We derive a formula to calculate the joint angle of the finger based on the physical structure of the human body. To simplify the data, we replace the joint positions with a feature: the angle formed at the joint between two bones. We obtain this feature with a three-point vector-to-angle transformation. Given the Cartesian coordinates of three points A, B, and C, we can calculate $\overrightarrow{AB}$ and $\overrightarrow{BC}$. The two vectors are obtained from the three points in 3D coordinates as follows:

$$\overrightarrow{AB} = (x_B - x_A,\; y_B - y_A,\; z_B - z_A)$$

$$\overrightarrow{BC} = (x_C - x_B,\; y_C - y_B,\; z_C - z_B)$$

Next, we calculate the angle formed by A → B → C using the right-hand rule from B. The dot product of the two vectors is:

$$\overrightarrow{AB} \cdot \overrightarrow{BC} = |\overrightarrow{AB}|\,|\overrightarrow{BC}|\cos\theta_{PIP}$$

where $|\overrightarrow{AB}|$ is the length of $\overrightarrow{AB}$, $|\overrightarrow{BC}|$ is the length of $\overrightarrow{BC}$, and $\theta_{PIP}$ is the angle between these two vectors. Solving for the angle gives:

$$\theta_{PIP} = \arccos\left(\frac{\overrightarrow{AB} \cdot \overrightarrow{BC}}{|\overrightarrow{AB}|\,|\overrightarrow{BC}|}\right)$$

We use the standard hand pose in this CMT framework while holding chopsticks. Only the index and middle fingers are used to move the upper chopstick, and the thumb, ring finger, and little finger are used to hold the lower chopstick tightly.
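The vector-to-angle transformation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `joint_angle_deg` and the tuple-based point format are hypothetical, and the clamping of the cosine is added only to guard against floating-point rounding.

```python
import math

def joint_angle_deg(a, b, c):
    """Angle in degrees between vectors AB and BC for 3D points a, b, c.

    Hypothetical helper: a, b, c are (x, y, z) tuples of joint positions.
    """
    ab = [bi - ai for ai, bi in zip(a, b)]  # vector from A to B
    bc = [ci - bi for bi, ci in zip(b, c)]  # vector from B to C
    dot = sum(u * v for u, v in zip(ab, bc))
    norm_ab = math.sqrt(sum(u * u for u in ab))
    norm_bc = math.sqrt(sum(v * v for v in bc))
    if norm_ab == 0.0 or norm_bc == 0.0:
        raise ValueError("joint positions must not coincide")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_ab * norm_bc)))
    return math.degrees(math.acos(cos_theta))
```

With this convention, three collinear joints give an angle of 0 degrees, and the angle grows as the finger flexes at B.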
The position of all fingers in Cartesian coordinates would lead to many possible joint positions. We record the index finger's trajectory and calculate the PIP joint angle during the chopsticks activity. The movement of the joint positions shows the PIP's range of motion (ROM) during CMT. After obtaining the finger position in Cartesian coordinates, we record each joint's minimum and maximum positions. With this geometric representation of the finger, we are able to represent its kinematics and motions. We use two red markers on a single chopstick. The first marker is attached to the bottom of the handheld chopstick $(x_{bot}, y_{bot})$, and the second to the upper part $(x_{up}, y_{up})$. The distance between these two markers is 100 mm. These markers assist in obtaining the angle and length between the two upper and the two bottom features [49]. These features are essential to obtain the characteristics of the chopsticks activity. Figure 3 (b) shows the angle and distance between the chopsticks' upper and bottom features. We calculate the estimated upper angle $\theta_{up}$ and bottom angle $\theta_{bot}$ features from the marker coordinates, where $(x_{bot1}, y_{bot1})$ and $(x_{bot2}, y_{bot2})$ are the respective coordinates of the bottom chopstick feature positions, and $(x_{up1}, y_{up1})$ and $(x_{up2}, y_{up2})$ are the respective coordinates of the upper chopstick feature positions. This approach is beneficial in reducing the computation in hand recognition processing [50], for example, detecting poses, actions, or gestures.
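As a rough illustration of how an inclination angle could be derived from a pair of feature points, the sketch below uses the two-argument arctangent of the coordinate differences. The exact expression used for $\theta_{up}$ and $\theta_{bot}$ is not given in the text, so this formula, the function name `inclination_deg`, and the choice of the image x-axis as the reference direction are all assumptions.

```python
import math

def inclination_deg(p1, p2):
    """Inclination (degrees) of the line from p1 to p2 relative to the x-axis.

    Assumption: p1 and p2 are (x, y) image coordinates of two feature points,
    e.g. (x_up1, y_up1) and (x_up2, y_up2).
    """
    (x1, y1), (x2, y2) = p1, p2
    # atan2 handles vertical lines (x2 == x1) without a division by zero.
    return math.degrees(math.atan2(y2 - y1, x2 - x1))
```

Using `atan2` rather than a plain arctangent of the slope is a design choice here: it stays defined when the two points share the same x coordinate.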

2) CHOPSTICKS STATE MODEL
In the CMT framework, we use red chopstick markers and blue finger markers. A red marker is attached to the chopsticks for finger joint angle estimation by obtaining ground truth data to minimize errors. We perform marker detection on the index finger because it is visible in egocentric vision [6], while the middle finger is closed. During the testing stage, the blue marker is removed so that participants feel comfortable when collecting data. In the mesoscopic setting, this blue marker is used as a marker for the tips of the chopsticks. We obtain the finger features by connecting the center of each blue marker with the center of the nearest blue marker to form a skeletal finger. We use smart glasses as non-contact wearable sensors with an egocentric view [13] and set them up from the participant's perspective. We mount the smart glasses facing down at an average angle of 75° to obtain the best results. After setting up the egocentric view, we prepare the algorithm to detect chopsticks and finger markers. We use HSV color extraction and median-dilate filtering to remove the noise, strengthen the features, and detect the contours. There are two things we can obtain from these feature extraction processes. First, we can identify the feature in a specific image, and second, we can determine the exact location of the feature within the image. We measure the length between the two upper markers and between the two bottom markers on a pair of chopsticks. We are able to distinguish between the pinch and move modes by measuring these two lengths. These features are essential to obtain the characteristics of the chopsticks activity. We calculate the estimated upper length $L_{up}$ and bottom length $L_{bot}$ features as follows:

$$L_{up} = \frac{\sqrt{(x_{up2} - x_{up1})^2 + (y_{up2} - y_{up1})^2}}{\sqrt{(x_{up1} - x_{bot1})^2 + (y_{up1} - y_{bot1})^2}} \cdot 100\ \text{mm} \quad (7)$$

$$L_{bot} = \frac{\sqrt{(x_{bot2} - x_{bot1})^2 + (y_{bot2} - y_{bot1})^2}}{\sqrt{(x_{up1} - x_{bot1})^2 + (y_{up1} - y_{bot1})^2}} \cdot 100\ \text{mm} \quad (8)$$

The above equations show how we calculate the lengths using the Pythagorean formula.
In simple terms, the above calculation compares the distance between the center of the marker on one stick of the chopsticks in pixels with the actual marker length. We can take this result of the microscopic level process to the next level, as discussed in the next section.
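The pixel-to-millimetre scaling described above can be sketched as follows. This is an illustrative sketch only: the function names and argument layout are hypothetical, and the only value taken from the text is the 100 mm spacing between the two markers on one chopstick, which serves as the reference length.

```python
import math

MARKER_SPACING_MM = 100.0  # from the text: markers on one chopstick are 100 mm apart

def pixel_distance(p, q):
    """Euclidean (Pythagorean) distance between two (x, y) pixel coordinates."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def scaled_length_mm(p, q, ref_a, ref_b, ref_mm=MARKER_SPACING_MM):
    """Length between p and q in mm, scaled by a reference of known length.

    Assumption: ref_a and ref_b are the centres of the two markers on a single
    chopstick, so their pixel distance corresponds to ref_mm.
    """
    ref_px = pixel_distance(ref_a, ref_b)
    if ref_px == 0.0:
        raise ValueError("reference markers must not coincide")
    return pixel_distance(p, q) / ref_px * ref_mm
```

For example, if the reference markers are 50 pixels apart and two feature points are 5 pixels apart, the estimated length is 10 mm. The scaling assumes that the measured segment and the reference lie at a similar depth from the camera.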

B. MESOSCOPIC LEVEL: ACTIVE PERCEPTION
Active perception is the ability to capture, process, and actively make sense of the information at the mesoscopic level. To determine an active perception ability, we design a task-dependent rule set using the finger state from feature extraction at the microscopic level. Task dependency is a relationship in which tasks depend on regulations to be performed in a specific order developed in our previous study [51]. This task relates to CMT activities where the fingers interact with the chopsticks. To ensure that the index finger is measured with precision, we attach markers to each joint of the index finger. These markers are connected one by one to form the skeletal finger. However, the marker affixed to the finger is uncomfortable for CMT participants. Therefore, we estimate the PIP joint angle with supervised learning to remove the marker on the finger. This idea is inspired by the problems associated with therapy assessment, which sometimes makes the patients feel uncomfortable when sensors or markers are attached to their hands. We chose seamless monitoring in the CMT process by estimating the PIP joint angle with supervised learning using the chopsticks feature to remove the marker on the finger. Our framework prioritizes two main sub-processes: pinching mode (including grip) and moving mode (including hold and release). To separate the two sub-processes, we use the angle inclination parameters on the upper and lower chopsticks (θ up , θ bot ) together with the upper and bottom length features (L up , L bot ). Figure 4 shows the mesoscopic level: feature extraction to active perception and symbolic cognition.
We use a Cheng-type FCC-NN [52] and compare it with a conventional architecture, the multi-layer perceptron (MLP). We apply the upper length ($L_{up}$) and bottom length ($L_{bot}$) as the inputs of the neural network to obtain the PIP joint angle ($\theta_{PIP}$) as output. We collect data for 1 minute for all chopstick manipulation poses in different directions. In FCC-NN, a hidden layer is limited to only one node. Adding a new node increases the number of hidden layers, so the number of hidden layers is always the same as the number of hidden nodes. We employ an FCC-NN consisting of one hidden layer as the learning architecture. It has a fully connected characteristic, where each neuron is connected to the others. It is different from MLP, in which nodes in layer N receive input only from the output nodes in layer N-1. In FCC-NN, nodes in each layer obtain their input from all output nodes in the previous layers (N-1, N-2, . . . ). The output of the k-th neuron in FCC-NN is calculated by:

$$x_k = f(z_k)$$

In the above equation, the input $z_k$ is processed by an activation function $f$. The output is $x_k$, where $k$ indexes the $n$ nodes. We have two layers in the FCC-NN design. The first layer (layer-1) is the hidden layer, which consists of one hidden node $x_3$. The output of this hidden node is obtained from $x_1$ and $x_2$, which are multiplied by $w_{1,3}$ and $w_{2,3}$, respectively:

$$x_3 = f(w_{1,3}\,x_1 + w_{2,3}\,x_2)$$

The second layer (layer-2) receives three input nodes ($x_1$, $x_2$, $x_3$):

$$y = f(w_{1,4}\,x_1 + w_{2,4}\,x_2 + w_{3,4}\,x_3)$$

The output node ($y$) in the output layer is obtained from $x_1$, $x_2$, and $x_3$, which are multiplied by $w_{1,4}$, $w_{2,4}$, and $w_{3,4}$, respectively. We perform backpropagation with a backward pass on the Cheng-type FCC-NN.
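To make the cascade connectivity concrete, the following sketch implements a forward pass in which every non-input node receives all earlier nodes as input. It is a simplified illustration rather than the authors' network: bias terms are omitted, the same activation is applied to every node, the sigmoid default is an assumption, and the name `fcc_forward` and the nested-list weight layout are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation; an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def fcc_forward(inputs, weights, activation=sigmoid):
    """Forward pass of a fully connected cascade network.

    weights[j] holds one weight per earlier node (inputs and previously
    computed nodes) for the j-th non-input node. The last node is the output.
    """
    nodes = list(inputs)
    for w in weights:
        if len(w) != len(nodes):
            raise ValueError("each node needs one weight per earlier node")
        z = sum(wi * xi for wi, xi in zip(w, nodes))  # weighted sum of all earlier nodes
        nodes.append(activation(z))
    return nodes[-1]
```

For the two-input, one-hidden-node design in the text, a call would look like `fcc_forward([l_up, l_bot], [[w13, w23], [w14, w24, w34]])`. With a sigmoid on the output node, the inputs and the target angle would need to be normalized, which is a further assumption of this sketch.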
We use the Savitzky-Golay filter, which was successfully implemented on EEG [53], to remove error spikes from the PIP joint angle data. This filter is widely used to increase the precision of time series data without distorting the data's trend. This process is essential to detect the slightest signal tendency during the chopsticks' movement. The data consist of a set of points $\{x_j, y_j\}$, where $j = 1, \ldots, n$. The symbol $x_j$ is an independent variable (time) and $y_j$ is an observed value (joint angle). They are treated with a set of $m$ convolution coefficients $C_i$ according to the expression:

$$Y_j = \sum_{i=-(m-1)/2}^{(m-1)/2} C_i\, y_{j+i}, \qquad \frac{m+1}{2} \le j \le n - \frac{m-1}{2}$$

where $Y_j$ is the smoothed value at point $j$. We choose the number of coefficients and the polynomial order to fit the samples. These hyperparameters are selected because they are the most suitable based on several experimental results.
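The convolution form of the filter can be illustrated with the well-known 5-point quadratic coefficients (-3, 12, 17, 12, -3)/35. The window length and polynomial order actually used in this study are not stated here, so the values below are illustrative assumptions, as are the function name and the decision to leave the edge samples unfiltered.

```python
# Standard Savitzky-Golay smoothing coefficients for a 5-point window and a
# quadratic (or cubic) polynomial. Assumed here for illustration only.
SG_COEFFS = (-3.0, 12.0, 17.0, 12.0, -3.0)
SG_NORM = 35.0

def savgol_5pt(y):
    """Smooth a sequence with a 5-point quadratic Savitzky-Golay filter.

    The first and last two samples are copied unchanged, because the
    symmetric window does not fit there.
    """
    half = len(SG_COEFFS) // 2
    smoothed = list(y)
    for j in range(half, len(y) - half):
        acc = sum(c * y[j + i - half] for i, c in enumerate(SG_COEFFS))
        smoothed[j] = acc / SG_NORM
    return smoothed
```

A useful sanity check is that purely quadratic data pass through unchanged, while an isolated spike is attenuated. In practice, a library routine such as SciPy's `savgol_filter` would normally be preferred over a hand-written loop.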
In CMT, there are two finger movement postures: pinch mode and move mode [54]. We find local minima and maxima to detect these modes by comparing neighboring values. We use two parameters in this function, namely height and distance. The height parameter determines the minimum peak height. We choose the Root Mean Square (RMS) of the signal as the height parameter, which serves as the threshold between the pinching and moving modes and separates the peaks from the valleys. The distance parameter is the minimum horizontal distance, in samples, required between adjacent peaks. This parameter eliminates the smaller peaks first until the condition is met for all remaining peaks.
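The height and distance rules described above can be sketched with a small pure-Python peak finder. This is a simplified stand-in, not the function used in the study: the names `rms` and `find_peaks_simple` are hypothetical, and only strict local maxima are considered, so flat-topped peaks are ignored.

```python
import math

def rms(values):
    """Root mean square of a sequence, used here as the height threshold."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def find_peaks_simple(signal, height, distance):
    """Indices of local maxima that satisfy the height and distance rules.

    A sample is a candidate if it is strictly greater than both neighbours
    and at least `height`. Candidates are then visited from highest to
    lowest, and a candidate is kept only if it lies at least `distance`
    samples from every peak already kept, so smaller peaks are dropped first.
    """
    candidates = [
        i for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] > signal[i + 1] and signal[i] >= height
    ]
    kept = []
    for i in sorted(candidates, key=lambda idx: signal[idx], reverse=True):
        if all(abs(i - k) >= distance for k in kept):
            kept.append(i)
    return sorted(kept)
```

Valleys can be found with the same routine by negating the signal and choosing a threshold appropriate for the negated values.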
After distinguishing the pinching and moving peaks, we propose a joint finger measurement value in CMT [20], namely Joint Angle Estimation Movement (JAEM). This movement parameter includes the variable $T_{JAEM}$, which is the average time required for the chopsticks to open and close, in milliseconds. We obtain $T_{JAEM}$ by subtracting the time of the nearest valley, $T_{V.JAEM}$, from the time of the peak, $T_{P.JAEM}$, based on the following equation:

$$T_{JAEM} = T_{P.JAEM} - T_{V.JAEM}$$

The second variable, $ROM_{JAEM}$, is the range of the estimated PIP joint angle between the open and closed chopstick positions, in degrees (°). It is obtained by finding each signal's average distance from peak to valley. Finally, we determine $\omega_{JAEM}$, the angular velocity of JAEM, by dividing the change of its Range of Motion ($ROM_{JAEM}$) by the change of $T_{JAEM}$:

$$\omega_{JAEM} = \frac{\Delta ROM_{JAEM}}{\Delta T_{JAEM}}$$

The $\omega_{JAEM}$ shows how fast the PIP joint angle changes during CMT in degrees per millisecond (°/ms). This variable measures the development of a person's finger motor function during rehabilitation with CMT. We design symbolic cognition from the task domain knowledge at this level to support our previous study [55]. This task is related to CMT activities where the fingers interact with chopsticks. After obtaining the angle values of the upper and bottom chopsticks, we detect the two chopsticks modes with supervised learning. We use the MLP module as the learning architecture. The input layer contains two nodes ($\theta_{up}$, $\theta_{bot}$), while the output layer has two nodes ($y_0$, $y_1$), one for each class of chopsticks grip pose (pinching and moving mode). We use fully connected networks, so every unit receives a connection from all the units in the previous layer. Our model uses one hidden layer consisting of three hidden nodes.
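The three JAEM quantities described above can be computed from paired peak and valley samples, as in the sketch below. It is an illustration under stated assumptions: each valley is assumed to be already paired with its nearest peak at the same list index, the averaging over pairs follows the description of the variables as averages, and the function name `jaem_indices` is hypothetical.

```python
def jaem_indices(peak_times_ms, peak_angles_deg, valley_times_ms, valley_angles_deg):
    """Return (T_JAEM in ms, ROM_JAEM in degrees, omega_JAEM in degrees/ms).

    Assumption: element i of the peak lists and element i of the valley
    lists describe one matched peak/valley pair.
    """
    pairs = list(zip(peak_times_ms, peak_angles_deg, valley_times_ms, valley_angles_deg))
    if not pairs:
        raise ValueError("at least one peak/valley pair is required")
    durations = [abs(tp - tv) for tp, _, tv, _ in pairs]   # time between valley and peak
    ranges = [abs(ap - av) for _, ap, _, av in pairs]      # peak-to-valley angle change
    t_jaem = sum(durations) / len(durations)
    rom_jaem = sum(ranges) / len(ranges)
    if t_jaem == 0:
        raise ValueError("peak and valley times must differ")
    return t_jaem, rom_jaem, rom_jaem / t_jaem
```

For instance, two cycles lasting 200 ms each with angle changes of 30 and 40 degrees give an average range of 35 degrees and an angular velocity of 0.175 degrees per millisecond.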
We employ a training loop similar to previous models: we establish an optimizer, feed the inputs to the model, calculate the loss, and then optimize it using the autograd function. The training steps are as follows: (a) standardize the data structure so that it is accepted as an MLP input with node and weight properties, (b) perform a linear transformation on the incoming data, (c) apply the sigmoid function element-wise during the learning process, and (d) use the SoftMax layer to generate the final classification. The aim is to determine to which of the two CMT categories, pinch mode or move mode, the input data belongs. Each node in the output layer receives output from the model. We obtain the class information via the SoftMax function.

C. MACROSCOPIC LEVEL: COGNITIVE ABILITY
Cognitive ability involves human attention to a particular object and expanding on it over an extended time. We design this ability from the task domain knowledge at the macroscopic level. This task is related to CMT activities where the chopsticks' movement is based on visual attention. Figure 5 shows the macroscopic level: cognitive ability design using eye movement and chopstick tip tracking. The direction of attention is usually focused on the object and the tip of the chopsticks. The movement of the chopstick tip also follows the person's attention. We utilized Tobii Pro Glasses 3 to get attention on the image plane [56]. This intelligent eyeglass gives us information about the attention location represented by a red circle in the image area. The location of the center of attention can rapidly change according to the movement of the eyes. We also put blue markers on both tips of the chopsticks to determine the location of the chopstick tips in the image coordinate.
We detect the red circles and the blue marker using preprocessing techniques as the feature extraction at the microscopic level. We apply HSV color extraction, median-dilate filtering, and contour detection [49]. Then, we obtain the centroid of the attention in the x and y coordinates of the image. The placement of the peanut bowl is made vertically. The source bowl is located at the bottom, and the target bowl is above the image plane. We simplify the movement tracking on the y coordinate.
This system gives two output signals. First, the red signal indicates the Chopstick's Attention Movement (CAM), while the blue signal specifies the Chopstick's Tip Movement (CTM). At first glance, these two signals look the same because the attention movement moves almost simultaneously with the tip of the chopsticks. We measure the time required to perform CAM and CTM and compare these data. These two evaluation indices simultaneously determine the speed of the hand and eye movements. Figure 5 shows the red and blue signals with several peaks due to eye and chopstick tip tracking. We measure the time from valley to signal peak and return to the second valley indicated by the arrow symbol. The time series data represents the activity that succeeded in picking up peanuts and putting them down until starting to pick them up again.
We measure time by finding the peaks and valleys for each CAM and CTM signal using the following discriminant equations:

$$\Delta CAM(t) = CAM(t+1) - CAM(t)$$

$$\Delta CTM(t) = CTM(t+1) - CTM(t)$$

The above discriminant equations are obtained by comparing the data at $t$ and $t + 1$. After we obtain the discriminant results $\Delta CAM$ from the CAM and $\Delta CTM$ from the CTM, we determine the peaks and valleys in the new time series data with several parameters such as distance and prominence. We select the distance and prominence values that gave the best results over several experiments. After obtaining the peaks and valleys of CAM, we look for the nearest peaks and valleys by subtracting each valley time $T_{V.CAM}$ from the peak time $T_{P.CAM}$. For the CTM, we obtain the nearest peaks and valleys by subtracting each valley time $T_{V.CTM}$ from the peak time $T_{P.CTM}$, based on the following equations:

$$T_{CAM} = T_{P.CAM} - T_{V.CAM}$$

$$T_{CTM} = T_{P.CTM} - T_{V.CTM}$$

After obtaining the distance and time parameters, we then obtain the linear velocities $v_{CAM}$ and $v_{CTM}$ using the following equations:

$$v_{CAM} = \frac{D_{CAM}}{T_{CAM}}$$

$$v_{CTM} = \frac{D_{CTM}}{T_{CTM}}$$

The $T_{CAM}$ and $D_{CAM}$ are the average time and distance achieved by CAM from one bowl to another, and the $T_{CTM}$ and $D_{CTM}$ are the average time and distance achieved by CTM from one bowl to another. The $v_{CAM}$ and $v_{CTM}$ both have units of pixels/ms. These two variables are evaluated in the CMT experiment with scenarios using regular and trainer chopsticks. In the following sections, we discuss the results of our proposed method.
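The comparison of consecutive samples and the velocity calculation described above can be sketched as follows. This is a minimal illustration, assuming that the discriminant is the first difference of the y-coordinate series and that a peak or valley is marked where its sign changes. The function names are hypothetical, and the distance and prominence filtering mentioned in the text is omitted.

```python
def first_difference(series):
    """Difference between each sample and the next one: y(t + 1) - y(t)."""
    return [series[t + 1] - series[t] for t in range(len(series) - 1)]

def turning_points(series):
    """Return (peaks, valleys) as index lists based on sign changes of the difference."""
    diff = first_difference(series)
    peaks, valleys = [], []
    for t in range(1, len(diff)):
        if diff[t - 1] > 0 and diff[t] < 0:
            peaks.append(t)      # rising then falling: local maximum at t
        elif diff[t - 1] < 0 and diff[t] > 0:
            valleys.append(t)    # falling then rising: local minimum at t
    return peaks, valleys

def linear_velocity(distance_px, time_ms):
    """Velocity in pixels/ms from a travelled distance and the time taken."""
    if time_ms <= 0:
        raise ValueError("time must be positive")
    return distance_px / time_ms
```

As a usage example, a tip that travels 6 pixels between a valley and the next peak in 20 ms has a velocity of 0.3 pixels/ms. The same functions apply to both the attention signal and the chopstick tip signal.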

IV. RESULT AND DISCUSSION
First, we discuss the feature extraction ability at the microscopic level. The finger kinematic model is used to estimate the finger's features [59]. However, not all feature extractions give precise results because the position of the fingers is not always perpendicular to the camera. This issue was addressed by comparing the length of the joint position with the length of the chopsticks. Therefore, we can use RGB cameras alone, rather than depth cameras, to obtain acceptable results. The marker detection based on contour detection was successfully applied to obtain the position of the chopsticks and index fingers compared to surface EMG [60]. However, marker detection did not always provide a high level of accuracy due to various factors such as occlusion and lighting. Minor errors occur when the hand moves quickly and produces a blurry image. To obtain the best results, the background color of the environment was limited and the area had uniform lighting. The CMT video result is available in our GitHub repository.
Second, we examine the participants' active perception ability at the mesoscopic level. We collected data from the One-Minute Chopsticks Challenge for ten participants who are not habitual chopstick users. Using chopsticks as a novel tool, we are inspired to test the behavioral performance improvement and the visual strategies used during skill acquisition [39]. Before trying it on rehabilitation patients, we obtained permission to test CMT on international students. All the participants gave their informed consent before this study. The following is the Standard Operating Procedure for conducting experiments:
a) Experiment ethics: The researcher informs the participants that this study was approved by the Ethical, Legal, and Social Issues (ELSI) of Tokyo Metropolitan University, Japan.
The eye-tracking data collected in the experiment will be stored and used in accordance with the eye-tracking data transparency policy [57], which relates to: (i) what data to collect, (ii) how the data will be used, (iii) what the eye-tracking data will not be used for, and (iv) how the data will be helpful to users.
b) Chopsticks trial: After the participants have agreed to participate, the researcher gives them one minute to read the guidelines on using regular chopsticks and adult-sized Edison training chopsticks, which are commonly used to help people who have difficulty gripping in tabletop activities [40]. The training chopsticks have two holes on the upper chopstick into which the participants place their index and middle fingers so that they grip correctly. The training chopsticks were selected so that the participants, who are not habitual chopstick users, could engage in error-less learning. By ensuring that the participants practice using chopsticks under the paradigm of error-less learning, we can assume linearity in the time-related variables. This paradigm also ensures that all participants succeed in their initial task, because many failures would cause participants to give up.
c) Preparation and setup: The first step in the preparation and setup is to condition the participants to be ready for the experiment [58].
The preparation and setup were undertaken as follows: (i) peanuts are placed in the bowl closest to the participant, then the empty bowl is placed on the far side; (ii) participants wear eye trackers, and they can request additional lenses if needed; (iii) the dominant hand holds the standard chopsticks, then places them on the square pad; (iv) the non-dominant hand is folded inward close to the body (or it may be placed on the table); (v) the computer system is connected to the eye-tracker wirelessly; (vi) the trainer chopsticks are prepared; (vii) participants calibrate the eye-tracker with a calibration card; (viii) the battery is checked every time the experiment is finished, and the battery is replaced if the capacity is <30%; (ix) pictures are taken from behind the participants.
d) Experiment: We ask participants to pick up as many peanuts as possible from one bowl and transfer them to another. Each participant takes part in two experiments: (i) Scenario A using regular chopsticks (1 minute); (ii) Scenario B using trainer chopsticks (1 minute). In scenario A, the participants are required to use regular chopsticks. In scenario B, they are required to use training chopsticks. Both scenarios are used to test the system at the mesoscopic and macroscopic levels.
e) Closing: Participants remove the eye-tracker and lens and return these to their original place; researchers return the nuts to their original place; all equipment is made ready for the next participant.
We successfully implemented FCC-NN as a learning algorithm to estimate the PIP joint angle in the preliminary test. The upper length ($L_{up}$) and bottom length ($L_{bot}$) were the inputs for the networks, and the PIP joint angle ($\theta_{PIP}$) was the output. We compare this result with MLP-NN using several hidden nodes and evaluate the performance. Figure 6 shows the training and testing results of PIP joint angle estimation using MLP and FCC-NN. The FCC-NN training loss decreases more rapidly than that of the MLP.
We found that learning with FCC-NN achieves slightly better accuracy in less learning time than an MLP with three hidden nodes. The accuracy obtained is 98.9%, with an average error of 0.0339 radians (1.947°). Therefore, FCC-NN provides slightly better results and speeds up the process [52]. However, there is no guarantee that the accuracy and learning time of the above neural networks will be the same on different data sets and under different conditions. Such variation can be due to differences in hand size, a non-standard way of holding chopsticks, sensor readings, or any environmental changes. Table 2 compares FCC-NN with MLP at the 10,000th epoch.
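To illustrate the cascade topology discussed above, the following minimal Python sketch performs a forward pass through a fully connected cascade network, in which each neuron receives every network input plus the outputs of all earlier neurons. It is not the authors' implementation: the `tanh` hidden activation, the linear output neuron, the weight values, and the name `fcc_forward` are all assumptions made for illustration.

```python
import math


def fcc_forward(inputs, weights, biases):
    """Forward pass of a fully connected cascade (FCC) network.

    Neuron k receives all network inputs and the outputs of neurons 0..k-1,
    so weights[k] must have len(inputs) + k entries.
    Assumptions: tanh for hidden neurons, a linear last (output) neuron.
    """
    signals = list(inputs)
    last = len(weights) - 1
    for k, (w, b) in enumerate(zip(weights, biases)):
        if len(w) != len(signals):
            raise ValueError(f"neuron {k} expects {len(signals)} weights")
        net = sum(wi * si for wi, si in zip(w, signals)) + b
        out = net if k == last else math.tanh(net)
        signals.append(out)  # later neurons can see this output
    return signals[-1]


# Hypothetical normalized inputs: upper and bottom lengths.
l_up, l_bot = 0.5, 0.25
# Arbitrary illustrative weights: one hidden neuron, one output neuron.
demo_weights = [[1.0, -1.0], [0.5, 0.5, 2.0]]
demo_biases = [0.0, 0.1]
theta_pip = fcc_forward([l_up, l_bot], demo_weights, demo_biases)
print(theta_pip)
```

Because the output neuron also sees the raw inputs directly, a cascade network with a given number of neurons has more connections than a layered MLP with the same number of neurons.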
At the mesoscopic level, we also propose a Joint Angle Estimation Movement (JAEM) evaluation index to predict the movement of the PIP joint angle when using chopsticks. We measure the differences in pinching and moving between the two scenarios. The experiment results indicate that all participants achieve better JAEM results in less operating time in scenario B than in scenario A. We found that the angular velocity of JAEM ($\omega_{JAEM}$) can be used for further finger analysis [61]. The results show that all the participants had a higher $\omega_{JAEM}$ value in scenario B. This result shows that participants are able to perform the chopstick activity better when using the training chopsticks, which is supported by the fact that more peanuts were moved in scenario B than in scenario A. We assume that the $\omega_{JAEM}$ evaluation index can be used to measure the participants' progress in using chopsticks by measuring the angular velocity of the PIP joint angle estimation movement.
Third, we evaluate cognitive ability during CMT at the macroscopic level. Table 3 shows the CMT results of Participant 1 (P1) in the two scenarios. Two evaluation indices, $v_{CAM}$ and $v_{CTM}$, were measured by calculating the average times $T_{CAM}$ and $T_{CTM}$ and the distances $D_{CAM}$ and $D_{CTM}$. The experiment results at the macroscopic level show some change in the attention heatmap. In scenario A, most participants focused more on the source bowl than the destination bowl. In scenario B, the participants were more confident and focused on picking up the peanuts rather than placing them in the other bowl [62]. The behavioral changes in eye focus coincided with the CAM and CTM evaluation indices. All participants showed an increase in $v_{CAM}$ and $v_{CTM}$ in scenario B compared to scenario A.
This evaluation compares the final results of the three proposed evaluation indices with the conventional method of counting the number of peanuts moved (NP). Table 4 compares the evaluation indices for ten participants in the two scenarios. The average increase in the number of peanuts moved by all participants is 48.47%. Thus, the performance of the index finger PIP joint increases for all participants from before to after training. At the mesoscopic level, the average increase for $\omega_{JAEM}$ is 26.79%; in other words, the average angular velocity at the PIP joint increases by 0.0056°/ms. At the macroscopic level, the average increase for $v_{CAM}$ is 42.13%, and the average increase for $v_{CTM}$ is 28.63%. The experiment results show that the gaze recorded by the eye tracker moves earlier than the chopstick tip. Hence, we conjecture that visual attention influences finger movement significantly. The behavioral changes in eye focus coincided with the chopstick attention movement (CAM) and chopstick tip movement (CTM) evaluation indices. Thus, we consider that combining visual attention with finger joint angle estimation allows the behavioral performance and visual strategy to be interpreted better than in the previous study [39]. This result is supported by regression and correlation analysis between finger movement speed and eye movement angular velocity. We performed regression analysis [63] to verify the linearity between the three proposed variables, $v_{CAM}$, $v_{CTM}$, and $\omega_{JAEM}$, and the number of peanuts (NP) moved as the conventional method in scenarios A and B. Table 5 shows the Pearson correlation coefficients for ten participants in the two scenarios. The regression results show that the gradient values of the two scenarios do not differ much. Relative to Scenario A, the regression line in Scenario B shifts toward the upper right, with a positive change in its constant term. This result shows an increase in the values of the three proposed variables. 
The three proposed evaluation indices have a linear relationship with the NP moved in the conventional method. We conclude that the system has successfully estimated the finger measurement together with visual attention. Thus, these three evaluation indices can be proposed as indicators of finger and eye movement progress during CMT.
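The comparison of regression lines between the two scenarios can be sketched as follows. This is an illustrative least-squares fit only: the data values are hypothetical, not taken from Table 4, the choice of NP as the independent variable is an assumption, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y need the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has zero variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: peanuts moved (NP) and a velocity index in pixels/ms.
np_a, v_a = [10, 12, 14, 16], [0.20, 0.24, 0.28, 0.32]  # scenario A
np_b, v_b = [15, 17, 19, 21], [0.35, 0.39, 0.43, 0.47]  # scenario B

slope_a, intercept_a = fit_line(np_a, v_a)
slope_b, intercept_b = fit_line(np_b, v_b)
print(slope_a, intercept_a)
print(slope_b, intercept_b)
```

In this made-up example the two slopes are equal and only the constant term differs, which mirrors the qualitative pattern reported above: similar gradients with the line for scenario B shifted relative to scenario A.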
Next, we undertake a correlation analysis to verify the strength of the relationships between the indices. We analyzed each evaluation index using the Pearson correlation coefficient (R) [64]. The R-values of the three relationships $R(v_{CAM}, NP)$, $R(v_{CTM}, NP)$, and $R(v_{CAM}, v_{CTM})$ are all positive. These R-values show that $v_{CAM}$ and $v_{CTM}$ are associated with the increase in chopsticks skill, indicated by the NP taken in one minute in both scenarios. Likewise, if the visual attention is high, the finger joint angular velocity decreases because the participant is more careful in pinching and moving the peanut. Thus, visual attention generally dominates skill improvement compared to finger movements. These results indicate that visual attention affects the progress of hand-eye coordination performance, especially in CMT activities.
However, the R-values of the other three relationships, $R(\omega_{JAEM}, NP)$, $R(v_{CAM}, \omega_{JAEM})$, and $R(v_{CTM}, \omega_{JAEM})$, are all negative. The R-value shows that $\omega_{JAEM}$ has a negative correlation with the NP, which means that an increase in $\omega_{JAEM}$ is associated with a reduced NP. Likewise, increasing $v_{CAM}$ and $v_{CTM}$ is associated with a reduced $\omega_{JAEM}$. If the finger moves faster, the precision of pinching objects decreases, as in robotic imitation learning cases [65]. If the angular velocity of the finger is high, it is more challenging to pick up and move the peanuts. In the case of a person skilled with chopsticks, we can assume linearity. However, in the case of an inexperienced user, it is challenging to assume linearity because the speed increases will exceed the upper limit of saturation, resulting in a reduced success rate. This assumption is consistent with the experiment results, which show that $\omega_{JAEM}$ has a negative correlation with NP. Figure 7 shows the regression and correlation analysis in a scatter plot.
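The sign of the Pearson correlation coefficient discussed above can be checked with a short Python sketch. The sample values are hypothetical and do not come from Table 5, and `pearson_r` is a hypothetical helper name; the formula itself is the standard sample Pearson coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y need the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-participant values for one scenario.
peanuts_moved = [10, 12, 14, 16]           # NP
attention_speed = [0.20, 0.24, 0.28, 0.32]  # a velocity index, pixels/ms
joint_speed = [0.030, 0.028, 0.025, 0.021]  # an angular velocity index, deg/ms

r_attention = pearson_r(attention_speed, peanuts_moved)
r_joint = pearson_r(joint_speed, peanuts_moved)
print(r_attention, r_joint)
```

Here the first coefficient is positive because the velocity index rises with NP, while the second is negative because the angular velocity index falls as NP rises, matching the two sign patterns described in the text.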
Compared to the contact method, the multiscopic approach has several advantages. First, this method estimates the finger joint angle by simply using a camera to detect markers on the chopsticks, whereas contact methods might use a sensor on the object, chopsticks, or body to capture the CMT. Second, this method may replace the electric glove, which is considered less comfortable, for describing the finger kinematic model. Third, egocentric vision is regarded as the most appropriate for maintaining privacy compared to the other methods, and it is more suitable for independent hand rehabilitation using smart glasses. Classifying the pinch and move modes based on time-series gaze analysis is also necessary [66]. The real-time CMT results could serve as an input or sensor for robotic hand prostheses, partner robots, or other health care systems.
Moreover, integrating finger measurement and visual attention in the CMT assessment can provide several benefits that may not be achieved with the conventional method. We obtain more detailed information regarding the causes of success or failure when the patient undertakes a CMT assessment. The causative factor may be seen in the dominant movement of the fingers and eyes [67]. We can acquire information on the increase or decrease in the function of the hand and eye movements simultaneously. This information may help the therapist develop a rehabilitation plan to improve hand-eye coordination. However, there is a physical limitation: practitioners must work with the 2D position of the chopsticks during data collection. We therefore need to implement a data acquisition strategy that reduces the degrees of freedom. As a result, the proposed method must still be validated against existing CMT practices in rehabilitation facilities. These three proposed evaluation indices are expected to provide added value to the CMT measurement.

V. CONCLUSION AND FUTURE WORKS
This paper proposes finger joint angle estimation with visual attention in the CMT using a multiscopic framework. Feature extraction based on the finger kinematic model is developed at the microscopic level. The marker detection algorithm is applied to determine the status of the chopsticks and finger joints at the mesoscopic level. The proposed FCC-NN module is able to learn the bottom and upper chopsticks' length patterns for estimating the PIP joint angle and for pinching-moving detection. The learning algorithm increased the accuracy compared to conventional methods and reduced the number of hidden units at the mesoscopic level. Based on the experimental results, we can use a small amount of the angular velocity of JAEM with acceptable accuracy and reduce the processing time. In addition, we investigate the local maxima and minima at the macroscopic level to obtain the velocity of the CAM and CTM indices for visual attention and chopstick movement evaluation. We expect this study to benefit therapists and researchers by providing valuable information that is not accessible in the clinic. For future work, we will collect more human samples for further validation and deploy the system for rehabilitation purposes.