huSync - A Model and System for the Measure of Synchronization in Small Groups: A Case Study on Musical Joint Action

Human communication entails subtle non-verbal modes of expression, which can be analyzed quantitatively using computational approaches and thus support human sciences. In this paper we present huSync, a computational framework and system that utilizes trajectory information extracted using pose estimation algorithms from video sequences to quantify synchronization between individuals in small groups. The system is exploited to study interpersonal coordination in musical ensembles. Musicians communicate with each other through sounds and gestures, providing nonverbal cues that regulate interpersonal coordination. huSync was applied to recordings of concert performances by a professional instrumental ensemble playing two musical pieces. We examined effects of different aspects of musical structure (texture and phrase position) on interpersonal synchronization, which was quantified by computing phase locking values of head motion for all possible within-group pairs. Results indicate that interpersonal coupling was stronger for polyphonic textures (ambiguous leadership) than homophonic textures (clear melodic leader), and this difference was greater in early portions of phrases than endings (where coordination demands are highest). Results were cross-validated against an analysis of audio features, showing links between phase locking values and event density. This research produced a system, huSync, that can quantify synchronization in small groups and is sensitive to dynamic modulations of interpersonal coupling related to ambiguity in leadership and coordination demands, in standard video recordings of naturalistic human group interaction. huSync enabled a better understanding of the relationship between interpersonal coupling and musical structure, thus enhancing collaborations between human and computer scientists.

[…]ity to interact with users. These advances are being further propelled by applications in human motion analysis and
understanding coordination of human behaviors [1], [2]. With a wide range of methods available to track human motion today, there is great potential in utilizing them to understand various behavioral aspects and responses of the human body. Humans exhibit phenomenal capabilities in synchronizing joint actions and coordinating at the interpersonal level in a non-verbal manner. This is observed particularly in musical […]

This paper is organized as follows: in Section II, we highlight the hypothesis and research questions that are raised; in Section III, we present existing computational approaches for the analysis of synchronization and relevant studies that have examined interpersonal synchronization and entrainment in small groups, particularly musical ensembles; Section IV describes the huSync computational framework and system, as well as an instance of the framework, with a detailed methodology and calculation routine, explained using a simulated example, to compute dyadic synchronization; Section V presents the dataset, with a sub-section dedicated to the implementation of huSync on this dataset and the parameters utilized for our use case to perform the analysis. We then present statistical results in Section VI, followed by Section VII, where we discuss them. We conclude the paper by highlighting limitations and possible future research in Section VIII.

Our first objective is to develop a computational framework and a system for the automated analysis of interpersonal coordination in small groups, considering cases of clear leadership by an individual member as well as cases of egalitarian leadership distributed throughout the group. In our computational approach, we obtain motor, postural, and acoustic data in a non-intrusive manner, compute the synchronization of motor and postural features by applying consolidated techniques, and provide outputs that are robust with respect to the different conditions addressed (e.g., either clear or egalitarian leadership). Our second goal is to exploit this computational approach to investigate how musical texture and position within musical phrases affect interpersonal coordination in a professional music group performing in two constellations that are common in Western chamber music: a string quartet (consisting of two violins, viola, and cello) and a clarinet quintet (i.e., a string quartet with an added clarinet soloist). This is intended both to provide evidence of the robustness of the proposed framework and system and to increase knowledge of the mechanisms that underlie interpersonal coordination in small groups.

Musical phrases are analogous to phrases or sentences in speech to the extent that they are meaningful organizational units that would be perceived as coherent or complete if presented in isolation. We consider phrases to be sections […]

Social dynamics and interpersonal synchronization have been studied in many fields. For example, in psychotherapy settings, studies have analyzed temporal changes in global body movement using video-based quantification techniques such as motion energy analysis (MEA), a frame-differencing method (illustrated in the sketch below), to measure synchrony between patient and counselor during psychotherapeutic sessions [23], [24], [25], [26]. While MEA is a simple approach, a critical issue is that, since it quantifies frame differences within a region of interest (ROI), it is not sensitive to the direction of movement within the ROI. Thus, someone who touches their face often will exhibit higher head movement than someone who does not [23]. For unidirectional face-to-face communication, Yokozuka and colleagues [27] used wireless accelerometers attached to the foreheads of the speaker and listener to analyze head motion synchronization and empathy using phase and frequency differences. Instruments attached to the body make participants uncomfortable, which impedes naturalistic movement.

Among small group ensembles, MoCap systems have been used extensively to study interpersonal coordination with non-linear methods, particularly between performers playing music together […]. In Burger et al., MoCap data was processed to represent whole-body swaying and bouncing motions among participants. Period- and phase-locking behavior was observed in full-body music-induced movements by calculating the circular mean of movement phases and beat locations for each participant, with results informing our understanding of how humans entrain to music. While data can be captured with MoCap systems at high frequencies, good accuracy, and low noise, such specialized systems can be expensive, pose methodological issues [35], and restrict movement due to the use of tight-fitting motion-tracking suits. Marker-less methods are emerging as good alternatives to MoCap systems for synchronization studies in small groups, as seen, for example, in a study by Hadjakos et al. [36], who used a Kinect camera to analyze head movements and study synchronization in a violin duet performance.
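To make the frame-differencing idea behind MEA concrete, the following is a minimal sketch (not the original MEA implementation) that sums absolute pixel differences between consecutive grayscale frames within a fixed ROI using OpenCV; the video path and ROI coordinates are placeholders.

```python
import cv2
import numpy as np

def motion_energy(video_path, roi):
    """Rough MEA-style motion energy: sum of absolute frame
    differences inside a region of interest (x, y, w, h)."""
    x, y, w, h = roi
    cap = cv2.VideoCapture(video_path)
    energies = []
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)[y:y + h, x:x + w]
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)[y:y + h, x:x + w]
        # Motion energy is direction-blind: only the magnitude of pixel
        # change is accumulated, which is the limitation noted above.
        energies.append(np.abs(gray.astype(int) - prev.astype(int)).sum())
        prev = gray
    cap.release()
    return np.array(energies)

# Hypothetical usage: ROI covering one person's head region.
# energy = motion_energy("session.mp4", roi=(100, 50, 200, 300))
```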

With huSync, we present a system that instead utilizes a pose estimation algorithm on video sequences and computes Phase-Locking Values (PLV) to study the interaction of social signals in small group setups. Compared to the computational approaches discussed above, huSync is a non-intrusive method to study interaction in small groups in naturalistic contexts and eliminates the dependency on any hardware for tracking body movements. PLVs have been used to quantify interpersonal coordination at the level of body motion and brain activity in a wide range of social interaction tasks [37], [38], [39], [40], [41], [42], suggesting that PLV is a reliable measure for studying cognitively mediated contributions to the synchronization process. Indeed, phase locking is a pervasive concept in computing interactions in non-linear and complex systems, and PLV in particular is a commonly used interaction measure in diverse domains [43], [44], […], [50], [51], [52].

[…] which are not technically required for sound production but nevertheless take place during performance (e.g., head nods and swaying of the torso) [53]. Ancillary motion may be the key to understanding the social communicative effects of group music making. Results outside the music domain indicate that greater head motion synchronization occurs during moments of high empathy in face-to-face communication [27]. This finding suggests that the degree of empathy can be assessed from the correlation between the phase and frequency of head motion in setups where co-actors are in visual contact. Empathy can be considered an innate capacity for understanding others' thoughts and feelings, and is among the core components that enable musicians to engage socially with one another during performances [54], [55]. Empathy contributes to feelings of social bonding and behavioral contagion among individuals in groups, leading to higher states of synchronization in upper-body/head movements [27]. Musical ensembles can therefore be considered not merely groups of synchronized individuals, but systems for social connection in which empathy facilitates the information transfer between performers by enhancing synchronization states. Rhythmic synchronization of upper-body movements, and particularly the head, is pertinent and sometimes inevitable in a musical ensemble, presumably emerging from high degrees of empathy, agreement, and shared joint goals.

Ancillary body motion also plays a role in regulating an individual's performance, conveying musical structure, expressive intentions, and underlying musical meaning to others in a group or even the audience […]. Previous research on small group interactions has demonstrated that the coordination of head motion and body sway is positively correlated with the coordination of sound onsets, although the relation is not perfect [4], [28], [63], suggesting that visual and auditory information provide parallel channels for musical communication [57]. Additionally, the synchronization of non-verbal elements of expression takes place across multiple temporal scales, with head motion in particular being associated with higher states of connectedness [64], [65]. Correspondingly, the head movement synchronization of performers in a group can serve as a good metric to identify whether or not they are performing cohesively.

[…] in salience (i.e., homophonic textures). It is often assumed that in such cases the melody player serves a leadership role in the ensemble (even if only temporarily) [11], [12], [13]. In other textures, separate parts can each have independent melodic content that proceeds simultaneously (i.e., polyphonic textures). In these cases, the situation is more […]

[…] computational framework adopted for huSync. It includes four blocks and is grounded on a well-established conceptual framework for the analysis of expressiveness conveyed using body movements and gestures alike [56], [79]. The first block, multi-modal signals (Fig. 1 (A)), consists of information and data that can be sourced from different modalities (e.g., audio, video, heart rate, respiration rate, and so on). The second block, feature extraction (Fig. 1 (B)), entails extracting raw data from these multi-modal signals and could include pre-processing steps (e.g., up- or down-sampling, interpolation, realignment, and normalization) to make sure that all signals are compliant with one another when perform[…] obtained from the feature extraction block (Fig. 2 (B)).

The key-point of interest can be a single key-point or a feature computed between multiple key-points. As part of our feature extraction step (Fig. 1 (C)), using the data extracted from the json file, we compute the Euclidean distance from the raw coordinate data, which is available in (x, y) format. When processing videos with pose estimation algorithms, the data can be quite noisy, and it is important to check whether filtering is required. huSync implements the Savitzky-Golay filter, if needed, since it tends to preserve the phase and essential features of a signal [80], [81]. We then ascertain the size of the dataset to be consumed by the huSync model to analyze changes in synchronization level over the time period of interest. In our specific use case, answering the research questions raised in Section II requires analyzing the start, middle, and end of musical phrases; hence the total number of datapoints should be divisible by 3 and also adaptive to the step-size chosen in the next step, so that all data points fall within the window width. When this condition is not met, extra rows in the data file can be dealt with by truncating the dataset at the end of the phrase segment.
Additionally, if there are fewer rows than required, the existing data can be augmented at the extremities using polynomial or linear extrapolation. While it did not happen in our case, if loss of information is observed within a phrase segment, it can be dealt with by using cubic spline interpolation to fill the gaps [82].

We use a sliding window approach that steps through each portion of the video data so as to capture both local and global trends. In our simulated example we use a window size of 5 and a step-size of 2, and thus we have 6 windows. […] As illustrated in our simulated example, once the FFT is applied over the data of Participants 1 and 2, we extract the phase angles from the complex values. […] After obtaining the phase angles for each participant, we proceed with computing relative phase angles (differences between the phase angles) for all possible pairs; in our simulated example this is between the two participants, computed for each time step and frequency bin as the difference between the participants' phase angles.

[…] The PLVs obtained in the previous step form an array whose length equals that of a single window, since there is one PLV for each frequency bin. Here, a cut-off frequency can be utilized to discard frequencies beyond a threshold, while also excluding the DC component from the computation. As seen in Fig. 5, once the PLVs are calculated, we average them across the frequency bins of interest to obtain a single value (avgPLV, or averaged PLV), which is our final value for dyadic synchronization between a pair. Here, PLV and avgPLV are computed using (3):

$$\mathrm{PLV}(f) = \left|\frac{1}{N}\sum_{n=1}^{N} e^{j\,\Delta\phi_n(f)}\right|, \qquad \mathrm{avgPLV} = \frac{1}{F}\sum_{f=1}^{F}\mathrm{PLV}(f) \tag{3}$$

where Δφ_n(f) is the relative phase angle of the pair at window n and frequency bin f, N is the number of windows, and F is the number of frequency bins of interest (a Python sketch of this routine is given below).

[…] pre-defined parameters, covered in Section V-B. In Table 1, we summarise the full dataset and the specific phrases selected in terms of phrase duration (min, max, median, and average) and count. […]

2) Musical texture (polyphonic, where leadership is ambiguous due to the lack of a clear distinction between melody and accompaniment, versus homophonic, where there is an unambiguous melodic leader).
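As a rough illustration of the calculation routine in (3), the sketch below reproduces the simulated example in Python (the language of the accompanying notebook, although this is our own simplified code, not the huSync notebook's implementation): two synthetic distance signals are segmented with a sliding window (size 5, step 2, yielding 6 windows), phase angles are extracted from the FFT of each window, and the relative phases are collapsed into per-bin PLVs and a single avgPLV.

```python
import numpy as np

def sliding_windows(x, size, step):
    """Segment a 1-D signal into overlapping windows."""
    return np.array([x[i:i + size] for i in range(0, len(x) - size + 1, step)])

def avg_plv(sig1, sig2, size=5, step=2, bins=None):
    """Dyadic synchronization following (3): PLV per frequency
    bin across windows, then averaged over the bins of interest."""
    w1 = sliding_windows(sig1, size, step)
    w2 = sliding_windows(sig2, size, step)
    # One complex spectrum per window; one phase angle per frequency bin.
    ph1 = np.angle(np.fft.fft(w1, axis=1))
    ph2 = np.angle(np.fft.fft(w2, axis=1))
    dphi = ph1 - ph2                                 # relative phase, windows x bins
    plv = np.abs(np.exp(1j * dphi).mean(axis=0))     # PLV(f), one value per bin
    if bins is None:
        bins = range(1, size)                        # default: all bins except DC
    return plv[list(bins)].mean()                    # avgPLV

# Simulated example: 15 samples -> 6 windows of size 5 with step 2.
rng = np.random.default_rng(0)
t = np.arange(15)
p1 = np.sin(2 * np.pi * 0.2 * t) + 0.1 * rng.standard_normal(15)
p2 = np.sin(2 * np.pi * 0.2 * t + 0.5) + 0.1 * rng.standard_normal(15)
print(avg_plv(p1, p2))
```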

Each of the videos was annotated using ELAN (an annotation tool for multimedia files) [84] in accordance with a musicological analysis based on the published score. In order to mitigate noise that can be introduced by personal behavioral habits of performers before or after a phrase has been played, such as shaking the legs, rotating the arms, or readjusting their seating position, the annotations should be made carefully and aligned as accurately as possible with the start and end of musical phrases. Annotated features included phrasing, textural classification, the number of instruments currently playing, and instrument roles (e.g., melody, counter-melody, or harmonic accompaniment), which were indicated in separate tiers in the ELAN interface. Information from each tier within the annotated ELAN file for each piece was exported to extract video timecodes for each phrase and its textural classification.
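ELAN can export tiers as delimited text; a minimal sketch of turning such an export into per-phrase timecodes might look as follows. The column names (`tier`, `start_ms`, `end_ms`, `label`) and tier names are hypothetical and would need to match the actual export settings and annotation scheme.

```python
import pandas as pd

# Hypothetical tab-delimited ELAN export with one row per annotation.
annotations = pd.read_csv("borodin_annotations.txt", sep="\t",
                          names=["tier", "start_ms", "end_ms", "label"])

# Keep phrase boundaries and attach the texture label overlapping each phrase.
phrases = annotations[annotations["tier"] == "phrase"].reset_index(drop=True)
textures = annotations[annotations["tier"] == "texture"]

def texture_of(phrase):
    """Texture annotation overlapping the phrase midpoint."""
    mid = (phrase["start_ms"] + phrase["end_ms"]) / 2
    hit = textures[(textures["start_ms"] <= mid) & (textures["end_ms"] >= mid)]
    return hit["label"].iloc[0] if len(hit) else None

phrases["texture"] = phrases.apply(texture_of, axis=1)
```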

Phrases were selected based on the following criteria: […] Table 1 reports the number of selected phrases and their min, max, median, and average duration.

[…] are arranged as a table with x and y coordinates for each participant in separate columns. We analyze dyadic synchronization for all pairs of performers; the total number of possible dyad combinations is 6 for Borodin (n=4, r=2) and 10 for Brahms (n=5, r=2). We did evaluate the use of a Savitzky-Golay filter for our data, but did not observe any major differences with its use and decided to exclude it from the data processing phase. Using the coordinate information, the Euclidean distance between each time step of the trajectory is computed for every participant and arranged in separate columns. We then proceed with using a sliding window to segregate our data for each participant. Based on previous studies, tests were performed by varying the duration or size of the window to inspect our data across multiple levels of temporal resolution and statistical significance, and we decided to proceed with a window size of 30 and a step-size of 5 [85]. Given our window step-size, the dataset also had to be divisible by 5 so that all data points fit within the window width. We truncate the data in case of extra data points and extrapolate to fill missing values. For example, if our dataset contains 453 data points, we truncate it to 450 to arrive at the nearest multiple of 5 and 3, and if we have 447 data points, we extrapolate 3 data points to arrive at 450. On applying the FFT to the windowed distance data, we extract the phase angles and compute relative phase angles for all possible pairs. Analyzing the frequency distribution with a window size of 30, a 10 Hz cut-off means excluding all values above the 11th value, and also excluding the 1st, since it is the DC component. PLV is computed for each frequency bin and then averaged across all frequency bins of interest.
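Under the assumption of 30 fps video (which makes each FFT bin of a 30-sample window 1 Hz wide, so the 10 Hz cut-off keeps bins 1 through 10), the parameter choices above translate into something like the sketch below. It reuses the `avg_plv` helper from the earlier example, enumerates dyads with `itertools.combinations`, and trims each distance series to a multiple of 15 (divisible by both 3 and 5); the `distances` mapping is a placeholder.

```python
from itertools import combinations
import numpy as np

def fit_length(x, multiple=15):
    """Truncate (or linearly extrapolate) so len(x) is a multiple of 15,
    i.e., divisible by both 3 (phrase thirds) and 5 (the step-size)."""
    target = round(len(x) / multiple) * multiple    # 453 -> 450, 447 -> 450
    if target <= len(x):
        return x[:target]
    # Extrapolate the tail from the last two samples (sketch only).
    slope = x[-1] - x[-2]
    tail = x[-1] + slope * np.arange(1, target - len(x) + 1)
    return np.concatenate([x, tail])

# distances: dict mapping performer name -> head-trajectory distance series.
bins = range(1, 11)   # 1..10 Hz at 30 fps; bin 0 (DC) is excluded
pair_plvs = {
    (a, b): avg_plv(fit_length(distances[a]), fit_length(distances[b]),
                    size=30, step=5, bins=bins)
    for a, b in combinations(distances, 2)   # 6 dyads for n=4, 10 for n=5
}
```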

We performed our analyses on a total of 44 phrases; Table 1 provides a group summary of the chosen dataset. These phrases met our criterion of a good balance between polyphonic and homophonic textures, while also taking into account the duration of each phrase and the quality of the data obtained from pre-processing the videos with a pose estimation algorithm.

The PLV results are first presented descriptively, and then the results of analyses of variance (ANOVA) are reported. Performances of the Brahms and Borodin pieces were analyzed separately due to the differing number of performers in each piece. PLVs for all pairs for each piece were entered into an ANOVA that included Phrase Position (Start, Middle, End) as a within-subjects factor and Texture (Homophonic, Polyphonic) and Pair (i.e., each separate […])

The ANOVA results are illustrated for the Brahms performance in Table 2, and for Borodin in Table 3. Values high[…]

Overall, these results indicate that for both pieces, PLVs were reliably higher (hence interpersonal coupling between performers was stronger) for polyphonic than homophonic textures, though this effect of texture varied over the course of musical phrases. Specifically, the effect of texture was reduced at the end of phrases, due to decreases in coupling strength in polyphonic textures and increases in coupling strength in homophonic textures.
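The exact factor structure is cut off above, and the paper does not state which software ran the ANOVA, so the following is only an analogue: a repeated-measures sketch in Python with statsmodels that treats each dyad as the "subject", takes Phrase Position and Texture as within factors, and averages the multiple phrase observations per cell. The input file and column names are hypothetical.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long format: one row per (pair, phrase, position) observation with
# columns 'pair', 'position' (Start/Middle/End), 'texture', 'plv'.
long = pd.read_csv("brahms_plv_long.csv")   # hypothetical file

model = AnovaRM(long, depvar="plv", subject="pair",
                within=["position", "texture"],
                aggregate_func="mean")       # average repeats per cell
print(model.fit())
```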

While our main analysis focuses on ensemble coordination of co-performer body motion, we conducted an additional analysis to examine the relationship between the synchronization of body movements, which provides visual cues, and ensemble sounds. […] we included estimates of 'pulse clarity' and 'event density', which were calculated using the 'mirpulseclarity' and 'mireventdensity' functions from the MIRtoolbox in MATLAB [87]. Pulse clarity is a feature that reflects the strength of rhythmic beats, while event density indicates the average frequency of events (i.e., the […]) […] Table 4.
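The cited functions are MATLAB-specific; for readers working in Python, a rough analogue of event density (detected note onsets per second) can be obtained with librosa's onset detector, as sketched below. This approximates, rather than reproduces, mireventdensity, and the audio path is a placeholder.

```python
import librosa

def event_density(audio_path):
    """Approximate event density: detected note onsets per second."""
    y, sr = librosa.load(audio_path)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    duration = librosa.get_duration(y=y, sr=sr)
    return len(onsets) / duration

# e.g., event_density("brahms_phrase_01.wav")
```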

To assess potential effects related to these audio features, we ran a linear mixed effects model analysis using the lmer package [88] in R [89] with PLV as the dependent […] [47], [63]. Future work with multitrack audio would allow the relationship between auditory and visual information to be investigated in greater detail, including the assessment of correspondence between leader-follower relations across modalities.
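The model specification is truncated above, so the following is a sketch only: an analogous mixed model with the two audio features as fixed effects and a random intercept per dyad, written with statsmodels in Python rather than lmer in R. The predictor set and file name are our assumptions, not the paper's exact formula.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("plv_with_audio_features.csv")  # hypothetical file

# PLV predicted from the MIR-derived audio features, with a random
# intercept for each performer pair (analogous to lmer's (1 | pair)).
model = smf.mixedlm("plv ~ event_density + pulse_clarity",
                    data=df, groups=df["pair"])
print(model.fit().summary())
```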

The current study had two prime objectives. The first was to develop and present a computational framework and a system to study small-group interactions involving non-verbal social communicative behaviour. huSync can be implemented on video sequences, which permits studies to be performed in a naturalistic context without the interference associated with motion capture setups. Second, we wanted to put huSync through a test case scenario addressing research questions concerning the relationship between interpersonal coordination of body movements and musical structure. For this specific use case, huSync appears to be a practical alternative technique for quantifying dyadic synchronization between co-performers in musical ensembles based on the automated analysis of human body movements. The outcomes of this investigation are thus methodological and empirical in nature, informing technical aspects and conceptual issues relevant to examining real-time human interaction and non-verbal communication in naturalistic settings.

On the methodological side, our approach progresses through a structured funnel of steps, where kinematic information is gathered from standard video recordings in a marker-less and non-intrusive manner. This kinematic information is then used to quantify dyadic synchronization between musical performers within a group ensemble, indexed as phase-locking values, and this routine is applied exhaustively to all possible pairs in the group. An advantage of this approach is that it yields information about coupling between specific individuals, whereas a global measure does not necessarily offer that level of specificity. The alternative is complicated and rather difficult to interpret when data pertain to natural behavior (in contrast to data from controlled experiments where independent variables are systematically manipulated).

As an empirical case study, we applied the above techniques for body motion analysis to investigate the effects of two aspects of musical structure (texture and phrase position) […] contact across phrase positions [12], [51], [90].

We evaluated huSync as a system to quantify group coor[…] Additionally, this highlights the relevance of both visual and audio cues when assessing interpersonal synchronization in musical groups. Overall, the findings suggest that huSync is sensitive to modulations of interpersonal coupling related to ambiguity in leadership and coordination demands in standard video recordings of naturalistic human group interaction.

The proposed 'huSync' framework and system provides a reliable and non-intrusive alternative to current methods for the automated analysis of human body movements and associated qualities such as degrees of interpersonal synchronization. It can help in the study of such niche but ecologically valid aspects of human movement sciences, opening an avenue where marker-less technologies can be utilized extensively. This is evident in the use case of musical ensemble performances, where we evaluated the method, and it also has the potential to be extended to capturing non-verbal social signals in other domains of group behaviour and human interaction more generally. As a concrete outcome, we provide a well-structured jupyter notebook (link) that includes functions designed and implemented to process the data extracted from pose estimation algorithms by converting them into structured csv files, followed by the calculation routine for computing phase locking values, thus quantifying dyadic synchronization. An especially promising benefit of the huSync model is that it can be applied to standard videos recorded across a wide range of contexts, opening the door to analyzing the vast troves of historical material available in archives and on the Internet. The outcomes of the research will thus potentially have broad impact across diverse disciplines, including computer science, psychology and cognitive neuroscience, and music psychology. The methodological applications of huSync can be leveraged for further empirical discoveries related to human joint action, group behavior, and social cognition [10].

There are several areas to improve upon and overcome in future research. At present, there is a higher amount of noise in tracking conventional video compared to marker-based systems. This issue becomes particularly acute when examining higher-order kinematic variables, such as velocity and acceleration (because computing derivatives via differentiation amplifies noise), which is one reason why we focused on distance data. Pose estimation algorithms provide better results with regard to recognizing, isolating, and predicting the pose of participants in videos where the foreground and background are well-differentiated. This suggests that figure-ground differentiation is an important aspect of quality control.
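The derivative-noise point above can be checked numerically in a few lines: repeated finite differencing of white measurement noise inflates its standard deviation (by a factor of sqrt(2) for the first difference and sqrt(6) for the second), which quickly swamps velocity and acceleration estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
noise = rng.standard_normal(100_000)   # position noise, sigma = 1

velocity_noise = np.diff(noise)        # first derivative (per frame)
accel_noise = np.diff(noise, n=2)      # second derivative

# Expected stds: 1, sqrt(2) ~ 1.41, sqrt(6) ~ 2.45
print(noise.std(), velocity_noise.std(), accel_noise.std())
```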

Additionally, the seating position and the direction of the motion trajectories exhibited by participants are important aspects to take note of, and they play an influential role in quantifying dyadic synchronization. For our use case, the head moves predominantly in a back-and-forth manner during moments […]

2) Granger Causality, to quantify mutual influence / leadership by studying the directionality of coupling (which should be more evident when there is a clear leadership hierarchy, as in homophonic textures), helping us look at the effects of musical structure on group coordination and communication simultaneously, at short timescales related to musical beats and longer timescales related to expressive body sway [50], [95], [96] (a minimal sketch follows).
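As a pointer to how such a directionality analysis might start, statsmodels provides a standard bivariate Granger test; the sketch below asks whether one performer's head-distance series helps predict another's. The series names are placeholders, and lag selection and stationarity checks are omitted.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Columns: [follower, candidate_leader]; the test asks whether the
# second column Granger-causes the first at lags 1..maxlag.
data = np.column_stack([dist_violin2, dist_violin1])  # placeholder series
results = grangercausalitytests(data, maxlag=10)

# p-value of the F-test at lag 10, for example:
print(results[10][0]["ssr_ftest"][1])
```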