Quality of Experience in Telemeetings and Videoconferencing: A Comprehensive Survey

Telemeetings such as audiovisual conferences or virtual meetings play an increasingly important role in our professional and private lives. For that reason, system developers and service providers will strive for an optimal experience for the user, while at the same time optimizing technical and financial resources. This leads to the discipline of Quality of Experience (QoE), an active field originating from the telecommunication and multimedia engineering domains, that strives for understanding, measuring, and designing the quality experience with multimedia technology. This paper provides the reader with an entry point to the large and still growing field of QoE of telemeetings, by taking a holistic perspective, considering both technical and non-technical aspects, and by focusing on current and near-future services. Addressing both researchers and practitioners, the paper first provides a comprehensive survey of factors and processes that contribute to the QoE of telemeetings, followed by an overview of relevant state-of-the-art methods for QoE assessment. To embed this knowledge into recent technology developments, the paper continues with an overview of current trends, focusing on the field of eXtended Reality (XR) applications for communication purposes. Given the complexity of telemeeting QoE and the current trends, new challenges for a QoE assessment of telemeetings are identified. To overcome these challenges, the paper presents a novel Profile Template for characterizing telemeetings from the holistic perspective endorsed in this paper.


I. INTRODUCTION
More than 150 years after the invention of the telephone, state-of-the-art features such as video transmission and screen sharing prove that today's telecommunication technology has evolved well beyond mere speech-based, audio-only communication. With the advent of modern eXtended Reality (XR) technologies (i.e., virtual, mixed, or augmented reality), even more natural or more immersive telecommunication experiences are possible.
Human-to-human interaction over a telecommunication system, also referred to as mediated communication, is part of our daily life, both in professional and private contexts. Considering the different societal, economic, climate and technological changes over the last couple of decades, the relevance of such systems is still increasing.
Moreover, users are confronted with many different technical possibilities to communicate remotely, but are often experiencing high cognitive load and fatigue during such mediated communication sessions (see e.g., [1]). Accordingly, the demand for high-quality mediated communication is large and increasing, which in turn translates into system quality requirements, both from a service provider's and user's perspective.
In this context, a systematic analysis of the Quality of Experience (QoE) [2] of telecommunication systems, as it is perceived by the user, can help developers and service providers to improve their solutions and services. The notion of quality was already included in the early patents on telephone technology. For example, according to Richards [3], the original patent by Edison from 1877 stated that the carbon microphone was much better sounding than the initial design by Bell from 1876. Since then, quality and ultimately QoE assessment have developed into a well-established discipline in the telecommunications sector (see e.g., [4]).
However, with the developments mentioned above, new systems bring additional challenges and opportunities for both the user and the service provider. Therefore, existing QoE assessment approaches need to be continuously extended to new types of systems as well as new user expectations. For that reason, academic and industrial research as well as telecommunication standardization bodies are highly active, not only in developing new telecommunication solutions but also in developing corresponding new QoE assessment methodologies.

A. CONTRIBUTIONS OF THE PAPER
This paper provides the reader with an entry point to this comprehensive and still growing field of QoE assessment of mediated communication, with a focus on modern and near-future telemeeting systems as defined in Section II-B. To this end, a structured survey of relevant scientific literature is presented, systematically considering a large number of aspects that are needed to understand the QoE of telemeetings.
To illustrate the different aspects of telemeetings, Figure 1 visualizes a telemeeting system with its main technical components, connecting multiple participants with different in-
(5) Introduction of the Telemeeting Profile Template, that is, a tool that provides a set of quantifiable criteria for telemeeting QoE evaluation. This makes it possible to characterize telemeetings systematically and holistically from a QoE perspective, and to select the right approach for a given assessment task, see Section VIII.
Moreover, the paper takes the complexity of QoE (see Section II-A) into account by deliberately discussing both technical and non-technical aspects. This holistic perspective addresses both researchers and practitioners. Based on their multidisciplinary expertise, the authors are convinced that this holistic presentation will help technical experts to further improve telemeeting systems or to develop new methods, in particular for a technology-oriented assessment of telemeeting QoE, while adequately reflecting the different application scenarios and hence user and context factors. Along this line of thought, the paper identifies relevant, possible links between non-technical aspects and potential or proven technical approaches, so that these can be considered in further system or evaluation-method development.
Summarizing, it is argued that approaching telemeeting QoE from a holistic perspective has a number of benefits that foster further progress in both technology and QoE assessment of telemeetings.
VOLUME 4, 2016
Table 1 shows the structure of the rest of this paper, which consists of eight main sections and a number of respective subsections that are organized as follows. In Section II, the paper starts with a background on terms and concepts of telemeeting QoE. Then, Section III provides a detailed explanation of the process that the authors used to conduct the survey. Next, each of the following sections forms one of the five contributions of the paper. First, Section IV structures relevant mediated communication quality aspects in terms of QIFs. Second, Section V takes a closer look at the QoE-relevant processes. Third, Section VI provides an overview of current methods for telemeeting QoE evaluation. Fourth, Section VII gives an overview of current development trends in telemeetings, focusing on XR-based technology. Fifth, Section VIII presents a new approach for structuring telemeeting QoE assessment. For that purpose, a Telemeeting Profile Template is proposed to streamline the knowledge from the previous survey sections, indicating how the large body of QoE-related aspects can be applied for a holistic QoE evaluation of current and future telemeetings. Finally, a few closing remarks in Section IX conclude the paper.

II. TERMS AND CONCEPTS

A. DEFINITION OF QUALITY OF EXPERIENCE (QOE)
Based on [2], Quality of Experience (QoE) has been defined by the ITU-T as "The degree of delight or annoyance of the user of an application or service" [10]. That means, QoE is a construct that is formed inside a person's mind, see e.g., [2], [6], [8], and is based on the person's experience with an application, service or event, which here is a telemeeting. Accordingly, QoE is to be considered as a complex cognitive construct, resulting from technical aspects of a telemeeting system, but strongly influenced also by numerous other human and contextual aspects, see e.g., [2], [5]. The paper reflects this complexity by providing a systematic approach towards a holistic perspective on telemeeting QoE as already described in Section I-A.

B. DEFINITION OF TELEMEETINGS
Contemporary telemeeting systems can be realized in numerous ways, ranging from telephone conference bridges over audiovisual computer-based solutions and high-end telepresence rooms to systems using virtual, mixed, or augmented reality (often referred to as eXtended Reality, XR). The solutions can differ in various aspects such as specific devices, transmission technologies, collaboration and management features, and more. Correspondingly, a variety of terms are used to refer to such systems.
While those terms often give a general idea of the system characteristics, they are rarely formally defined and are sometimes even used inconsistently across contexts. For example, the difference between a telephone and a telepresence room is rather clear, at least for people who have used both systems. In turn, the term conferencing system usually refers to multiparty communication scenarios, while the term video conferencing system is also often used for a one-to-one video communication system.
To account for such aspects and to have a single term summarizing such different telecommunication systems, the term telemeeting is promoted by the International Telecommunication Union (ITU). A dedicated work group of ITU-T, ITU's telecommunication sector, addresses telemeeting QoE, namely Study . In its task description [11], the term telemeeting is used to cover all means of audio or audiovisual communication between distant locations. This is similar to the formal definition given in [12]: A meeting or conference at which people in different locations participate by means of telecommunications technology.
While this term is quite encompassing for all kinds of telecommunication systems, the two mentioned definitions suggest some limitations. These limitations are clarified in the following to better specify the scope of this paper.
First, a telemeeting uses speech as the primary communication modality, which may then be augmented by other modalities such as video, images, text, or, in the case of virtual or augmented reality, also haptic and olfactory information. Thus, any means of communication that does not include speech is not considered a telemeeting system. An exception to this are telemeeting systems for hearing-impaired people, which use text or video to replace the missing audio channel.
Second, to qualify as a telemeeting, the system should allow for bidirectional communication between the participants, meaning that unidirectional transmission as in radio or television broadcasts is not considered a telemeeting.

III. METHOD
One underlying goal for conducting this survey was to create a scientific basis for the Telemeeting Profile Template, a tool for a systematic characterization of telemeetings from a QoE perspective, see also Section VIII. In that respect, the literature survey was conducted hand in hand with the development of said tool as follows: The author team compiled relevant literature in an iterative process, combining a bottom-up and top-down approach. The goal of this procedure was to obtain a comprehensive list of relevant aspects from the literature and practical experience (bottom-up path), and to identify a structure for this list of aspects (top-down path).
As part of the bottom-up path, the authors compiled a set of individual aspects that are relevant from a QoE perspective. In the paper, these are referred to as Quality Influence Factors (QIFs), cf. e.g., [2], [5]. The authors used different sources for compiling the respective literature: First, all members of the author team included citations used in earlier work in the field that each co-author was aware of, based on their individual long-term expertise in the field. Second, dedicated literature search queries were conducted for each factor that was not already covered by the first compilation of literature, or when the authors saw the need to better understand a factor. Last, whenever feasible, the authors traced back original papers that were cited in the above-mentioned sources.
For some factors the scientific evidence was quite strong and direct, e.g., when studies found that a factor influences quality ratings. For other factors the evidence was more indirect in the sense that a factor has been shown to influence the communication, which in turn influences QoE. Accordingly, publications showing such a direct or indirect relevance of a factor for QoE were included in this survey; corresponding references can be found in Tables 2 to 5 in the column Quality Relevance. For further factors, the scientific evidence was less clear, e.g., when the aspect or similar terms have been mentioned in the reviewed literature, but little more information or evidence with respect to QoE was given. In many such cases, the author team identified background information which could serve as a starting point for further study; corresponding references can be found in Tables 2 to 5 in the column Background. However, there were still many factors that the authors considered highly relevant from practical experience, but for which no dedicated scientific literature was found; those cases are indicated by the "-" symbol in Tables 2 to 5.
As part of the top-down path, the authors had regular telemeetings in which they discussed a possible structure and the completeness of the list of factors, starting from the three main categories of QIFs and building on the different expertise of each co-author, to account for the fact that telemeetings can differ in various aspects. In addition, feedback collected on intermediate versions from further experts from ITU-T Study Group  served to refine the list as well as to confirm the practical relevance of those aspects for which little scientific evidence was found.
As explained above, much of the literature search for this survey was conducted with a focus on QIFs (Section IV). The combination of domain knowledge of the author team and dedicated literature search on individual aspects was also used for the remainder of the paper, i.e., the survey on QoE-relevant processes (Section V), the overview of widely adopted assessment methods of telemeetings (Section VI), and the discussion of current trends concerning telemeetings (Section VII).

IV. SURVEY ON QUALITY INFLUENCE FACTORS (QIFS)
To structure the numerous aspects that influence QoE, the concept of Quality Influence Factors (QIFs) with the three main categories Human Influence Factors (HIFs), System Influence Factors (SIFs) and Context Influence Factors (CIFs) has been introduced. In [2], a QIF is defined as: "Any characteristic of a user, system, service, application, or context whose actual state or setting may have influence on the Quality of Experience for the user." Reiter et al. [5], for instance, discuss a number of factors for QoE assessment in general; Akhtar and Falk [20] briefly summarize QIFs that should be considered in audiovisual multimedia quality assessment; Bouraqia et al. [21] give an overview of QIFs for video streaming applications, and Seufert et al. [22] provide a more detailed taxonomy of QIFs for HTTP-based adaptive streaming technology. Highly relevant for telemeetings and this paper are the publications by Baraković Husić et al. [23], who give an overview of QIFs for unified communication systems (i.e., integrated services combining telemeeting functionality with asynchronous communication means), and Vucic and Skorin-Kapov [24], who review a number of QIFs in the context of mobile audiovisual telemeetings. As all these publications show, the list of possible QIFs can become quite long. For that reason, this paper provides a more detailed structure based on the three main categories of QIFs as follows.
To start with, it should be noted that the three main categories of QIFs are not always fully separable, see for instance [5]. For that reason, the proposed structure also allows for categories that represent two or even all three types of influence factors, which are here referred to as Mixed Influence Factors (MIFs).
To further structure the list, a second grouping hierarchy was added using sub-categories. Individual QIFs are grouped into these sub-categories whenever they share some aspects, such as the State Inside Individual Participants as a subcategory of the group Human Influence Factors. Figure 2 represents the categories and the sub-categories in a Venn-type diagram that visualizes the possible overlap of the QIFs. Note that not all individual factors are plotted here, to enable better readability. Instead, the full list of QIFs is given in Tables 2 to 5, while even more detailed information about the individual factors can be found in the supplementary material of this paper [25], [26]. The next four sections provide a survey of respectively System, Context, Human and Mixed QIFs.
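The two-level structure described above can be pictured as a small data model. The following Python sketch is only an illustration under stated assumptions, not the authors' tooling: the class names `Category` and `QIF` are hypothetical, and the category assignment of the mixed example (Human plus System) is an assumption, since the text only states that the factor is a Mixed Influence Factor.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Category(Flag):
    """The three main QIF categories named in the text."""
    HUMAN = auto()    # Human Influence Factors (HIFs)
    SYSTEM = auto()   # System Influence Factors (SIFs)
    CONTEXT = auto()  # Context Influence Factors (CIFs)


@dataclass(frozen=True)
class QIF:
    """One influence factor with its main categories and optional sub-category."""
    name: str
    categories: Category
    sub_category: str = ""  # left empty when the text does not name one

    @property
    def is_mixed(self) -> bool:
        # A factor is "mixed" if it belongs to two or all three main categories.
        return sum(1 for c in Category if c in self.categories) > 1


# A factor that the text places under System Influence Factors.
media = QIF(
    name="Auditory and visual representation of participants and environments",
    categories=Category.SYSTEM,
    sub_category="Media richness aspects",
)

# A factor that the text places under Mixed Influence Factors.
# ASSUMPTION: the combination HUMAN | SYSTEM is chosen for illustration only.
knowledge = QIF(
    name="User's knowledge about the system capabilities and limitations",
    categories=Category.HUMAN | Category.SYSTEM,
)

mixed_names = [q.name for q in (media, knowledge) if q.is_mixed]
```

Using a flag type rather than a single label mirrors the Venn-type overlap: a factor can sit in the intersection of categories without a separate "mixed" label having to be stored.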

A. SYSTEM INFLUENCE FACTORS (SIFS)
System Influence Factors (SIFs) refer to the technical characteristics of a system that influence QoE. According to the Qualinet White Paper [2], SIFs can relate to content, media, network and devices, and refer, for example, to aspects ranging from signal capture over transmission to reproduction. This kind of signal processing perspective will be taken in Section IV-A1. Additional SIFs, which refer to technical characteristics concerning the user interaction with the system, are addressed in Section IV-A2.

1) SIFs related to the Signal Transmission over the System
This section outlines the general processing and transmission stages for the audiovisual signals between the different connected sites of a telemeeting. In that respect, this section refers to the following sub-categories of SIFs according to Table 2: Media richness aspects, Processing aspects, Network access and topology aspects, Time aspects. Since many different instantiations of telemeeting systems are possible, a generalized perspective is taken here. As the paper addresses quality and QoE, the descriptions in this section focus on the question of "What is happening to the information along the way between participants?" rather than on the question of "How are the processing and transmission steps realized?"
A typical approach in communication and media technology is to consider the end-to-end chain as a channel from the source/sender to the sink/receiver. In the case of interactive two-party scenarios like in traditional telephony, this representation typically includes the end-to-end chain in both directions and any signal paths between them. This makes it possible to include interaction-related aspects such as the impact of transmission delay, or signal processing stages such as echo cancellation that require the consideration of signals in both send and receive directions. Examples for telephony are given in [95], [7] and [96], for video telephony in [97]. In [98] and [28, Chap. 6.3], the authors also extended such considerations to multiparty scenarios and proposed an approach to analyze such multiparty settings in more detail and from a QoE perspective.
In this paper, a simplified view is used: Figure 3 visualizes the major components of a telemeeting system connecting N sites. At each site, one or more persons are located in a certain environment. Moreover, a number of additional objects of interest may be present in one or more of these environments, such as a physical whiteboard or a physical object that is the topic of the discussions, such as a prototype system or the like. From a QoE perspective, a relevant factor is the degree to which wanted as well as unwanted information from the participants, about the objects of interest and about the environment is transmitted over the system. This leads to a number of SIFs that are related to the media richness provided by the system, such as the auditory and visual representation of participants and environments.
To connect each site to the telemeeting, one or more end devices may be used. The end devices perform a number of processing steps, which are subsumed as a SIF in the sub-category Processing aspects. In modern telemeeting solutions, these steps usually consist of a number of subprocesses, which in addition are often interlinked. For example, the encoding and decoding of signals often combines signal processing with networking-specific mechanisms, and it can be carried out at different places, e.g., as part of the capture & reproduction, the signal enhancement, the network access, or the mixing stages.
Moreover, not all components are used in all system instances. For example, conventional telephone conference bridges use a central conferencing bridge on a server in the network, that is, no mixing blocks are needed in the end devices. In another example, the system connects the N sites using a client-side bridging technology, omitting the need for a central mixing bridge. In this case, the signal enhancement steps might be connected not only to the capture, reproduction and coding steps, but to mixing steps as well.
As these examples illustrate, differences between individual systems can be quite large, which makes it hard to come up with a general picture at a more detailed level. For that reason, Figure 3 simplifies this by showing five major processing steps in the end device: Capture & Reproduction, Signal Enhancement, En-/De-Coding, Signal / Stream Mixing (in the case of a peer-to-peer system, which requires client-side conferencing bridges), and Network Access. There is a vast amount of literature on the available technologies, ranging from electro-acoustic and electro-optic transducers (e.g., [69], [99]) over signal processing algorithms for signal enhancement and data compression (e.g., [36], [53]-[60], [63]-[68], [70]-[72]) to data error correction mechanisms for the packet streams (e.g., [100], [101]). Next to research work focusing on the technology, a number of publications discuss these technical aspects of telemeeting and telecommunication systems from a QoE perspective, e.g., [7], [51], [52], [61], [62], [95]. Moreover, the numerous standards of the International Telecommunication Union (ITU) and the Moving Picture Experts Group (MPEG, now ISO/IEC JTC 1/SC 29) are a rich source of detailed material for both technology and quality assessment standards [48]-[50].
From a QoE perspective, the method of mixing the signals, including any additional signal or data processing, is a relevant factor. This holds for both network conferencing bridges and client-side conferencing bridges. When comparing client-side and central bridging technologies, a typical QoE-relevant difference is that central bridging usually requires additional transcoding steps. In turn, client-side bridging may not need this, but requires higher computational power in the clients and better network connections for the multiple streams.
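The remark about "multiple streams" can be made concrete with a rough count of media streams per client. The Python sketch below rests on two simplifying assumptions that are not taken from the paper: a central bridge returns exactly one mixed stream to each client, and client-side bridging is a full mesh in which every client sends a separate copy of its stream to every other site. The function names are hypothetical.

```python
def streams_per_client(n_sites: int, topology: str) -> dict:
    """Return the number of uplink and downlink media streams per client.

    ASSUMPTIONS (illustrative only):
    - "central": one stream up to the bridge, one mixed stream back.
    - "client_side": full mesh, one stream to and from every other site.
    """
    if n_sites < 2:
        raise ValueError("a telemeeting needs at least two sites")
    if topology == "central":
        return {"uplink": 1, "downlink": 1}
    if topology == "client_side":
        return {"uplink": n_sites - 1, "downlink": n_sites - 1}
    raise ValueError(f"unknown topology: {topology!r}")


def total_streams(n_sites: int, topology: str) -> int:
    """Count all directed streams in the network for the given topology."""
    # Every directed stream is exactly one client's uplink under both
    # topologies sketched here, so summing the uplinks avoids double counting.
    return n_sites * streams_per_client(n_sites, topology)["uplink"] + (
        n_sites if topology == "central" else 0  # bridge-to-client streams
    )


five_site_mesh = streams_per_client(5, "client_side")
five_site_central = streams_per_client(5, "central")
```

Under these assumptions the per-client load stays constant with a central bridge but grows linearly with the number of sites for client-side bridging, which is one way to read the text's point about computational power and network connections.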
Finally, all sites and any network bridges are connected over a network, whose characteristics (e.g., bandwidth, round trip delay, queuing strategies of routers, used network protocols) may influence the transmission of the packet streams along the delivery chain. Obviously, this can impact QoE, e.g., when end-to-end transmission delays get too long for fluent conversations, or when packet losses occur during unreliable transport and media payload is being lost.
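To illustrate how such network characteristics accumulate along the delivery chain, the following Python sketch adds up per-stage delays and computes a packet loss ratio. It is a minimal sketch under stated assumptions: the stage names loosely follow the processing steps of Figure 3, while all numbers, the function names, and the delay threshold are hypothetical placeholders and not values from the paper.

```python
def one_way_delay_ms(stage_delays_ms: dict) -> float:
    """Sum the delay contributions of all stages from sender to receiver."""
    return sum(stage_delays_ms.values())


def packet_loss_ratio(packets_sent: int, packets_received: int) -> float:
    """Fraction of sent packets that never reached the receiver."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    if not 0 <= packets_received <= packets_sent:
        raise ValueError("packets_received must lie between 0 and packets_sent")
    return (packets_sent - packets_received) / packets_sent


def delay_is_critical(total_delay_ms: float, threshold_ms: float) -> bool:
    """Flag a delay that exceeds a threshold chosen by the evaluator."""
    return total_delay_ms > threshold_ms


# HYPOTHETICAL example values in milliseconds, for illustration only.
example_stages = {
    "capture": 10.0,
    "signal_enhancement": 15.0,
    "encoding": 20.0,
    "network": 80.0,
    "jitter_buffer": 40.0,
    "decoding": 10.0,
    "reproduction": 10.0,
}

total = one_way_delay_ms(example_stages)
loss = packet_loss_ratio(packets_sent=1000, packets_received=980)
critical = delay_is_critical(total, threshold_ms=150.0)  # threshold is a placeholder
```

The sketch deliberately treats delay as a plain sum and loss as a simple ratio; how these quantities map to perceived quality depends on the human and context factors discussed elsewhere in the survey.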

2) SIFs related to User-System Interaction
In the previous section, the focus was on the exchange of information between participants over the system. However, a further component to consider is the interaction of the participants with the telemeeting system to achieve this information exchange. This relates to the disciplines Human Computer Interaction, Usability Engineering, and User Experience Design. While many publications for research, teaching, and practice are available in these fields, these are only partly related to the scope of this paper on aspects that contribute to the QoE of telemeetings. Three relevant publications are, for example, [102], [89] and [90], which address complexity challenges such as possible information overload, interface design, and system structure. Moreover, in order to structure this vast field, we identified four different types of interaction with a telemeeting system or behavior when using it: setting up the system, interacting with the system's user interface during a telemeeting, choosing a communication medium, and adapting the user behavior to the system characteristics. With respect to the different categories of QIFs, the first three types of interaction can be approached from a System Influence Factors perspective. In that respect, the next sections IV-A2a to IV-A2c refer to the following sub-categories of SIFs according to Table 2: Operational aspects - setting up a telemeeting, Operational aspects - controlling an ongoing telemeeting, and Media richness aspects. The last type of interaction, adapting the user behavior to the system characteristics, relates more to the Human Influence Factors and will therefore be addressed later in Section IV-C3.

a: Operational SIFs - Setting up the system
Classical telemeeting solutions such as telephone conference bridges or fixed high-end telepresence rooms are systems which are prepared and set up by experts beforehand. While such systems are still in use today, they are increasingly complemented and partly superseded by individually used set-ups. With such legacy systems, the participants often just needed to dial in (telephone bridge) or use some control interface (telepresence room) to start the connection, but they were hardly ever requested to set up and configure the system and the connections as such. Looking at typical state-of-the-art telemeeting solutions, however, the situation is quite different: Today, participants are often also tasked with setting up the system or at least parts of the system. For example, common software-based solutions allow users to connect to the telemeeting using different devices such as a (laptop) computer, tablet or mobile/smart phone, and they allow users to connect extra headsets, handsfree terminals, cameras, and screens. In such scenarios, the participants need to select the proper audio and video devices, check the settings both in the telemeeting application and the operating system of the device, adapt the volume of microphone and loudspeakers, choose an appropriate local (wireless) network, etc. From a QoE perspective, this complexity of setting up and configuring the telemeeting system is highly relevant. This is particularly the case when the participants encounter any problems concerning the system setup, for example when this happens just before a telemeeting or when it happens often.
Next to such problem-oriented influences on QoE, today's configuration possibilities may also contribute to a positive QoE by empowering the user to do things on their own. On the one hand, modern telemeeting systems have automated so many technical steps that it is actually possible for non-experts to carry out the set-up on their own. On the other hand, once users have acquired sufficient experience and practice with the system, as many people likely did during the COVID-19 pandemic, it becomes easier for them to solve most problems on their own or to give advice to other participants. To the best of the authors' knowledge, there is no research work published on the impact of a telemeeting system's setup complexity on QoE, with the exception of the related work in [89]. Hence, future work is required to further investigate this aspect.
Modern software-based telemeeting systems support multiple features beyond audio and video communication, such as screen sharing, annotation features, text chat and/or the management of participants. Note that systems often differentiate between users who are hosting the telemeeting and those who are participating. The hosts usually have more possibilities to interact with the system than the other participants. For example, in some systems the host needs to give screen sharing permission to others, or the host can define which additional features can be used. Thus, users of modern telemeeting systems are not only requested to set up the system before a telemeeting (see Section IV-A2a), but often they are also required to control the system during the telemeeting.
For that reason, the service providers or application developers are faced with key questions from the domains of User Experience Design and Usability Engineering, such as: How to design the user interface of the telemeeting system in such a way that hedonic and pragmatic needs are fulfilled? Hassenzahl and Tractinsky [103] go beyond this focus on solving problems and needs and recommend that designers "design for pleasure rather than absence of pain".
With a focus on mobile phones and services, Park et al. [91] approached the topic of designing a good user experience from an analysis perspective. Based on a literature review, interviews and an observation study, they identified a comprehensive list of sub-elements of User Experience and grouped them into three categories: usability, affect, and user value. This list reflects the resulting effects rather than the causes, and consists of aspects such as simplicity, effectiveness, learnability, flexibility, etc. Further work could obtain more insights about how telemeeting aspects contribute to these items, and in turn to User Experience and QoE.
Concerning the link between User Experience and Quality of Experience, Wechsung and De Moor [104] discussed the general similarities and differences between these two concepts. A short characterization of both concepts in the form of a table can be found in the appendix of that publication, which is publicly accessible online [105].
Focusing on software applications running on mobile phones, among them also communication apps, Ickin et al. [106] obtained a number of insights on QIFs. Two such factors were the performance and the user interface design of the applications. For the latter factor, the study participants reported issues such as locations and sizes of buttons, resizing and scrolling problems, or inefficient manual input. Ultimately, the choice of which application will be used in a given situation may be affected by such aspects, as well as by characteristics more closely related to communication and media transmission.

c: SIFs related to the Choice of the Communication Medium
As mentioned in the beginning of Section IV-A2b, users of state-of-the-art telemeeting systems have the possibility to choose between different communication modalities: audio-only or audio with video, additional functions such as screen sharing, text chat, file transfer, joint document editing, etc. Moreover, users can also combine or switch between these modalities during the telemeeting.
Next to the user-interface-design perspective taken in the previous section, one can also look at the impact of this flexibility from a more contextual point of view: When, why, and how do participants select a specific one from those different communication modalities and features?
For such questions, concepts building on the Media Richness Theory could form a starting point. According to the theory proposed by [29], different types of media can be categorized by the richness of information they provide, for example with text being of less richness than video. This theory was originally developed in [29] with a focus on communication in management contexts, and it was developed at a time when many of today's communication features were far from being suitable for mass market introduction, either due to technological, societal, or financial reasons. Consequently, studies have revisited those concepts over the years for newly emerged communication technologies and for different contexts. In [30], for example, it was concluded that remote working teams would actually benefit from being able to select between differently rich media according to the tasks at hand and the people's cognitive styles (i.e., the way they formulate and process concepts and information), as opposed to a general advantage of a "higher" media richness.

Factor | Quality Relevance | Background
[...] | [109], [110] | [111], [112]
Optical/lighting situation | [110] | [113]
Time Aspects
Temporal changes of the context, e.g., difference in usage time | - | [114]
Time of the day for participants in different time zones | - | -
Notes: The column Quality Relevance cites either empirical studies directly investigating the factor's effects or publications discussing the relevance more from a theoretical point of view. Moreover, the relevance can refer either directly to QoE or to perception or communication aspects, which in turn are relevant for QoE. When no references are given, the factor is considered to be relevant from a practical perspective and requires further study. The column Background provides pointers to further literature.

The discussion so far refers to situations in which individual participants are required to choose an appropriate communication channel. However, there are also situations in which it is not the task of the individual but of the telemeeting host to take this decision. Examples of such cases are virtual classroom scenarios, in which the teacher chooses the communication channel according to the didactic needs and as permitted by the available resources. Other examples are virtual discussions or standardization meetings with a large number of participants, in which the meeting chair can opt to limit the communication channels upfront, e.g., to enforce a more formalized communication behavior of participants.
At first glance, the act of choosing a proper communication medium suggests that these considerations fall under the category Mixed Influence Factors (see Section IV-D and Table 5), which is true for aspects such as the user's knowledge about the system capabilities and limitations. One can also take a technology-driven perspective here, emphasising that a number of technical characteristics determine the media richness that the system is able to provide. Accordingly, such aspects are collected here as Media Richness Aspects, a subcategory of the System Influence Factors, see Table 2.

B. CONTEXT INFLUENCE FACTORS (CIFS)
Context Influence Factors (CIFs) refer to the contextual characteristics, more specifically to the physical, temporal, social, economic, task and any technical and information context, that influence QoE [2], [5]. With respect to telemeetings, CIFs essentially refer to the overall situation in which the telemeeting takes place. This means, not only the physical environments at the connected sites and temporal aspects play a role, but also the communication scenario and use case as such. In that respect, this section refers to all four sub-categories of Context Influence Factors in Table 3: Use Case, Communication Scenario, Communication Environment, Time Aspects.
It should be noted that this section touches only briefly upon the three latter types of factors, while a major part of this section concerns the use case and more specifically the topics of telemeeting purposes and collaborative working. The motivation for giving these topics more room is to provide a foundation for future work on a better understanding of the kind of situations in which a telemeeting is a suitable or even the most suitable choice of communication medium.

1) CIFs related to Environmental and Temporal Contexts
In the field of standardized quality assessment, the physical context, i.e., the communication environment, is usually considered by defining and setting requirements for the acoustical and lighting situation to be met when conducting a quality assessment test, see e.g., [111]- [113]. Example studies that have investigated the impact of the acoustical and lighting situation on QoE are presented in [109], [110].
With respect to the temporal context, to the best of the authors' knowledge, little research has been conducted on the impact on QoE when participants are located in different time zones or when there are differences in the context due to different time-linked social uses and habits, e.g., when having a telemeeting on a weekend vs. weekday, or during a local festivity. However, there is some body of knowledge on the complex relation between temporal changes of the system characteristics and the QoE formation processes, see Section V-B4. These considerations address a different aspect of time. Another aspect regarding time is the conversation structure of participants during a meeting, which is discussed in Section V-A.

Figure 4: Two-dimensional circumplex model of group tasks, adapted from [108] and previously presented in [28].

2) CIFs related to the Communication Scenario
Next to the communication environment and time-related aspects mentioned in the previous section, additional QIFs refer to how many sites are connected, how many participants are situated at each site, and how this leads to possible mixtures of face-to-face and mediated conversations. These aspects have been taken into account as another subcategory of QIFs under the term Communication Scenario. However, the relevance of these aspects becomes more apparent when considering the communication processes in Section V-A, especially with respect to how a mixture of face-to-face and mediated conversation can influence the communication and in turn QoE. Another aspect is the relevance of recognizing the speakers in a telemeeting and being able to locate their specific position (see Section VI-D3).

3) CIFs related to the Telemeeting Purpose
In this paper, a telemeeting is considered to serve a certain set of purposes or goals. The QoE experienced by individual telemeeting participants is influenced by the participant's perception of the extent to which those purposes or goals could be reached. As an encouragement for future exploitation of such knowledge, we point to the network planning tool of ITU-T Recommendation G.107 [115], in which first considerations of purpose have been included, at least indirectly: when it comes to the impact of delay, different network planning parameters are recommended, depending on whether the service is intended for scenarios in which high, medium or low sensitivities to delay can be expected.

a: Categorizing Telemeeting Purposes
One way to categorize possible telemeeting purposes is to differentiate them into accomplishing tasks, fulfilling social needs, and exchanging information. As there are many different possible tasks, work reported in the literature often uses McGrath's task circumplex [108] to further categorize group tasks. This model structures tasks into four categories (generate, execute, negotiate, and choose) along two dimensions (cognitive ⇔ behavioral, collaborate ⇔ conflict-resolution), see Figure 4. Examples of fulfilling social needs are telemeetings in which persons communicate to feel connected, to feel they belong to the same group, to get to know each other, etc.
Finally, examples of exchanging information are making announcements, distributing news, or sharing useful information for group members.
A complementary way to categorize telemeeting purposes is to differentiate between professional / business telemeetings and private / leisure time telemeetings. Such a distinction, which is not always binary, can help to characterize telemeetings in relation to the conversation partners and their behavior, with aspects such as the degree of formality or expectations concerning the meeting outcomes. This approach is complementary to the one above that addresses the specific purpose. Both professional and leisure-time telemeetings can aim for accomplishing tasks and exchanging information. Likewise, fulfilling social needs can play a role not only in leisure-time but also in professional telemeetings, e.g., for improving the commitment of individuals to a team.
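As a minimal sketch of how the two complementary categorizations above could be recorded, for example when annotating telemeetings in a study, the following Python snippet combines the three purposes, the four task categories of the circumplex model [108], and a graded professional/private distinction. The class names, the field names, and the numeric 0-to-1 scale for the professional/private distinction are illustrative assumptions and are not taken from the cited literature.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Purpose(Enum):
    ACCOMPLISHING_TASKS = "accomplishing tasks"
    FULFILLING_SOCIAL_NEEDS = "fulfilling social needs"
    EXCHANGING_INFORMATION = "exchanging information"


class TaskCategory(Enum):
    """The four task categories of the circumplex model [108]."""
    GENERATE = "generate"
    EXECUTE = "execute"
    NEGOTIATE = "negotiate"
    CHOOSE = "choose"


@dataclass(frozen=True)
class TelemeetingPurpose:
    purpose: Purpose
    # Task category; only allowed for the purpose "accomplishing tasks".
    task_category: Optional[TaskCategory] = None
    # Assumed scale: 0.0 = purely private/leisure, 1.0 = purely professional.
    professional_degree: float = 1.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.professional_degree <= 1.0:
            raise ValueError("professional_degree must lie in [0, 1]")
        if (self.task_category is not None
                and self.purpose is not Purpose.ACCOMPLISHING_TASKS):
            raise ValueError("task_category requires ACCOMPLISHING_TASKS")


# Example: a project team negotiating priorities in a business meeting.
negotiation = TelemeetingPurpose(Purpose.ACCOMPLISHING_TASKS,
                                 TaskCategory.NEGOTIATE, 1.0)
# Example: a mostly private call among colleagues to stay in touch.
social_call = TelemeetingPurpose(Purpose.FULFILLING_SOCIAL_NEEDS,
                                 professional_degree=0.3)
```

Keeping the purpose and the professional/private degree as separate fields mirrors the point made above that the two categorizations are complementary rather than alternatives.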

b: Distributed Collaborative Work as a further Telemeeting Purpose
When a telemeeting serves the joint accomplishment of one or several tasks in a professional context, a common term to characterize such a telemeeting is remote or distributed collaborative working. This touches upon the multi-disciplinary research field of Computer-Supported Cooperative Work (CSCW), which Schmidt [116] characterizes as research to understand cooperative work practices with the aim of contributing, both conceptually and technically, to the development of collaborative computing, i.e., computing technologies that facilitate, mediate, or regulate workers' interdependent activities.
For an overview of main research threads in CSCW, the reader is referred to [117].
Focusing on distributed collaborative working using telemeeting technology, one important aspect for an effective and efficient collaboration is instantiating a shared workspace. This term refers to a physical or virtual space that allows the collaborating persons to share and jointly manipulate information and objects, e.g., see [118]. A typical example of a shared workspace in the physical domain is a meeting room with a whiteboard. For a telemeeting, a typical example of a state-of-the-art feature to create a virtual shared workspace is screen sharing that shows a virtual whiteboard or presentation slides.
As shared workspaces can take quite different forms, a first feature to characterize them is a differentiation between physical (or co-located) and virtual (or distributed) shared workspaces. Next to that, Park [119] proposed two more features: visibility, i.e., the extent to which an owner of information is sharing the view with the others, and controllability, i.e., the extent to which an owner of information is sharing the control with the others.
Nowadays, features such as screen sharing or joint document editing are commonplace examples of shared workspaces in many working contexts. With the advent of XR technologies, virtual workspaces can go beyond this, as they allow the (re-)creation of more immersive environments. Here, new questions arise when it comes to the combination of real and virtual environments as well as the potential benefits, which have been addressed by many researchers (e.g., in [120]-[122]). With current advances in remote sensing and control technologies in the area of cyber-physical systems, even more complex mixed physical/virtual shared workspaces are possible, in which physical objects can be manipulated by remote telemeeting participants.
Another aspect is the number of people who can simultaneously access such workspaces. Already with today's technology, this number has reached values well beyond 100 participants. Massive Open Online Courses (MOOCs), virtual conferences, or virtual conventions are typical examples used in e-learning, academic, and business contexts. Another example, which in addition crosses the border from collaborative working to science entertainment, is the virtual telescope, which combined and processed data streams from multiple real-world telescopes to create a real-time virtual experience of a solar eclipse in June 2020 [123].
With respect to QoE, shared workspaces have an impact in two ways. First, the QoE experienced by telemeeting participants might include perceptual features and cognitive constructs regarding the shared workspace as such, e.g., in terms of video quality, system delay, or general usability. Second, the degree to which the shared workspace actually supports the collaborative working process influences the participant's experience of that process, and in turn of the overall telemeeting.

c: Fulfilling Social Needs as a further Telemeeting Purpose
In this paper, social needs refer to the human desire to form and maintain social connections with other people. This relates to the feeling of belongingness to a group of people [124]- [126] as well as to the feeling of being connected with members of that group. To form and maintain such social connections, people want to communicate by expressing their views or sharing their knowledge with others, and by seeking information and opinions from others. Here, telemeetings and social media are two technologies that allow such communication with people situated at remote locations.
On the one hand, social media platforms have the potential to fulfill the need of belongingness and, to some extent, even the need of feeling connected. Recent studies investigate this potential but also possible drawbacks of social media and its relation to face-to-face contacts, see, e.g., [127]-[129]. On the other hand, telemeetings allow for real-time and speech-based communication, which means that they have the potential to create an intense feeling of being connected. Based on the media richness theory [29], it could be assumed that telemeetings can create an even richer feeling of belongingness. Some support exists that perceived social belongingness is higher in face-to-face interactions than what can be achieved with text messaging [130]. Also, an underlying, pre-existing group belongingness for participants was reported to lead to a better QoE [131]. It is noted that the authors of this paper consider videoconferencing fatigue, often synonymously referred to as Zoom Fatigue in the recent literature, see e.g., [132], [133], as a constituent within a more holistic concept of QoE (cf. Section V-B). In turn, recent findings have challenged the assumption that videoconferencing may be preferred over text-based interaction, for example when compensating for social distancing as required during the Covid-19 pandemic [134], [135].
Another aspect concerning the fulfillment of social needs by means of telemeeting technology is the feeling of copresence [136], [137], i.e., the feeling of being there with the other person(s), or "a sense of being together in a shared space at the same time" [138], [139]. Another related term is that of social presence, i.e., "the sense of being together with a virtual or remotely located communication partner", which implies the feeling of co-presence and being in a communication with the other persons [138]- [141]. Here, a distinction may be made between group belongingness at large, and interpersonal bonds, where group belongingness may be achieved even with less rich information, while social presence in terms of interpersonal bonds can be increased by more face-to-face like cues, according to the work by [142] on distributed learning. A lot of research is ongoing in this area and can be expected to expand in the context of immersive media and Virtual Reality (VR) and Augmented Reality (AR) technologies, see e.g., [143]- [146].

C. HUMAN INFLUENCE FACTORS (HIFS)
According to [2], [5], Human Influence Factors (HIFs) refer to any characteristics of a user that have an influence on QoE, including the background and the mental, psychophysiological and physiological state of a user.
At first glance, HIFs refer to the person who is experiencing a multimedia system. When it comes to human-to-human communication over a telemeeting system, however, not only the HIFs of the individual "experiencing person" are relevant, but also additional HIFs that relate to the other participants, their individual conversation behavior, as well as the relations between all participants. Based on these considerations, Table 4 provides a list of HIFs relevant in a telemeeting, grouped into the following sub-categories: Characteristics of the perceptual and cognitive processes, Internal state of individual participants, Conversation behavior, Relations between participants, and Language aspects.

Table 4 (excerpt): HIFs relevant in a telemeeting; columns: factor; Quality Relevance; Background.
… [191]
Listener's suppression of back channel signals; [192]; [193]
Adaptability of communication behavior; [192], [194 …
Language and body language aspects
Mixture of native and non-native speakers; [198]
Mixture of body languages; -; [173]
HIFs strongly relate to the characteristics of the user's perceptual and cognitive processes [2], [5]. For example, impaired visual or hearing acuity will influence the perception of any degradations in the audio and video signals. When it comes to the list of HIFs in Table 4, the question arises to which level of detail these characteristics should be included.
For example, there are many different possible forms of impaired visual acuity or hearing loss. Moreover, it is not clear in which way details about individual differences of the cognitive processing abilities of people, see, e.g., [150], can be taken into account either. For that reason, only the following, more global descriptors are used in this paper as HIFs: vision acuity, hearing acuity, olfactory acuity, tactile acuity, cognitive processing abilities. Here, the reader is also referred to Section V-B, which looks at the respective QoE-relevant perception and cognitive processes in more detail.

2) HIFs related to Conversation Partners and Conversation Behavior
One broad set of HIFs that are particularly relevant for a telemeeting refers to the participants, and more specifically to their communication goals and skills (including language and body language aspects), their individual mental state and personality as well as the relations between the different participants. These aspects influence the conversation behavior of the participants, for example regarding the amount of contributions of individuals, the way in which those contributions are made by the individuals and received by the other conversation partners, and the way in which the overall group conversation as such is managed. As the conversational behavior of participants influences the overall conversation structure and the communication processes (see Section V-A), the aspects discussed here are also relevant from a QoE point of view. One example is the finding that in certain conditions of transmission delay, active speakers rate quality differently than passive listeners [186]. With respect to communication goals, the individuals' intentions and their positions in terms of knowledge and attitudes to the subject at hand influence the participants' communication behavior or the communication processes as such, see, e.g., [107]. In addition, the individual intentions can also influence the QoE formation process of that individual. More detailed discussions on this aspect are given, for instance, in [166] on the contribution of knowledge and attitude to the experiencing process and in [151], [164] on observed links between attitude and QoE. Here, from an engineering perspective, it may be possible to infer the attitude from behavioral analysis, for example using conversation analysis, possibly even at a surface level, e.g., [199]. More details about the conversation process are given in Section V-A6.
With respect to communication skills, the individual's overall and momentary capabilities and willingness to cope with challenges of the discussion at hand (e.g., required cognitive load) as well as the system characteristics (e.g., lack of backchannels due to muted microphones) contribute in two ways. On the one hand, these aspects affect the participant's QoE as such, e.g., in terms of a discomfort due to a perceived lack of backchannels, when the participant is not used to it, or due to the impact of a required high cognitive load [32]. On the other hand, these aspects can influence the individual's communication behavior and thus also the experience of the other participants.
Similarly, the individual's internal state and personality are additional factors influencing communication behavior and QoE. The relation of emotion and communication is intensively discussed, for instance, in [155]. An impact of emotions or stress on QoE has been found for example in [151]- [153], [161]. With respect to personality, Schoenenberg et al. [197] found, for the case of transmission delays, that the personality that users perceived from other participants was linked to measures characterizing the conversation surface structure. Looking at personality from another perspective, Scott et al. [200] investigated the role of personality and cultural background on QoE. Obviously, if personality traits are perceived differently depending on the telemeeting system properties (e.g., [197]), it highlights the need for a holistic QoE assessment, beyond a mere audiovisual signal quality. If users do not use certain telemeeting platforms because the interaction with others is perceived as sub-optimal, even if not attributed to technology, the impact on technology acceptability will be just as bad as when the QoE-related issues are more explicitly attributed to the "communication channel".
Further aspects such as status and roles, trust, acquaintanceship and mutual expectations from each other as well as cultural aspects can determine the communication behavior between telemeeting participants. Here, studies on the automatic detection of roles, such as [195], [201], may be starting points for further analyses on the impact of roles on communication behavior and QoE. Finally, conversation management aspects such as moderation, agreed-upon rules or degree of formality are further factors. A framework for structuring the impact of roles and rules on conversation management is proposed in [107].

3) HIFs related to Adapting User Behavior to the System and Context
Next to the considerations discussed above, there is a further type of participant's behavior in a telemeeting that is of particular interest from a methodological perspective: the users' tendency to adapt their behavior to the technical system characteristics and the context of the telemeeting. On the one hand, this refers to any adaptation of the conversation behavior depending on the system's capabilities and limitations as well as on the overall telemeeting context. On the other hand, this refers also to the topic of user-system-interaction, which was already mentioned in Section IV-A2.
From today's perspective, one general drawback of the Media Richness Theory mentioned in Section IV-A2c is that it places face-to-face communication as the richest communication medium, which inherently means that face-to-face communication is the optimal way. This, however, is highly task or use-case dependent. While face-to-face meetings are definitely optimal for social interaction, they may be far less effective for decision making procedures or formal meetings. For instance, video access may impede the development of prosodic synchrony when some communicating partners display visually salient social cues, thereby dominating the conversation. In such conditions, communication via audio-only channels can be more effective in synchronizing speaking turns [202]. Over the years, several studies have shown that mediated collaboration can lead to similar or even better performance than face-to-face collaboration. This is for instance confirmed by a series of comprehensive literature reviews on decision support systems, which did not show a clear preference of face-to-face over mediated communication [14]-[17]. To account for such effects, Hantula et al. [31] proposed the Media Compensation Theory, which addresses the observation that humans actually adapt to electronic communication media; and Kock [203] proposed Media Naturalness Theory as a complementary approach by taking a behavioral perspective towards the use of electronic communication tools.
With respect to the topic of this paper, the degree to which participants are willing or able to adapt in this way will influence QoE, both for them and for their conversation partners. This strongly relates to the individual's experience with the communication modality as well as the person's understanding of the system's capabilities and limitations. As an example of an effect on the participant: if the participant is not used to multiparty audio-only calls, that participant will experience a high cognitive load from the telemeeting, which in turn reduces the QoE. As an example of an effect on the others: if an inexperienced participant is too far away from a microphone to be adequately captured, the other participants will perceive a lower QoE, as the speech signal of that participant will sound degraded. In practice, communication between participants about such behavior- or usage-related problems often solves the issue, as participants adapt their technology usage accordingly.

D. MIXED INFLUENCE FACTORS (MIFS)
Next to the SIFs, CIFs and HIFs discussed in the previous sections, additional QIFs can be assigned to combinations of factors from the three main categories. Those factors refer to characteristics that are shared by two or all three of the main categories. An overview of those Mixed Influence Factors (MIFs) is given in Table 5.
Due to the large diversity of the MIFs, the following text discusses only a few examples that may be of particular interest, which are those factors that concern the interfaces between the physical environments at each site and the system. For more information about the remaining factors, the reader is referred to the references in Table 5 and to the supplementary material in [25], [26].
Looking at factors concerning the environment-system-interfaces, the first type of factors relates to the characteristics of the end devices: the signal transduction between the environment and the system, i.e., the electro-optical and electro-acoustical transduction, addressing, for example, the impact of background noise or ambient lighting. Further factors concern the extent to which a representation of the physical environments and of relevant objects in those environments as well as any communication-relevant side information is included in the transmitted signals. These factors concern mainly characteristics of the system and the context. However, there is an additional group of factors concerning the environment-system-interfaces which also brings the human into the game: the positioning of the participants relative to the system components, and in particular to the capturing and reproduction devices. These types of MIFs are especially relevant from a QoE perspective. For example, non-ideal positions of speakers with respect to the microphones can lead to low QoE for the listeners, due to a reduced sound level and distance-induced coloration (due to the reduced level and high- and low-frequency audibility as well as the reduced direct-to-reverberant sound ratio), while optimal positions of viewers with respect to the displays can enhance QoE. Despite the QoE relevance of these positioning factors and their consideration in formal QoE test scenarios, see e.g., [23], [32], [74], [95], [113], [222]-[226], it is difficult to systematically address these factors in real-world settings. The reason is that these factors are determined by a mixture of system, context and human aspects. This mixture could consist of limitations of the system, e.g., due to specific end devices used, constraints of the context, e.g., due to the interior of a room, and human behavior, e.g., with respect to the participant's awareness and willingness to change their position if that could improve overall QoE.
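To give a feeling for the size of the level reduction mentioned above for non-ideal speaker positions, the following Python sketch computes how the level of the direct sound changes with the distance between speaker and microphone. It assumes an idealized point source in the free field (1/r law); real rooms, directional microphones, and the signal processing in the end devices will deviate from this. The function name and the example distances are illustrative assumptions.

```python
import math


def direct_level_change_db(ref_distance_m: float, distance_m: float) -> float:
    """Level change of the direct sound in dB when the talker moves from
    ref_distance_m to distance_m (idealized free-field 1/r law)."""
    if ref_distance_m <= 0 or distance_m <= 0:
        raise ValueError("distances must be positive")
    return -20.0 * math.log10(distance_m / ref_distance_m)


# Moving from 0.5 m to 2 m from the microphone: two doublings of distance,
# i.e. roughly -12 dB for the direct sound.
change = direct_level_change_db(0.5, 2.0)
print(f"{change:.1f} dB")
```

If, as a further simplifying assumption, the reverberant sound level is taken as roughly constant across the room, the direct-to-reverberant ratio changes by the same amount, which illustrates why distant talkers sound both quieter and more reverberant.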

V. SURVEY ON QOE-RELEVANT PROCESSES CONCERNING TELEMEETINGS
After having discussed the large body of research on QIFs, this section changes the perspective and looks at a number of communication, perceptual and cognitive QoE formation processes that are relevant for telemeetings. In that respect, it should be noted that this paper is touching on this field mainly from an engineering perspective and accordingly uses an engineering-type approach for describing the processes, e.g., by using flow diagrams. For that reason, this remark should be considered as a disclaimer in the sense that in other disciplines such as biology, neuroscience, psychology, or communication sciences, different descriptions are preferred.

A. COMMUNICATION PROCESSES
The primary purpose of a telemeeting is to communicate. Hence, the way in which the communication takes place is obviously a main contributor to the QoE perceived by the telemeeting participants. There is a vast amount of literature on human-to-human communication, both for face-to-face and mediated communication. In this paper, we focus on a number of aspects that have been considered in previous work with regard to QoE. First we present four inter-personal communication processes, i.e., processes that take place between the conversation partners: Conversational Games, Grounding, Turn-taking, and Using Back-channel Signals. After that, we discuss two further intra-personal communication processes: Understanding and Response Formation. Finally, this section closes with information on Conversational Flow and Conversation Structure; two concepts that help to characterize the degree of successful communication processes.

Table 5 (excerpt): Mixed Influence Factors; columns: factor; Quality Relevance; Background.
Focal assurance (certainty about who is talking); [32]-[35]; [35]
Mental model of the common (virtual) communication environment; [208], [228]; [227]

1) Conversational Games
Conversational games refer to parts of a communication that serve the accomplishment or alternatively the abandonment of a certain goal. Conversational games form a first step for separating a conversation into smaller units, as a conversation can consist of one to several conversational games. Conversational games have been introduced as a method to systematically characterize parts of a conversation with respect to the communication purpose, because they represent the "pragmatic functions of utterances with respect to achieving speakers' goals" [229]. More specifically, conversational games can be further separated into one or multiple conversational moves, i.e., utterances, which can be classified according to their purpose. In the literature, a number of coding schemes for conversational moves have been proposed [230]-[233], which were merged into a joint scheme in [28, Chapter 2]. To conclude, conversational games and moves allow the analysis of more complex conversations with multiple phases and even multiple communication purposes. Future work has to show how this can also help in analyzing the QoE of telemeetings, which may be characterized by a set of complex conversations.

2) Grounding
Grounding [234] describes a process of establishing a mutual belief between speaker and listeners that a piece of information has been correctly understood, i.e., that a common ground has been achieved. More specifically, Clark and Brennan [234] describe that this grounding process consists of a presentation phase (speaker's utterance) and an acceptance phase (listener feedback whether they understood the message or not). That means, it can take one or more turns until the grounding process for a particular message is completed. In the authors' view, the grounding process plays an important role in the user's QoE of a telemeeting. To start with, understanding the grounding process allows a quite analytic perspective on the potential impact of system characteristics on the conversation flow in a telemeeting. As Figure 5 explains in more detail, grounding in mediated communication may take a number of steps which can alter the original information that one person wants to convey up to the information that is actually understood by the other person. In the example shown in the figure, two persons A and B communicate, and their messages M_A and M_B, or the information I_A and I_B extracted from them, can be altered in different phases of the process.
Those alterations can happen during the persons' perception, understanding and response formation processes as well as due to the communication medium. Here, it is the degree of such alterations that determines how much effort and how many turns participants need to spend for achieving the common ground: the stronger the alterations, the more effort and turns are necessary.
Following the argumentation in [234], it is commonly accepted that conversation partners usually have an intrinsic desire to reach common ground. Thus, any disturbance in the grounding process is assumed to have some negative impact on the perceived conversation and thus on QoE. In that regard, a disturbance can mean that the grounding process requires more effort than usual, or that the process as such is even temporarily disrupted. Both technical and non-technical reasons can cause such disturbances. In turn, any means that ease the grounding process can increase QoE.
Finally, grounding is strongly connected with the other concepts considered in this section: the turn-taking process and back-channel signals described below are major components of the grounding process, while grounding with its specific purpose of reaching a mutual understanding can even be seen as one type of conversational games.

3) Turn-Taking
Turn-taking refers to the transitions between speech utterances of each conversation partner, i.e., it describes who is speaking when, and how a change of speakers is accomplished. The fundamental principle of the turn-taking process is described in a model proposed by Sacks et al. [9]. With this pivotal work, Sacks et al. have made a foundational step to what today is called conversation analysis. This model considers speaker turns as a composition of turn construction units followed by transition-relevance places: Turn construction units are sentential, clausal, phrasal, and lexical constructions; one can understand them as units that carry information. The transition-relevance places are the moments at which a continuation of the current turn or a speaker change may occur; one can understand them as units that are used for signalling the temporal organization inside a conversation. Therefore, any impact on this process leads to an impact on the conversation flow that can be considered as a mediator to QoE. For instance, ITU-T Recommendation P.1305 [174, Sec. 8] describes how transmission delay can impact the turn-taking process: Consider that speaker changes may occur not only by explicit hand-over from the current speaker but also by self-selection from the listeners. Then, transmission delay can cause listeners to "miss" the transition-relevance place, which in turn disrupts the self-selection. Especially in a multiparty scenario, this can lead to severe false-start problems, meaning that multiple listeners attempt to get the turn, interrupt each other and need several attempts to sort out who can continue with the next turn.

4) Using Backchannel Signals
Backchannels [172], also referred to as listener responses [173], are signals from the listener to the speaker to continue the turn. These signals can be produced vocally using verbal or non-verbal expressions, for example, utterances such as "mm", "uh uh", "right", "okay", and "yes". They can also be sent by means of facial expressions, gestures, and posture changes, e.g., straightening the upper body part, head nods, and establishing mutual eye gaze. As these backchannel signals support the turn-taking process, the system's ability to transmit these signals can strongly influence the conversation flow. Moreover, the lack or degradation of such signals can cause an unpleasant experience for a speaker in a telemeeting, as the following example from practice sketches: Especially in multiparty telemeetings, it is quite common that participants mute their microphones to avoid unnecessary noise. However, due to this absolute silence the speaker has no information on whether the other participants are still following or not, which for instance can cause feelings of uncertainty or which can even trigger the speaker to stop the turn and request feedback. Notice that many state-of-the-art telemeeting systems provide signalling features such as hand raising or thumbs up, which can be considered as additional ways of sending backchannels to the speaker or of initiating a turn.

5) Understanding and Response Formation
After discussing some essential interpersonal communication processes, this section focuses on communication processes within one person. In this paper, those processes are discussed in two main stages, Understanding and Response Formation. This links also to the considerations regarding the Grounding process sketched in Figure 5: to achieve a common ground, a listener needs to attempt to understand what the speaker was saying, and the listener needs to formulate some response to signal back whether the message is understood or not. Having a closer look at Understanding, one can differentiate two levels of an achieved understanding: intelligibility, which refers to the understanding of the spoken words or full sentences from the acoustic signal, and comprehensibility, which refers to the understanding of the meaning in a larger, pragmatic application context. The degree to which an understanding in terms of intelligibility can be reached depends on numerous aspects, such as the listener's hearing capabilities, the listener's fluency of the language, the speaker's pronunciation and articulation, and the signal quality of the acoustical signal, which in turn is influenced by the system and the speaker's and listener's environments. Some work on the relation between intelligibility and quality has been reported in the literature, which will be discussed in Section VI-D2.
The degree to which an understanding in terms of Comprehensibility can be reached depends on the listener's world knowledge and in particular on the knowledge about the current topic domain and context. In addition, knowledge about the speaker and his or her intentions ("What does he or she want to express or achieve when saying this?") helps to assess any consequences that can be drawn from the message, which is a crucial aspect of meaning extraction.
Moreover, language fluency of both speaker and listener can also strongly affect the degree of understanding. On the one hand, language fluency can help to improve intelligibility in case of degraded speech signals, as the listener can rely on his or her knowledge of the language in order to fill in gaps in the received speech signal, see e.g., [235]. On the other hand, language fluency can impact comprehensibility to such an extent that the perceived personality can be affected as well in certain contexts, see e.g., [236].
Having a closer look at Response Formation, the person not only takes the understood message into account but also other aspects. Further, world knowledge, and here in particular knowledge and assumptions about the speaker and any other telemeeting participants, will influence the content and form of the response ("Is it fine to formulate a short response or is a longer explanation necessary? Is it fine to respond in a more direct and emotionally neutral manner or is it better to react in a more empathetic way?"). In addition, the person's own intentions play a role as well.
Apparently, not only the person's world knowledge in general, but specifically the listener's knowledge and assumptions about the speaker and other telemeeting participants are important factors for the two processes Understanding and Response Formation. In the literature, this has been considered especially in the context of perspective-taking during the Grounding process, see e.g., [237]. This means, the degree to which a listener can recognize a speaker over the telemeeting system is an important aspect, see Section VI-D3.
To summarize, ensuring good Intelligibility and Comprehensibility is of paramount importance, as they not only determine the participants' QoE but are also crucial input for the participants' Response Formation and thus for their general communication behavior (see below). Therefore, these aspects deserve the maximum attention, in particular to understand the impact of the sound devices (microphones, receivers, amplifiers) and transmission tools, but also of the speaking and listening environments.

6) Characterizing Communication Processes: Conversation Flow and Conversation Structure
Conversation flow refers to the efficiency and smoothness of the communication. In other words, the smoother and more efficient the communication processes Conversational Games, Grounding, Turn-Taking, and Backchannels are taking place, the better the conversation flow, and in turn the better the QoE. There are multiple aspects that can influence the conversation flow or one or more of the described communication processes. These aspects can stem from any of the three main categories of QIFs, for example when a speaker has a limited experience in coping with lacking Backchannels in terms of a HIF, a non-optimal mixture of face-to-face and mediated communication as a CIF, possibly mediated by technology and hence SIF, or a significant end-to-end transmission delay as a SIF.
Conversation structure can be analyzed at two different levels. On a first level, an analysis of conversation structure targets the components of a conversation in terms of their function during the communication process. When considering speech, such components are the individual utterances of the conversation partners. This perspective comes from the discipline of Conversation Analysis (e.g., see [9], [238], [239]), which builds on the analysis of turn-taking and repair processes (e.g., [9], [211]). As already mentioned above, the effect of disrupted turn-taking processes on telemeeting QoE is sketched in [174], for example regarding false start problems after interruptions due to transmission delay. However, future work is necessary to better understand the relation between QoE and conversation structure as a result of conversation analysis.
On a second level, the conversation structure can also be analyzed with regard to the sequence of on/off speech patterns in the conversation, irrespective of their function or content. This approach is also referred to as Conversational Surface Structure Analysis. Introduced in [240], [241], the principle is to describe the conversation structure as a temporal sequence of states in which no, one, or multiple speakers talk simultaneously. The advantage of this method is that, at least for speech-based analysis, it is rather straightforward to implement by means of voice activity detection algorithms. With this simplified analysis of conversations that does not require any speech recognition, such state-based surface structure models have also been investigated in a multimodal analysis of conversations, e.g., [242]. Seen from a probabilistic perspective, Conversational Surface Structure is usually modelled as a Markov chain in which the steady-state and transition probabilities are obtained from observations. In [243], this approach has been used to characterize the effects of transmission delay on telephone conversations by computing statistical measures from a corresponding Markov model. Later, this approach has been further developed and extended with additional measures in the context of QoE evaluation of transmission delay [78], [79], [174], [185], [244]-[246]. Here, state probabilities and sojourn times, but also transitions between states at the different ends of a two- or multiparty communication can be used as sources of information, revealing, for example, unintended interruptions that may occur in case of delay, whether participants adapt their conversation behavior to delay, and whether the delay may be noticed as a QoE degradation and attributed to the system (cf. e.g., [78], [79], [199]).
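To make the state-based description more tangible, the following Python sketch derives a surface structure from two per-frame voice activity sequences and estimates state probabilities, frame-to-frame transition probabilities, and mean sojourn times. It is a minimal illustration under stated assumptions, not the procedure of any of the cited works: the four state names, the function names, and the fixed frame duration are hypothetical, and the voice activity flags are assumed to come from some external detector.

```python
from collections import Counter
from itertools import groupby

# Hypothetical state labels for a two-party conversation.
STATES = ("mutual_silence", "A_only", "B_only", "double_talk")

_LOOKUP = {
    (False, False): "mutual_silence",
    (True, False): "A_only",
    (False, True): "B_only",
    (True, True): "double_talk",
}


def to_states(vad_a, vad_b):
    """Map two equally long voice-activity sequences to a state sequence."""
    if len(vad_a) != len(vad_b):
        raise ValueError("VAD sequences must have the same length")
    return [_LOOKUP[(bool(a), bool(b))] for a, b in zip(vad_a, vad_b)]


def state_probabilities(states):
    """Relative frequency of each state over all observed frames."""
    if not states:
        raise ValueError("need at least one frame")
    counts = Counter(states)
    return {s: counts[s] / len(states) for s in STATES}


def transition_probabilities(states):
    """Row-normalised frequencies of frame-to-frame state transitions."""
    counts = Counter(zip(states, states[1:]))
    probs = {}
    for src in STATES:
        total = sum(counts[(src, dst)] for dst in STATES)
        if total:  # skip states that were never left or re-entered
            probs[src] = {dst: counts[(src, dst)] / total for dst in STATES}
    return probs


def mean_sojourn_times(states, frame_s):
    """Mean duration (seconds) of uninterrupted stays in each visited state."""
    runs = {}
    for state, group in groupby(states):
        runs.setdefault(state, []).append(sum(1 for _ in group) * frame_s)
    return {s: sum(d) / len(d) for s, d in runs.items()}


# Toy input: 1 = speech detected in that frame, 0 = no speech.
vad_a = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
vad_b = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
states = to_states(vad_a, vad_b)
probabilities = state_probabilities(states)
transitions = transition_probabilities(states)
sojourns = mean_sojourn_times(states, frame_s=0.1)
```

Comparing such statistics between a delay-free and a delayed condition, for instance the share of double talk or the mean duration of mutual silence, is one conceivable way of using the model; which measures are actually informative is a matter of the cited empirical work.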

B. QOE FORMATION PROCESS
Referring back to the definition of QoE in Section II-A and building on the fundamental work on quality perception in [6], it becomes clear that QoE happens largely in the user's mind. To better understand this perspective, this section takes a closer look at the processes inside the experiencing person. This leads to a more holistic understanding of telemeeting QoE, which in turn could help in technical system development. Note that in test contexts, the experiencing person is usually referred to as test subject, whereas in real-life telemeetings that person is usually referred to as participant.

1) QoE-Relevant Processes within the Experiencing Person
In the literature, a number of principles, taxonomies and models have been proposed to describe the formation of QoE or the link with related concepts such as Quality of Service, Quality Perception, and Quality Assessment, see e.g., [6]-[8], [92], [95], [247]-[250]. The motivation for such work is, for example, to provide insights on human quality perception that can help to improve technology, similar to approaches that exploit knowledge of human auditory or visual perception in coding, or to form the basis for instrumental quality assessment algorithms.
From an engineering perspective, there are two principal kinds of processes involved to form telemeeting QoE: those that process information, here referred to as QoE-relevant Information Processing Mechanisms; and those that steer the information processes, here referred to as QoE-relevant Steering Mechanisms. As the processing and steering mechanisms are tightly integrated in the human's mind, such a depiction only serves to illustrate the different components from a simplified, systems theoretic perspective. Figure 6 gives an overview of the main processes, which are described in the following.
First, let us focus on the Information Processing Mechanisms. Starting from the outcome of these mechanisms, Telemeeting QoE is the result of a QoE Formation process, which in turn consists of a number of sub-processes (details shown later in Section V-B2). In previous models (e.g., [8], [250]), this QoE Formation process takes as input the results of a Perception stage. Considering the state-of-the-art in perception research (e.g., [251]), this step is considered to be a pull mechanism (see also [252]). As a higher-level cognitive process, the QoE Formation process is taking the relevant information, such as the perceived characteristics of the audio and video signals, from a pool of Perceptual Features, with the pool in turn being filled by the underlying sensory perception processes.
As a novelty compared to earlier publications on QoE formation processes, this paper explicitly considers another process as highly relevant for QoE Formation in a telemeeting context: Communication. Obviously, this Communication process is the most central cognitive process during a telemeeting, and it consists in itself of the two sub-processes Understanding and Response Formation, see the previous Section V-A. With respect to QoE Formation, the Communication process also takes Perceptual Features as input; however, the type and weight of information used by the QoE Formation and Communication processes may differ. Moreover, the QoE Formation process may also take as input some information specific to the Communication process, that is, the perceived characteristics from the Perception stage are augmented by further communication-related characteristics such as, for example, the conversation flow.
To complement the picture, an Action process is considered here as well. It accounts for the fact that an experiencing person performs different types of actions during a telemeeting: First, from perception research (e.g., [253]-[255]) it is known that people perform some actions to optimize the perception of a situation. An example for this is turning the head and eyes to the direction of a sound source to augment the auditory signal with visual information. Second, in a telemeeting a person obviously performs some actions to communicate: People speak and they also send other non-vocal communication signals (e.g., see [173]). Third, it has been observed that a certain QoE can trigger a person to different behaviors, see e.g., [248] for more details in video streaming contexts. This is also very common in telemeeting contexts; a typical example is that participants switch off video transmission when they experience quality problems with their network connection. Further, in case there are communication-related impairments such as for example background noise, interruptions in the audio channel, or a generally too low volume, users may speak up and apply Lombard speech (on the Lombard effect, see [256], [257]), ask a non-intelligible participant to repeat what she or he said, or increase the volume of their audio playout. For further listening-related measures see also [252].
Now, let us focus on the Steering mechanisms. The main process that can steer the previously mentioned Information Processing Mechanisms is Attention. Perception research has shown that attention is highly relevant during human information processing, as it enables people to tune the Perception process to individual signals, such as the speech signal of one particular conversation partner, resolving the cocktail party problem [258], [259]. This way, the mental information bandwidth and processing workload is effectively reduced.
The bottom-up component of attention is typically referred to as saliency (see e.g., [260]- [262] for vision, listening and QoE, respectively). In turn, attention can also be driven by top-down processes, for example when voluntarily attending to a specific conversation partner.
In the QoE domain, a similar impact can be asserted, as attention may be drawn or directed to individual QoE-relevant characteristics. For example, this may be the blurriness of a video signal from one particular conversation partner who is currently talking. Last but not least, Attention impacts the Communication processes as well, for example, by focusing and reacting on specific information.
Next to the aspect of Focus, Attention is also related to another sub-process: Awareness. The motivation for this is that attention needs a trigger. Such triggering may be based on information stemming from the Perception process: When the process fails to fully match the perceived, bottom-up sensory information with a number of top-down hypotheses [263], the person is becoming aware that something is missing or wrong, which in turn can trigger the person to pay attention and focus. Those mentioned hypotheses are stored in the person's memory and are referred to as internal references; the corresponding iterative sub-processes performing this hypothesis testing are referred to as Anticipation & Matching as a top-down process, and Perceptual Event Formation as a bottom-up process, see e.g., [8], [252], [28,Chap. 5].
Along with Attention, which is directly steering the different processes, further information is retrieved from the person's memory as additional input to the different processes. The Internal References already mentioned not only play a role during Perception but also in the QoE Formation process, when it comes to the formation of the desired quality features, see e.g., [6], [8], [28, Chap. 5]. Here, Expectations as well as Prior Experiences with the same telemeeting system or with previous similar telemeetings are taken into account. The terms Internal References and Expectations are often used in a similar way in a QoE assessment context. Internal references and expectations are assumed to result from Prior Experiences. According to Jekosch [6] and based on Piaget [264], both accommodation and assimilation may be involved in reference formation, depending on whether reference schemata are adjusted to the perceptual representation, or the perceptual representation to existing schemata, respectively. A more dedicated view on QoE Formation and Expectations can be found, for example, in [265]. For instance, Expectations may stem from sources other than previous telemeetings, such as costs or the particular situation. That is, Expectations are not solely based on Prior Experiences and may also be influenced by aspects such as advertisements, recommendations by friends, colleagues and family, or by reviews found on the internet or in magazines. Finally, during the Communication process, world knowledge enables the person to fully understand the perceived messages and to form appropriate responses.

2) QoE Formation Process in more detail
According to Jekosch [6], at the core of QoE Formation is a cognitive process during which a quality judgement is formed by comparing the perceived features of an entity with the expected, desired features.
In the past, a number of extensions of this process model have been proposed to achieve several goals: a) to obtain a more detailed understanding of this QoE Formation Process and possible influences from inside and outside the person; b) to explain a number of observed aspects, such as misattribution of technical quality problems to the interaction partners (see [7], [8] on process model extensions and [197] on the misattribution aspect); c) to account for evidence indicating that QoE is essentially a multidimensional and multilayered construct (see below); d) to provide theoretical models to explain this process in specific contexts such as the perception of asymmetries in multiparty meetings [28], [250] or when changing the viewing behavior in video streaming scenarios [248]; e) to embed this into broader characterization schemes of quality beyond the perception processes (see e.g., [92], [95], [247], [249] as well as Section VIII-A).
Building on those considerations, Figure 7 shows the essential details of this process. Contrary to previous work, however, the figure puts an emphasis on the fact that QoE is considered to be an aggregated construct of multiple, interrelated and time-dependent aspects, which here are introduced as QoE Constituents. In each QoE Constituent Formation process, further sub-processes are shown that reflect Jekosch's main principle combined with the most essential model extensions listed above. The input to each QoE Constituent Formation process stems from the Perception and Communication processes. In earlier work [8], [248], [250], we referred to this input as the perceived character, which consists of a multitude of Perceptual Features.
In a first step, those perceptual features are transformed into Quality Features by a Reflection & Attribution process. During the Reflection phase, only those features are selected from the totality of the Perceptual Features that are QoE-relevant. Here, the Attention process described before plays an important role. During the Attribution phase, any QoE-relevant features are either attributed to the telemeeting or to something else, such as, for example, the environment or the conversation partners. Here, mental models [227] of the telemeeting and especially the telemeeting system play an important role. During this stage also the Desired Quality Features are formed by retrieving information from memory, which reflect Internal References and Expectations in light of Prior Experiences of the person (see the considerations in Section V-B).
With the Perceived and Desired Quality Features as input, Jekosch's principle of Comparison & Judgment is the core stage, which forms a judgment about the QoE Constituent. Finally, the judgments about the individual QoE Constituents are aggregated to form an overall Telemeeting QoE.
Looking at this from an engineering perspective, the QoE Formation process can be considered as a sort of multidimensional signal processing mechanism: a multitude of input features are transformed, weighted, selected, compared, and aggregated to a multitude of intermediate representations and an output. Section V-B3 addresses this aspect of the multidimensionality of QoE in more detail, by discussing the relation between Quality Features, Quality Dimensions, and QoE Constituents.
Next to the multidimensionality, another perspective is to look at the temporal relations between the aspects discussed so far. On the one hand, the characteristics of a telemeeting can change during a meeting and thus also the Quality Features, QoE Constituents, and Overall QoE can vary. On the other hand, different Quality Features and QoE Constituents may be formed at different time scales and they may even influence each other over time. Section V-B4 provides more background information on the temporal aspects of QoE.

3) Quality Features, Quality Dimensions, and QoE Constituents as part of the QoE Formation Process
In a typical approach taken in the literature, including our own prior work, an overall quality judgment is directly formed from a multitude of Quality Features, see top panel of Figure 8. As these features can be of quite different types, an approach has been presented in [278] to structure the features into four levels: level of direct perception, level of interaction, level of the usage scenario, and level of service.

QoE Constituent | Quality Relevance | Background
Immersion | [175] | [176]-[178]
Simulator sickness | [179]-[182] | [183], [184]
Feeling of Presence | [182], [269]-[272] | [175], [273], [139], [141]
Feeling of Co-Presence | [223], [274]-[276] | [273], [277], [137], [141]
Notes: The column Quality Relevance cites either empirical studies directly investigating the constituent in a QoE context or publications discussing the relevance more from a theoretical point of view. Moreover, the relevance can refer either directly to QoE or to perception or communication aspects, which in turn are relevant for QoE. The column Background provides pointers to further literature.

The present paper extends this concept by allowing for the formation of individual QoE Constituents, which then are aggregated to form an overall QoE judgment, see bottom panel of Figure 8. This extension essentially introduces an intermediate and visible level of aggregation: instead of directly aggregating a multitude of individual Quality Features into a single QoE judgment, the Quality Features are first aggregated into a set of QoE Constituents, which is then aggregated into an integral QoE.
The motivation for introducing these QoE Constituents is manifold. To start with, this still allows for the inclusion of research on multidimensional quality assessment, such as [51], [279]-[281], in which quality is considered to result from a set of orthogonal dimensions. These dimensions are extracted from a larger set of attributes and represent the underlying quality features. These dimensions may be integrated, for example, into audio or video quality, e.g., using preference mapping [279], [282]. Here, uni-modal media quality represents a QoE constituent.
However, the concept goes further than a solely dimension- and quality-based approach. First, different QoE Constituents need not be orthogonal since they may depend on common quality features. Second, QoE Constituents can encompass aspects that are not directly linked to speech, audio, or video signals, which have so far been the focus of multidimensional quality assessment. Instead, also other QoE Constituents can now be considered in this framework, such as for instance simulator sickness, immersion, or fatigue, see Table 6.
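The two-stage aggregation can be illustrated with a small Python sketch. It is only a toy model under strong assumptions that are not taken from this paper or the cited literature: judgments are placed on a 0 to 1 scale, Comparison & Judgment is reduced to a capped ratio of perceived to desired feature values, and both aggregation stages are plain weighted means. All feature names, constituent names, and weights are hypothetical. The shared feature `lip_sync` shows how two constituents can depend on a common quality feature and are therefore not orthogonal.

```python
def judge(perceived, desired):
    """Toy Comparison & Judgment for one feature: 1.0 once the perceived
    value reaches the desired value, proportionally lower otherwise."""
    if desired <= 0:
        return 1.0
    return min(perceived / desired, 1.0)


def constituent_score(perceived, desired, feature_weights):
    """First stage: weighted mean of feature judgments for one constituent."""
    total = sum(feature_weights.values())
    return sum(
        weight * judge(perceived[name], desired[name])
        for name, weight in feature_weights.items()
    ) / total


def overall_qoe(constituent_scores, constituent_weights):
    """Second stage: weighted mean of the constituent scores."""
    total = sum(constituent_weights[name] for name in constituent_scores)
    return sum(
        constituent_weights[name] * score
        for name, score in constituent_scores.items()
    ) / total


# Hypothetical perceived and desired quality features (0..1).
perceived = {"audio_clarity": 0.8, "video_sharpness": 0.5, "lip_sync": 0.9}
desired = {"audio_clarity": 1.0, "video_sharpness": 1.0, "lip_sync": 0.9}

# Both constituents use "lip_sync", so they are not orthogonal.
constituent_features = {
    "audio_quality": {"audio_clarity": 3.0, "lip_sync": 1.0},
    "video_quality": {"video_sharpness": 3.0, "lip_sync": 1.0},
}
constituent_weights = {"audio_quality": 2.0, "video_quality": 1.0}

scores = {
    name: constituent_score(perceived, desired, weights)
    for name, weights in constituent_features.items()
}
overall = overall_qoe(scores, constituent_weights)
```

The intermediate `scores` dictionary is the point of the sketch: the constituent level remains visible and can be inspected or reported separately before it is collapsed into a single value.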

4) Temporal Aspects of the QoE Formation Process and the notion of QoE Streams
When it comes to temporal aspects of telemeeting QoE, the picture in Figure 9 is rather complex: the involved perception and cognition processes run in parallel, and there are different levels at which temporal changes can occur. On the first level, the technical and non-technical telemeeting characteristics are usually subject to changes over time. Audio and video signals per se are functions of time. Next, network and connection characteristics may change, and the system may respond to that with a certain behavior. In addition, participants may change their communication behavior; they may use different additional system features such as shared workspace or chat at different moments; or they may interact with the system interface a number of times during a telemeeting, etc.
On a second level, the participants' perception of the telemeeting is a function of time as well. At the level of perceptual processing, an example for temporal effects in auditory perception is temporal masking, see e.g., [283], [284]. At a higher level, auditory and visual objects and other perceptual features are formed based on the telemeeting characteristics captured by the human auditory and visual systems. Note that feedback mechanisms initiated at higher level may evoke top-down information that influences the bottom-up processing during auditory scene analysis [263]. However, there is not a strict one-to-one mapping of the temporal characteristics between the sensory input and the formed auditory and visual objects. For instance, in auditory perception research it is known that either single or multiple auditory objects can be formed from a multitude of short signal parts, and that this depends on the temporal and spectral characteristics of the acoustic input. This leads to the concept of auditory streams [285], or perceptual streams as a more general term, which allows to account for such temporal dependencies and effects. Next to these aspects, perception is also strongly influenced by attention, as discussed in Section V-B1. A person can change his or her attention focus between different perceptual streams at any moment in time.

[Figure 9, recovered labels: Attention = f(t); QoE Formation Process = f(t); QoE Streams = f(t); Physical, Sensory & Cognitive]
On a third level, the QoE Formation processes are also a function of time. First, the formation of Quality Features and QoE Constituents (see Figures 7 and 8) is based on perceptual information, which is temporally changing. Thus the internal states and outputs of the QoE Formation process are time-dependent as well. Moreover, attention plays a role here, too, with users focusing on specific quality features at a time, or weighting these in a certain manner, see e.g., [8], [28, Chap. 5]. Hence, the QoE Formation processes as such can be influenced over time. In analogy to perceptual streams, this paper proposes to use the term QoE Streams when referring to the temporal evolution of the QoE Formation processes. For example, specific impairments identified in the audio or video signals of different participants may form such a QoE Stream, or the depiction of a screen share by one of the participants.
In addition to attention, action is another factor that contributes to the complexity, as the person's actions are functions of time that influence the perception and QoE Formation processes and vice-versa (e.g., [253]- [255]), as well as the telemeeting as such.
Next to such theoretical considerations, temporal aspects of QoE have also been empirically investigated for both momentary and episodic changes of signal quality, see e.g., [86], [286]-[288]. Moreover, some audiovisual quality models for non-communication-type media, such as ITU-T Rec. P.1203.3 [289], [290] for HTTP-based adaptive streaming, contain specific considerations on temporal integration for the auditory and visual modalities. Similarly, the work in [291] has pointed to corresponding effects, where a base quality was perceived by users when viewing audiovisual material at home, with packet loss artefacts considered as additional impairments, as a sort of separate stream. Moreover, as this paper considers also other QoE Constituents than perceived signal quality, additional temporal aspects such as those of simulator sickness [184], presence [292], cognitive load and working memory, video conferencing fatigue [1], [131], [132], or usability and user experience, are also relevant for the formation of telemeeting QoE.
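As a simple illustration of how momentary judgments within one QoE Stream might be integrated into an episodic judgment, the following Python sketch applies a recency-weighted average. This is an assumption made for illustration only: it is not the temporal integration defined in ITU-T Rec. P.1203.3 or in any of the cited studies, and the function name, the half-life parameter, the 1 to 5 score range, and the stream names are all hypothetical.

```python
def recency_weighted_score(momentary, half_life):
    """Pool momentary scores (oldest first) into one episodic score.

    A sample's weight is halved for every `half_life` samples of age,
    so recent samples contribute more than early ones.
    """
    if not momentary:
        raise ValueError("need at least one momentary score")
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    n = len(momentary)
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return sum(w * s for w, s in zip(weights, momentary)) / sum(weights)


# Hypothetical QoE Streams with momentary scores on a 1..5 scale.
streams = {
    "audio_participant_a": [5, 5, 5, 1],  # impairment at the end
    "video_participant_b": [1, 5, 5, 5],  # impairment at the start
}
episodic = {
    name: recency_weighted_score(scores, half_life=2.0)
    for name, scores in streams.items()
}
```

With this weighting, the stream whose impairment occurs at the end of the episode receives the lower episodic score, although both streams contain the same set of momentary values.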

C. HOW QIFS AFFECT QOE FORMATION
It is obvious that the different steps of the QoE formation process may be influenced in different ways by the QIFs discussed in Section IV. The most straightforward impact is that a QIF directly influences the sensory and cognitive processes within the experiencing person. For instance, a person could focus on certain Quality Features, when the person has a certain goal in mind, such as choosing a conferencing tool for a given purpose, or during a meeting, when she/he is in a certain emotional state. Or, the person might be distracted by events occurring as part of the context of use, e.g., during mobile use compared to stationary use in the office or at home. Or a person might not be very critical about the video quality because the person has some lower visual acuity, but is not wearing glasses or lenses. Another possibility is when a QIF has an impact on the telemeeting as such, for instance, in terms of achieving the meeting goals or having a good conversation flow, etc., which in turn has an impact on the perception of the telemeeting's QoE.

VI. SURVEY ON STATE-OF-THE-ART IN QOE EVALUATION OF TELEMEETINGS
In the following, the surveys on QIFs and communication and QoE formation processes are complemented by an overview of "subjective" and "objective" test methods for media quality and QoE evaluation. It is well known in the field that a QoE evaluation of a system is a nontrivial task, given the numerous QIFs that are relevant but not part of the system under test [326]. For that reason, a typical approach is to follow standardized test protocols to control such QIFs to a certain degree, or to explicitly include specific QIFs in the subsequent data analysis, as in the case of crowd-sourcing or outside-the-lab testing [327]. In this respect, the usage of standardized methods ideally ensures the reproducibility and comparability of the assessment results.
The next sections first provide a survey of the two main categories of available evaluation methods: (a) perceptual test methods, often referred to as subjective quality evaluation, and (b) instrumental methods, often referred to as objective quality evaluation. Then, some guidance is provided for the selection of a QoE assessment method that optimally matches the test case at hand. Finally, some complementary approaches to the QoE assessment of telemeeting systems are discussed.

A. PERCEPTUAL, SUBJECTIVE QUALITY EVALUATION
In perceptual tests, participants are invited to carry out, in a specific test context, certain tasks with the system under test. At certain times specified in the test standard, ratings of media quality or other measures related to QoE are collected, according to the specifics of the test protocol. In ITU-T Recommendations, for example, the test context, tasks, and methods to collect QoE-related ratings are often referred to as independent test factors, which can differ a lot between individual methods. Table 7 gives an exemplary overview of such aspects for a number of well-known perceptual test methods that are relevant for telemeetings. These methods can be considered as more conventional, direct perceptual test methods, as they ask test participants to give quality or other types of QoE-related ratings using a rating scale. The most prominent measures of quality obtained from such rating scales are Mean Opinion Scores (MOS). For a precise definition of MOS and related terminology see [328].
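To make the notion of a MOS concrete, the following minimal Python sketch averages the ratings collected for one test condition and reports a confidence interval. It is an illustration under stated assumptions, not a procedure taken from any of the cited Recommendations: it assumes a five-point rating scale (1 to 5), uses a normal approximation for the 95% interval, and the function name `mean_opinion_score` and the example ratings are hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for one test condition.

    Assumption: ratings lie on a five-point scale from 1 to 5.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mos = statistics.mean(ratings)
    # Normal approximation (z = 1.96); for small panels a Student's t
    # quantile would be the more careful choice.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight participants for a single condition.
ratings = [4, 5, 3, 4, 4, 2, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean is a common way to indicate how much the ratings of individual participants scatter around the MOS.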

B. INSTRUMENTAL, OBJECTIVE QUALITY EVALUATION
Contrary to the perceptual tests, instrumental evaluation approaches do not require the input from test participants to obtain a QoE rating about the system under test. Instead, instrumental approaches use an algorithm to predict media quality or other QoE-related aspects as they would have been rated by participants in a very specific test situation, and according to one of the previously mentioned perceptual test methods. Here, usually the average rating obtained from a group of participants is estimated, which is referred to as the Mean Opinion Score (MOS) [328]. The performance of such standardized QoE prediction models is usually validated in a rigorous manner within the standardization group, and in most cases based on validation test data previously unknown during model development. Nonetheless, such validation can be carried out only for a specific set of test factors according to the perceptual test methods that the prediction models are based upon. The underlying test methods, among other aspects, determine the modality of the predicted quality (speech, audio, video), and whether the quality prediction is for a non-interactive (listening-/viewing-only) or a conversation setting. Furthermore, instrumental approaches can differ in terms of input (ranging from system parameters for metadata models over bitstream information to the actual signals), and in terms of the usage of a reference for prediction (ranging from no- to reduced- to full-reference information being used, accordingly referring to the models as no-, reduced- or full-reference models). Table 8 gives an exemplary overview of QoE prediction models that are relevant for telemeetings. To the best of the authors' knowledge, those models have not been validated yet for different types of telemeetings and in particular not for multiparty settings, with the exception of Adel et al. [329], who investigated the performance of ITU-T Rec. G.107, the E-Model [115], for codec tandems that occur in central-bridge-based telemeeting systems. Some concrete modelling ideas on how individual-channel model results could be employed for predicting a quality score for a complete multiparty meeting have been proposed in [330].
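As a small illustration of a parametric (metadata-type) model output, the sketch below converts an E-Model transmission rating factor R into an estimated MOS. The mapping is the commonly quoted narrowband conversion associated with ITU-T Rec. G.107; it is reproduced here from memory and should be checked against the Recommendation before use. The function name `r_to_mos` is hypothetical, and the computation of R itself from the transmission parameters is not shown.

```python
def r_to_mos(r):
    """Map an E-Model rating factor R to an estimated MOS.

    Assumption: narrowband R scale, with the cubic conversion
    commonly quoted for ITU-T Rec. G.107.
    """
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# Example rating factors, from poor to very good connections.
for r in (50.0, 70.0, 80.0, 93.2):
    print(f"R = {r:5.1f} -> estimated MOS = {r_to_mos(r):.2f}")
```

Note that such a score is an estimate for a single channel under the conditions the model was validated for; as stated above, its validity for multiparty telemeetings has hardly been investigated.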

C. SELECTING APPROPRIATE EVALUATION METHODS
As a consequence of the different characteristics of the evaluation methods mentioned above, practitioners and researchers running a QoE assessment campaign need to opt for a test method that optimally matches the test case at hand. These test cases are often defined by system, processing and/or signal characteristics, as well as the use cases for which the telemeeting system under test has been designed. Next to Tables 7 and 8, several pointers are available that may be used for finding an appropriate QoE evaluation method. ITU-T Recommendations G.1011 [331], and especially P.1301 [27] and P.1310 [332] provide concrete guidance to QoE assessment methods suitable for telemeeting systems. In addition, the interested reader is referred to the overview pages of the corresponding ITU-T [48], [49], ITU-R [333], [334] and ISO MPEG [50] standards to get up-to-date information about standardized methods. Scientific texts such as [4], [20], [335] as well as text books and PhD theses on quality assessment of interactive telecommunication services (e.g., [7], [28], [51], [61], [82], [95], [246], [336]) give further pointers to standardized and non-standardized methods that are relevant for telemeeting assessment. For surveys of QoE assessment methods for other services such as HTTP-based adaptive streaming, see, for example, [337]-[339] or generally for audiovisual multimedia, see [20].
TABLE 7. Overview of the main test factors for an exemplary set of standardized, perceptual (also referred to as subjective) QoE assessment methods, which are relevant for telemeeting systems or their components. Note: This is an updated version of a similar table presented in [28, Chap. 4]. Further, more recent methods specifically addressing telemeeting assessment are considered in the text, such as the P.1300 Recommendation series developed in Question Q10 of ITU-T Study Group 12. Columns: Test Factor, followed by one column per perceptual (subjective) test method: ITU-T P.800 [111], ITU-T P.805 [293], ITU-T P.832 [294], ITU-T P.835 [295], ITU-T P.910 [113], ITU-T P.911 [296], ITU-T P.919 [297], ITU-T P.920 [298], ITU-R BS.1534 [299], ITU-R BS.1116 [112], ITU-R BT.500 [300]. First row: Test Modality.

1) Current Developments on QoE Assessment Methods
In the current state-of-the-art QoE assessment, additional aspects of QoE are increasingly moving into focus, and are assessed in different ways than using the conventional QoE-related, MOS-type rating scales. Highly relevant for telemeetings are approaches that look at the conversational structure, e.g., [174], cognitive load, e.g., [332], intelligibility of concurrent speakers, e.g., [340], and task performance, e.g., [13]. Moreover, test methods for assessing 360° video QoE beyond media quality have been developed [341] and are provided in ITU-T Recommendation P.919 [297], addressing, for example, simulator sickness and viewing behavior. Further approaches that help to assess the communication-related processes of Section V-A can be found in the large body of literature on intelligibility measurement and speaker recognition. The next two subsections outline the relation of these topics to QoE and provide pointers to relevant work.

2) Speech Intelligibility: Assessment and its Relation to QoE
Speech intelligibility most commonly refers to word or utterance recognition in acoustic, verbal communication situations. The intelligibility of a spoken message depends on the speaker (e.g., articulation and speaking style) and listener (e.g., familiarity with the speaker's voice and the conversation context). Intelligibility is moreover influenced by the hearing abilities and the language proficiency of the listener. Intelligibility varies with the quality of the speech signal's acoustic transmission and the availability of visual cues from the speaker. Although a gold standard for speech intelligibility measurement is not available, there exist a number of standardized speech intelligibility assessment methods and models, see e.g., [342]-[345]. Concerning the relation between speech intelligibility and speech quality, a first approximation is that good intelligibility is a necessary, but not sufficient, prerequisite for good quality, see, e.g., [4]. This means that low speech intelligibility will result in low quality, but high speech intelligibility will not necessarily lead to high speech quality. Focusing on speech distortions induced by packet loss, Schiffner et al. [346] looked into this relation more deeply and showed a highly non-linear relation between intelligibility and quality: for rather high intelligibility, quality judgements can vary substantially but are hardly influenced by intelligibility, while for low intelligibility, quality judgements are consistently very low. Looking at background noise, speech bandwidths and speech levels, Preminger and Van Tasell [347] showed a similarly complex relationship: in an experiment in which intelligibility varied between stimuli, the subjects hardly distinguished between the measured variables intelligibility, effort, and loudness.
The complex relationship between intelligibility and quality is also of high interest in the field of speech enhancement algorithms, both in telephony and hearing instrument contexts. It appears that algorithms can improve quality but not necessarily intelligibility, e.g., [348], or that not all algorithms that improve intelligibility also improve quality, e.g., [349].

3) Speaker Recognition: Assessment and its Relation to QoE
The importance for a listener to recognize the speaker's identity, to be able to associate specific opinions shared in a telemeeting to individual speakers, and to be able to form some impression about the speaker's personality has been investigated in different QoE-relevant contexts. In the context of grounding, Fussell and Benimoff [237] for instance discussed the importance of perspective-taking, in which the speaker's attempt to take the listeners' background knowledge into account facilitates comprehension. Looking at cognitive load and the underlying memory processes, Baldis [35] investigated the benefit of spatial audio reproduction on the listeners' degree of recognizing what each of the individual participants said, referred to as Focal Assurance. Other work re-evaluated this study, e.g., [34], or picked up the aspects of cognitive load and focal assurance and investigated them in conjunction with complementary speech quality assessment questions [32], [33].
In terms of perceptual assessment methods, the studies cited above on cognitive load and focal assurance used both direct ratings and memory tasks, which were later also included in ITU-T Recommendation P.1310 [332]. In terms of objective assessment methods, a large body of literature is dedicated to the task of automatic speaker recognition and speaker identification, see, e.g., [350]- [356] for recent overviews. This body of methods can serve as basis for linking automatic speaker recognition with instrumental QoE assessment similar to [357]. This could complement existing work on direct, speaker-independent quality predictions from speech signals such as [324], [358], [359].
Another body of relevant work looks at the perception of personality using either perceptual or instrumental assessment methods. On the one hand, work investigated and predicted the link between personality traits and speech signals, e.g., [360]- [363], or the listeners' ability to recognize speakers over quality-impaired telecommunication channels, e.g., [357]. On the other hand, work investigated the link between perceived communication behavior, for example as a result of transmission delay, and the perceived personality, e.g., [197], [364].

VII. SURVEY ON CURRENT TRENDS CONCERNING TELEMEETINGS FROM A QOE PERSPECTIVE
After the overview of telemeeting QoE assessment methods presented in the previous section, this section provides more insights into the question of how far today's QoE assessment methods already cover near-future telemeeting systems. For that reason, this section looks at a number of relevant technological developments and, with a focus on XR-based telemeeting systems, it discusses a number of challenges concerning the QoE enabled by such systems as well as the corresponding QoE assessment methods.

A. FROM PLAIN OLD TELEMEETINGS TO EXTENDED REALITY (XR) & SOCIAL XR
Despite the progress in the past few decades, existing telemeeting solutions still have a number of drawbacks and restrictions that limit the users' communication experience. In this section, we revisit some of the QIFs discussed in Section IV from a technology development perspective. The goal is to discuss the aspects of QoE that near-future mediated communication solutions are likely to consider. As stated in [365], most video conferencing tools [...] are geared toward voice-heavy, video-heavy, or PowerPoint-driven communications rather than collaboration.
New developments in immersive communication in Virtual Reality (VR), Augmented Reality (AR), or Mixed Reality (MR) environments can close the gap in communication systems to allow more natural remote (computer-mediated) communication [366], as well as possibly allowing completely new forms of communication and interaction [367]. To evaluate such immersive remote communication, the authors consider social presence or co-presence as one of the key constituents to estimate the QoE of users, and as such how well an immersive telemeeting system can reproduce natural interactions [141].
The legacy videoconferencing systems discussed up to here in this paper have become a true alternative to physical meetings and traditional telephony. The usage of videoconferencing systems reached another level as a result of the Covid-19 pandemic during the years 2020 and 2021, when the world's population was forced to apply physical distancing as a strategy to fight the dissemination of the virus [368]. As a consequence, social presence and novel ways of mediated communication and virtual activities were sought more than ever before, see e.g., [369] on the tradeoff between physical and virtual activities.
Thus, with a strongly enhanced need for remote working and virtual get-togethers, there are many incentives throughout the telecommunication industry and research landscape to mitigate the drawbacks of current telemeeting solutions. Prolonged use of videoconferencing systems has been found to be taxing on the HIFs (see Section IV) of the telemeeting QoE and may result in fatigue and increased cognitive load due to the unnatural communication setting, reduced mobility, and the additional effort required to send and receive non-verbal communication, an effect dubbed Zoom- or videoconferencing fatigue [1], [131]-[133], [266]. As several system influence factors mediate videoconferencing fatigue (see e.g., [1]), one solution is to improve the existing videoconferencing tools and streamline the communication experience (as discussed e.g., in [132]). Another direction is to create new solutions for the future that increase the naturalness and social presence as well as co-presence [370] (that is, the feeling of being in a place with one or more other persons at the same time) of mediated communication [138], [139]. These developments are aligned with recent advances in immersive technologies, in particular those enabling an eXtended Reality (XR) experience, that are promising to improve the immersion and presence in shared media consumption and communication [251]. This way, some of the interaction effects between QIFs related to the system (SIFs) and human users (HIFs) of existing videoconferencing systems can be overcome. In order to successfully do so, the near-future XR systems must address a number of relevant SIFs, HIFs, and CIFs, reflecting the expected increase of data bandwidth, the need for real-time user tracking and novel system interfaces, for example. XR is a term referring to all types of environments that employ Virtual, Augmented or Mixed Reality (VR/AR/MR) technology, and human-machine interactions enabled through computer technology and wearables. Here, the "X" represents a variable for any current or future spatial computing technology, or simply the "X" in eXtended Reality. One main differentiating factor for XR is the number of Degrees of Freedom (DoF) of user exploration and interaction, which expresses the level of freedom a user has to look at different parts and angles of the media content. The DoF range from head-rotation-only 360-degree video (3-DoF, head movements in terms of pitch, yaw, and roll) to full movement as 6-DoF (3-DoF plus three translatory coordinates x, y, z) and offer different degrees of immersion (for more see e.g., [371]). When referring to XR systems that are designed for immersive communication, this paper refers to such technologies as Social XR, a term that is used both in industry, e.g., [372], and in science, e.g., [373].
VOLUME 4, 2016 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

B. STRUCTURED OVERVIEW OF CURRENT TRENDS IN SOCIAL XR
When looking at the past developments and current trends in Social XR, the authors observe two main types of target experiences and two main lines of technology development. In terms of target experiences, developments either aim for getting XR as similar as possible to reality (i.e., extending or replicating reality) or aim for allowing experiences that are not possible in reality (e.g., by being partly or completely different by design, for example by enabling gaming-type functionalities such as being in certain, remote virtual places, or teleporting oneself between places). XR solutions may be applied for different telemeeting purposes, reflected in the CIFs such as the communication scenario and environment. Here, Social XR has a wide range of targeted and in part concretely specified use cases. Figure 10 provides a nonexclusive list of the most relevant use cases, partially based on [365], [366], [371], [374]. Several of these use cases substantially differ from each other. Therefore, it is currently hard to imagine a simple one-size-fits-all technical solution that will satisfy all requirements for all use cases. This can be illustrated when considering VR in comparison to AR, and corresponding differences and abilities with respect to rendering, the realization of meeting spaces with co-presence in different real or virtual environments, and the enabled degrees of freedom in movement and interaction with users.

[Figure 11 axes: Level of Immersion; Level of Photo-realism]
In terms of technology, developments either extend conventional video conferencing solutions [377] or aim for a completely new volumetric technology [377], [378], which brings them perhaps also closer to gaming technology. These technology development paths have profound effects on various SIFs, beginning from setting up and controlling the telemeeting to the aforementioned media richness aspects (see Section IV).
Another approach to characterize the technological evolution of communication systems towards Social XR is to look at the various improvements from two specific angles: Level of Realism, and Level of Immersion, as illustrated in Figure 11.
A high level of immersion at a low level of realism, i.e., based on computer graphics, is particularly impacted by principles of the computer gaming industry. One relevant development was Second Life (e.g., [375]), which offered a massive multiplayer immersive communication experience. With recent advances in VR Head-Mounted Display (HMD) technology, this resulted in several solutions to offer immersive VR experiences. An example of such VR communication platforms is Facebook Horizon, as the successor of Facebook Spaces [376]. With respect to the QIFs, increasing the level of immersion (for definitions, see, e.g., [139]) is intrinsically linked with one sub-category of HIFs, namely the state inside the individual participant, feeling immersed and possibly present in a certain environment.
Looking at the increase in the level of immersion with a high level of realism, the existing, legacy computer-based videoconferencing services can be mentioned. One reason for their success is that these services aim for high video and audio quality, sometimes augmented with more advanced spatial audio capabilities, aiming to improve the media richness aspects, one sub-category of SIFs. Such tools are increasingly combined with further messaging and team-meeting capabilities, so that different teams may be created that can easily launch brief video-meetings if needed. Even more immersive videoconferencing solutions based on legacy technology exist, for example presenting visual information using a projection-based CAVE system (Cave Automatic Virtual Environment) [379], or telepresence systems that use life-size displays spatially arranged around a common meeting table [380]-[382]. The intention of such systems is to increase interaction and more natural conversation by positioning projectors and screens to render users in life size.
For the future, setups such as Holoportation from Microsoft [383] promise to allow full body volumetric capture, transmission, and rendering of the user's body, usually referred to as holographic projection. As a consequence of such developments, different standardization bodies are now starting new work items focusing on technical specifications for "fully virtual meetings" using holographic projection and aiming for setups including hotel halls, stadiums, congress centers, etc.
For these developments, placement and proxemics issues become important aspects of QoE. Placement refers to the relative location of different users and in particular to the question of how to place participants in XR so that they experience the same room when they are actually situated in highly dissimilar physical rooms/environments [384]. Proxemics refers to the requirement that users should respect each other's personal spaces, as the perceived interpersonal distance (proximity) is a significant determinant of social presence and quality of communication in immersive VR [385].
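The proxemics requirement can be illustrated with a short Python sketch that checks the interpersonal distances between participants placed in a shared virtual room. Everything specific in it is an assumption made for illustration: the function name `personal_space_violations`, the participant names and coordinates, and in particular the personal-space radius of 0.5 m, which is a placeholder and not a value taken from the cited literature.

```python
import itertools
import math

# Hypothetical personal-space radius in meters (illustrative value only).
PERSONAL_SPACE_M = 0.5


def personal_space_violations(positions, radius=PERSONAL_SPACE_M):
    """Return (name_a, name_b, distance) for each pair closer than radius.

    positions maps a participant name to (x, y, z) coordinates in meters
    within the shared virtual room.
    """
    violations = []
    for (a, pos_a), (b, pos_b) in itertools.combinations(sorted(positions.items()), 2):
        distance = math.dist(pos_a, pos_b)
        if distance < radius:
            violations.append((a, b, distance))
    return violations


# Hypothetical placement of three participants.
placement = {
    "alice": (0.0, 0.0, 0.0),
    "bob": (0.3, 0.0, 0.0),
    "carol": (2.0, 0.0, 1.0),
}
for a, b, d in personal_space_violations(placement):
    print(f"{a} and {b} are only {d:.2f} m apart")
```

A placement algorithm could use such a check to reposition avatars before rendering, or an evaluation could log the violations as one indicator alongside ratings of social presence.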

C. RENDERING TECHNOLOGY IN SOCIAL XR
One key component of Social XR, like any other XR application, is the rendering technology used. From a QoE perspective, the rendering technology determines a number of SIFs that can be considered in view of the media richness theory discussed in Section IV.
Rendering technology in the context of Social XR applications can be clustered along two dimensions. One dimension is the enabled DoF, ranging from 2D screens with the users confined to a rather narrow field of view in front of their screens [132], typically with non-spatial audio, over 360 VR (3-DoF) to 6-DoF VR or AR, including representations of the other participants that more plausibly integrate with the virtual (VR) or real environment (AR), including either head-tracked headphone- or loudspeaker-based spatial audio. The other dimension is the user representation, ranging from artificial avatars to photorealistic representations based on conventional video capture, or video- or geometry-based Point Clouds, or other volumetric representations [377], [378], [386].
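The DoF dimension can be made concrete with a small data structure. The following Python sketch is illustrative only: the `Pose` class and the `track` function are hypothetical names, and the units (degrees and meters) are assumptions. It shows that a 3-DoF renderer uses only the head rotation (pitch, yaw, roll) of a tracker sample, whereas a 6-DoF renderer additionally uses the translation (x, y, z).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pose:
    # Rotational components in degrees (the three DoF of head rotation).
    pitch: float = 0.0
    yaw: float = 0.0
    roll: float = 0.0
    # Translational components in meters (the additional DoF for 6-DoF).
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


def track(current: Pose, sample: Pose, dof: int) -> Pose:
    """Update the rendered viewpoint from a tracker sample."""
    if dof == 3:
        # Rotation only: the viewpoint keeps its current position.
        return replace(current, pitch=sample.pitch, yaw=sample.yaw, roll=sample.roll)
    if dof == 6:
        # Rotation and translation are both taken from the sample.
        return sample
    raise ValueError("dof must be 3 or 6")


start = Pose()
sample = Pose(pitch=10.0, yaw=20.0, roll=0.0, x=0.5, y=0.0, z=-1.0)
print(track(start, sample, dof=3))
print(track(start, sample, dof=6))
```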
For visual information, the term rendering defines the automatic process of generating digital images from three-dimensional models. A rendering engine can simulate an almost infinite and hence real-life-like range of illumination and color settings. However, current displays (movie screens, computer monitors, etc.) cannot handle the required peak luminance, contrast ranges and color gamut settings, so that some of the information must be discarded or compressed, reducing the resulting scene naturalness. Here, the fact that the human visual system also has its limits can help to suggest which short-cuts could be used in the rendering process to overcome technical limitations without a noticeable difference in user perception [387].
In addition to visual rendering, audio rendering in XR telemeetings may involve a dedicated positioning of the audio objects, representing the conversation partners at spatial locations that match the visually rendered scene and that therefore appear more natural to the user. A first step in this direction may be achieved by extending traditional telemeetings with spatial audio reproduction techniques [388]. Headphone playback without spatialization results in audio objects that appear localized inside the listener's head, or that all appear co-located in the same position, or with a reduced audiovisual spatial congruence of their perceived auditory and visual stimulus components. The QoE-related benefit of spatial audio rendering in telemeetings [32]-[35] as well as the spatial alignment of audio and video rendering [389], [390] have been investigated in the past. Professional telemeeting solutions with spatial audio have been introduced to the market as well, such as Bluejeans [391], BT MeetMe [392] or the former Cisco telepresence solutions TX9000 [393] and IX5000 [394].
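As a deliberately simple illustration of positioning talkers at distinct locations, the Python sketch below derives left and right channel gains from a talker's azimuth using constant-power (sine/cosine) amplitude panning. This is a much simpler technique than the binaural or loudspeaker-based spatial rendering discussed in the cited work, and it is used here only to show the principle. The function name `constant_power_gains`, the sign convention (negative azimuth is to the left), and the seating angles are assumptions.

```python
import math


def constant_power_gains(azimuth_deg):
    """Map an azimuth in [-90, 90] degrees to (left, right) gains.

    Assumption: -90 is fully left, 0 is center, +90 is fully right.
    The squared gains always sum to one, so the overall power stays
    constant as a talker is moved across the stereo image.
    """
    azimuth = max(-90.0, min(90.0, azimuth_deg))
    theta = (azimuth + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)


# Hypothetical seating of three remote talkers around a virtual table.
seats = {"talker_1": -60.0, "talker_2": 0.0, "talker_3": 60.0}
for name, azimuth in seats.items():
    left, right = constant_power_gains(azimuth)
    print(f"{name}: azimuth {azimuth:+.0f} deg, left {left:.3f}, right {right:.3f}")
```

In a system with video, the azimuth of each talker would be chosen to match the position at which that talker is shown, which corresponds to the audiovisual spatial congruence mentioned above.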

D. CHALLENGES IN SOCIAL XR
Despite the progress in Social XR, a number of technical challenges partially remain: In order to achieve perceptually plausible localization results with binaural headphone-based reproduction that includes appropriate externalization (e.g., [395]), head tracking or the usage of reasonably individualized head-related transfer functions become essential [396]. There are a number of further aspects that may become key factors in high-quality XR telemeetings, but are still open challenges, for example, simulating the acoustics of a virtual meeting room, embedding a virtual audio object in a real acoustic environment in an AR use case, removing the acoustic cues of the physical room where the speech signal is captured (see e.g., [395], [397], [398] on understanding such cues from a QoE perspective), and removing any unwanted background noise.
Besides an improved auditory scene analysis and support in solving the Cocktail Party problem [258], the interaction with others will also become more natural with spatial audio, beyond the spatialization and fixation to a 2D video screen [388]. Here, users can turn to others like they would, for example, during a face-to-face meeting or at a party, to engage in a temporary, smaller-group interaction.
Moreover, not only linguistic, but also nonverbal communication can be enhanced with XR-based technical mediation [399]: Facial expressions and especially bodily gestures can better be captured, transmitted, and displayed in XR. In the future, systems will likely even enable eye contact, like when being face-to-face in the same space. XR and a more holistic capture and display will also facilitate turn taking, since more cues indicating the intention to take the floor can be communicated.
With respect to user representation, a collection of rapidly developing technologies, including a suite of artificial intelligence (AI) tools, next-generation game engines, and augmented reality technology, bring on a new era of artificially intelligent avatars. An avatar, in this case, refers to any kind of user representation, either as a real person or as an artificial, simulated user agent. Directly related to artificial avatars is the uncanny valley effect [400]. The uncanny valley describes an observation of human perception where a certain, yet imperfect level of human likeness of an avatar causes negative emotions and discomfort for users. While both a low and a very high level of human likeness are perceived positively, some level in-between is perceived negatively. Thus, the problem really starts when one combines or reproduces photorealistic representations of humans with computer-generated content. Interestingly, there also appears to be an "Uncanny Valley of Telepresence": the user's sense of telepresence (the illusion of "being there") increases with simulation quality up to a turning point, after which it begins to deteriorate, probably because the user's expectations start to exceed the actual affordances provided by the system [399].
Many technical advances have been made to unify XR platforms and devices (e.g., [401]), to capture virtual environments and users in photo-realistic quality, as well as to encode, store, and transmit 3D data (see [371]). Still, many technological limitations and challenges exist on each part of the XR ecosystem (i.e., XR frameworks, systems, and end-devices) [402]. Two particular new technologies that are expected to improve XR in mobile scenarios are 5G and remote rendering (in the cloud or at the network edge), especially as one can expect any XR device to be lightweight and thus potentially low-powered [381]. Both 5G and remote rendering, individually and together, will allow resources to be shifted from the end devices into the system, and thus increase the rendering quality and performance of XR applications.
At this point it should be noted that this section was written as an initial overview of the technological trends and challenges of XR-based communication, and is by no means complete. A follow-up, in-depth paper will address this topic in more detail. In the present paper, the aforementioned concise technology review shall serve as the basis for the following, initial analysis of QoE assessment for XR-based communication.

E. QOE ASSESSMENT OF XR-BASED TELEMEETINGS AND SOCIAL XR
Understanding QoE in relation to Social XR is largely an open challenge. Partly, Social XR shares QoE Constituents with VR and AR, where, for example, measuring simulator sickness or quantifying the level of spatial presence have received a lot of attention [184], [403]. In the domain of AR and VR assessment, different assessment techniques are known, ranging from direct methods using questionnaires, e.g., regarding presence [175] or simulator sickness [183] to indirect methods using, for instance, physiological measurements [404], [405] or task performance, e.g., using wayfinding analysis [406], [407]. Furthermore, it is obvious that the underlying aspects of spatial auditory, visual and audiovisual perception and QoE evaluation play a role. Here, so far, only a few systematic or standardized assessment approaches exist. A set of aspects relevant in this regard is contained in the Profile Template instantiated in Section VIII-A.
Two other constituents that Social XR brings to the attention of developers and researchers are co-presence, that is, the experience of being with others [137], [273], and social presence, that is, the feeling of co-presence and having an affective and intellectual connection with other persons [140], [141]. At the same time, a challenge for Social XR is the broader societal acceptance of being virtually and thus socially present in one location while being physically present in another, without being in contact with those in that physical environment. This might happen, for example, in the case of attending a virtual conference that may span over multiple days and occur in a different time zone, disrupting the daily routines of one's physically co-located social group, such as the family.
Lastly, an important challenge to consider is the ethics of XR use [408], [409]. XR telemeetings may make it easy to forget the rules of human interaction and enable immersive experiences that might be harmful or unpleasant. This can happen via inappropriate communication behavior due to cultural differences of the participants or due to the lack of physical co-presence and the resulting behavior mediation -see, e.g., [410] on rudeness in social media or [411] on rudeness in physical and computer-mediated work contexts. Or, this can happen due to errors and inappropriate decisions concerning the system design and development, see, e.g., [412] for existing industry guidelines on creating respectful, safe, inclusive, and accessible XR environments. One possible counter measure is the introduction of impenetrable personal zones to prevent that people can invade each others personal space, see, e.g., [413]. To summarize, an XR system, for which the level of realism can be controlled and contentinduced risk is minimized, may be a key to a high-quality Social XR for all populations. For a review on these aspects, see [414].
Similar to the technological trends discussed before, this QoE-related section is intended as an entry point to the field of QoE assessment of XR-type telemeetings, indicating how prior work on VR and AR evaluation can form a basis for the case of interactive Social XR systems. A forwardlooking analysis of telemeetings will be addressed in more detail based on the different projects and research activities running, for example, in the authors' different labs and institutions, and accompanying standardization activities in ITU-T Study Group 12 (Questions Q7, Q10 and Q13/12) and other standard development organizations (e.g., 3GPP SA4 IVAS project).

F. IMPACT AND FUTURE OF SOCIAL XR
First of all, it is important to stress that one should not regard Social XR as a replacement technology for any of the other existing communication channels. Telephone calls and traditional video conferencing will clearly retain their value, at least in the near- to mid-term future. However, with the further development of immersive communication and Social XR, many new use cases and a more natural interaction with high social presence will be possible [415], [416].
For the future, the authors expect that virtual and augmented reality and the real world can blend into each other and will completely change the way we experience mediated communication in general, and multiparty communication in particular. Here, XR communication has the potential to transform the everyday communication of people: On the one hand, by allowing new forms of communication in digital worlds that are currently not possible, and, on the other hand, by allowing better and more natural, intuitive communication between people. Technological breakthroughs towards more natural telemeetings might come in the form of understanding and modeling interaction intent (e.g., taking a holistic perspective of the Cocktail Party problem [258]), enhancing back-channel communication to disambiguate uncertainty (e.g., eye and face tracking to model gaze and facial expressions, posture tracking), novel interfaces beyond visual and auditory modalities, or adaptive and personalized communication systems based on user actions and feedback (e.g., individualizing audio delivery, correcting hearing or vision impairments).
This can have a direct impact on the life of tomorrow and may lead to a more sustainable future. To name a few benefits: inclusion of the elderly, inclusion of people with disabilities, reducing unnecessary travelling by providing adequate telemeeting alternatives, breaking communication barriers, or creating more awareness of world problems such as diversity, climate change, or populism by virtually transporting people to the actual place of events and enabling them to witness issues with their own eyes, a concept that falls under the prospect of immersive journalism [417]. However, to elevate XR technology to this level, we need a much better understanding of the underlying user requirements and QoE in XR.

VIII. TOWARDS A HOLISTIC EVALUATION OF TELEMEETING QOE
Up to this stage, we have discussed the results of an extensive survey on the ingredients of telemeeting QoE, and have provided a short outlook on the future of telemeeting technology in the form of Social XR. One next, application-oriented step for a technical exploitation of this body of knowledge is to answer the question of how these ingredients can be assessed in practice for a given telemeeting system. For that reason, this section builds on the survey on QoE assessment approaches by discussing a novel approach to characterize telemeetings from the holistic perspective endorsed in this paper.

A. PROFILE TEMPLATE FOR CHARACTERIZING TELEMEETINGS
Many of the more conventional QoE evaluation methods mentioned in Section VI are tailored to a specific test scenario, focus on certain individual aspects of a telecommunication system, or have not been developed with modern (multiparty) telemeeting systems in mind. To account for these drawbacks, existing efforts to guide investigators to an appropriate perceptual QoE evaluation method for telemeetings (here in particular ITU-T Recommendations P.1301 and P.1310 [27], [332]) dissect the test cases at hand in order to identify the best matching existing evaluation method, as well as any potentially necessary adaptations of those methods. In that respect, those approaches already took the first steps towards a more holistic perspective on telemeeting QoE.
To attain a truly holistic perspective on telemeeting QoE, however, it is of great use to go one step further and provide a conceptual tool that allows a systematic, agreed-upon and therefore comparable characterization of telemeetings to be obtained. This leads to the concept of a Telemeeting Profile Template, that is, a structured list of aspects that (a) characterize telemeetings and (b) are relevant from a QoE perspective. Benefits of such a characterization are, for instance, guidance when choosing an appropriate QoE assessment method, a set of descriptors for a precise communication about telemeeting QoE, and a means to develop a taxonomy of telemeetings. See Section VIII-C for further elaborations.
The Telemeeting Profile Template can be seen as a kind of check list containing attribute-value pairs. Accordingly, the Telemeeting Profile Template has two main columns: The first represents the list of characteristic aspects (the attributes); the second contains possible instantiations for each aspect (the values). As an example, one attribute in the first column is the communication modality, and the possible values are audio, visual, audiovisual, tactile, text-type, and graphics information for current and future, multisensory telemeeting systems [251], [418]-[420]. Evidently, combinations of values are possible as well: Modern telemeeting systems allow different communication modalities to be combined, e.g., audiovisual communication with additional text chat. With respect to the list of identified attributes, the authors opted to refer to the Quality Influence Factors (QIFs), see Section IV, and use them as a tool to characterize telemeetings. As a consequence, this list of QIF-type attributes was developed in the systematic way outlined in Section III, which went hand in hand with the literature survey for Section IV.

TABLE 9: Explanation of the columns in the Telemeeting Profile Template provided in the supplementary material, i.e., the frozen version in [26] and the development version in [25].

Column No. | Column Heading | Description
A-E | Columns corresponding to the tables on Quality Influence Factors in this paper | These columns contain essentially the same information as in Tables 3 to 5 [...]
[...] | [...] | This column cites publications that show a rather direct link between the attribute and QoE. Additional comments in this column mention the essential aspect(s) of that link.
I | Other Scientific Evidence | This column cites publications that provide some supporting knowledge, although they may not show or discuss a clear proven link between the attributes and QoE. Additional comments in this column mention the supporting aspect(s).
J-L | Practical Relevance: More Detailed Info | These columns discuss the practical relevance of an attribute. This relevance was assessed by the authors with their different expertise in this field, while further feedback was given by experts from ITU-T Study Group [...]
With respect to the values that are used for each attribute, one challenge is to find a good balance between covering all different possibilities and keeping the Telemeeting Profile Template manageable and comparable. This is especially the case when an attribute refers to some technology aspect that can have many different implementations. For that reason, values for technical aspects are more suitable if they represent a higher-level description rather than terms referring to variants of concrete implementations; examples are monotic, diotic, stereo, binaural, and multichannel as values for the attribute spatial audio.
The resulting Telemeeting Profile Template is realized in the form of a large table which is provided in the supplementary material of this paper, i.e., a frozen, non-evolving version in [26] and a development version in [25], which can be modified, improved and extended, also based on the feedback from readers of the present paper, as further outlined in the following Section VIII-B. To give a better overview of the information that constitutes the Telemeeting Profile Template, Table 9 provides an explanation of the different columns used in the supplementary material.
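
To make the attribute-value structure more tangible, the following minimal Python sketch shows one possible way to store a single telemeeting profile and to check its values against a list of allowed values. It is an illustration only and not part of the supplementary material: the class `TelemeetingProfile`, the dictionary `ALLOWED_VALUES`, and the method `set_values` are hypothetical names, and only the two attributes and their values mentioned in the text above are included.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

# Allowed values per attribute. Only these two attributes and their values are
# taken from the text; the dictionary itself is an illustrative assumption.
ALLOWED_VALUES = {
    "communication modality": {
        "audio", "visual", "audiovisual", "tactile", "text-type", "graphics",
    },
    "spatial audio": {"monotic", "diotic", "stereo", "binaural", "multichannel"},
}


@dataclass
class TelemeetingProfile:
    """One telemeeting described as attribute-value pairs (hypothetical class)."""

    name: str
    attributes: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def set_values(self, attribute: str, *values: str) -> None:
        """Assign one or more values to an attribute; combinations are allowed."""
        if attribute not in ALLOWED_VALUES:
            raise KeyError(f"unknown attribute: {attribute!r}")
        unknown = set(values) - ALLOWED_VALUES[attribute]
        if unknown:
            raise ValueError(
                f"unknown value(s) for {attribute!r}: {sorted(unknown)}"
            )
        self.attributes[attribute] = frozenset(values)


# Example: audiovisual communication with additional text chat.
profile = TelemeetingProfile("video conference with text chat")
profile.set_values("communication modality", "audiovisual", "text-type")
profile.set_values("spatial audio", "stereo")
print(sorted(profile.attributes["communication modality"]))
```

Storing the values of an attribute as a set reflects that combinations of values may occur, while the check against a fixed list of allowed values keeps profiles of different telemeetings comparable.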

B. ONGOING DEVELOPMENTS CONCERNING THE TELEMEETING PROFILE TEMPLATE
This paper presents a first stabilized version of the Telemeeting Profile Template, more precisely the list of QIF-type attributes to be considered. The intention is to have a starting point for using and evaluating the Telemeeting Profile Template, in order to assess its validity and applicability. Moreover, additional work is necessary to obtain a set of concrete suggested values to complement the Telemeeting Profile Template. First suggestions by the authors can be found in the supplementary material of this paper [25], which is a commentable online document. Here, the authors plan to continue the development and invite interested researchers and practitioners to contribute. Going one step further, having a standardized set of attributes and values would be ideal to address this challenge. Here, further work in research and practical application is expected to help improve the list of values, which eventually could even lead to a standardized list of recommended attributes and values. For that reason, the authors plan to continue refining the Telemeeting Profile Template, also based on readers' feedback, and to contribute a more stable version to ITU-T Study Group 12 for consideration as a future standard.

C. WORKING WITH THE TELEMEETING PROFILE TEMPLATE
The individual deployment of the Telemeeting Profile Template depends on the actual use case. To illustrate the usage, the following paragraphs describe three possible application scenarios in more detail. The main target group considered in these three use cases are researchers and practitioners who are conducting QoE assessment campaigns of telemeeting systems, either during development or when the system is already in operation.

1) Finding an Appropriate QoE Assessment Method
In this use case, the Telemeeting Profile Template can help in two ways. On the one hand, a more systematic characterization of the telemeeting can help to specify the test scenario more precisely, which in turn helps to find an appropriate QoE assessment method more efficiently, using the pointers in the Telemeeting Profile Template as discussed in the previous paragraphs. On the other hand, such a detailed characterization of telemeetings can help to identify whether an existing method may be used without change, whether an existing method needs to be modified, or whether a new method needs to be developed.

2) Communicating about Telemeeting QoE
Since telemeetings can be very different in their character, a precise communication about them can become challenging. This is, for instance, the case when a researcher or practitioner is asked to report on a QoE assessment campaign of a telemeeting system. Especially when a comparison with other systems is requested, the reporting person needs to be able to correctly interpret results in the context of the respective use case and system instances. Here, the Telemeeting Profile Template can help to characterize the respective telemeetings regarding the QIFs that have been addressed in the assessment campaign. This in turn minimizes the risk of misinterpretation and miscommunication.
One main challenge, however, is to balance concise communication against the use of an extensive list of attributes. One possible approach is to separate the analysis/comparison step from the communication step, that is, to consider the full template to identify all relevant commonalities or differences, while focusing on a set of main aspects in the communication. Here, future feedback from researchers and practitioners is sought to improve the usefulness of the Telemeeting Profile Template.
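
The separation between the analysis/comparison step and the communication step can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration rather than a prescribed procedure: the function names `compare_profiles` and `select_for_report`, the two example systems, and the attribute "text chat available" are hypothetical; only the attributes communication modality and spatial audio are taken from the text above.

```python
def compare_profiles(profile_a, profile_b):
    """Analysis step: split all attributes into commonalities and differences."""
    shared, differing = {}, {}
    for attribute in sorted(set(profile_a) | set(profile_b)):
        value_a = profile_a.get(attribute)  # None if the attribute is not set
        value_b = profile_b.get(attribute)
        if value_a == value_b:
            shared[attribute] = value_a
        else:
            differing[attribute] = (value_a, value_b)
    return shared, differing


def select_for_report(differing, main_aspects):
    """Communication step: keep only the differences in the chosen main aspects."""
    return {a: v for a, v in differing.items() if a in main_aspects}


# Two hypothetical systems; "text chat available" is an invented attribute.
system_a = {
    "communication modality": {"audiovisual"},
    "spatial audio": {"diotic"},
    "text chat available": {"yes"},
}
system_b = {
    "communication modality": {"audiovisual"},
    "spatial audio": {"binaural"},
    "text chat available": {"no"},
}

shared, differing = compare_profiles(system_a, system_b)
reported = select_for_report(differing, main_aspects={"spatial audio"})
print(sorted(shared))
print(sorted(differing))
print(reported)
```

The full comparison remains available for the analysis, while the report is restricted to the aspects that were actually addressed in the assessment campaign.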

3) Developing a Taxonomy of Telemeeting Systems
This use case picks up an underlying aspect of the two previous use cases: the potential benefit of a brief but precise categorization of telemeeting systems.
The purpose is to deploy the Telemeeting Profile Template as the basis of a taxonomy. When selecting values that characterize different systems or the telemeetings typically held across these, the taxonomy-type character of the template becomes apparent.
One possible direction for further work could hence be a data-driven approach: (1) characterize a representative number of different telemeetings using the Telemeeting Profile Template; (2) run a data analysis to identify a set of attributes that appear to be strong discriminators; (3) construct a first visualization of categories and the systems that belong to these, along with the set of attributes. As a result, the categories could be employed to analyze aspects such as system acceptance, user groups typically employing these, or features that may be missing in specific cases. Another direction of future work is to systematically analyze mutual dependencies between attributes.
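
The three steps of such a data-driven approach can be sketched as follows. This is a minimal illustration under stated assumptions, not an analysis performed by the authors: the Shannon entropy of the value distribution is used here merely as one simple, hypothetical indicator of how strongly an attribute discriminates between systems, and the four systems, the attribute "text chat available", and the function names are invented for the example.

```python
from collections import Counter, defaultdict
from math import log2

# Step 1: hypothetical profiles, one value per attribute for simplicity.
profiles = {
    "system A": {"communication modality": "audio", "spatial audio": "diotic",
                 "text chat available": "yes"},
    "system B": {"communication modality": "audio", "spatial audio": "diotic",
                 "text chat available": "yes"},
    "system C": {"communication modality": "audiovisual", "spatial audio": "binaural",
                 "text chat available": "yes"},
    "system D": {"communication modality": "audiovisual", "spatial audio": "diotic",
                 "text chat available": "yes"},
}


def value_entropy(profiles, attribute):
    """Shannon entropy (in bits) of the values observed for one attribute."""
    counts = Counter(p.get(attribute, "unspecified") for p in profiles.values())
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())


def rank_attributes(profiles):
    """Step 2: order attributes from most to least discriminating."""
    attributes = sorted({a for p in profiles.values() for a in p})
    scored = [(a, value_entropy(profiles, a)) for a in attributes]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def categorize(profiles, attributes):
    """Step 3: group systems that share the same values for the given attributes."""
    groups = defaultdict(list)
    for name, profile in profiles.items():
        key = tuple(profile.get(a, "unspecified") for a in attributes)
        groups[key].append(name)
    return dict(groups)


ranking = rank_attributes(profiles)
top_attribute = ranking[0][0]
categories = categorize(profiles, [top_attribute])
for attribute, score in ranking:
    print(f"{attribute}: {score:.3f} bit")
print(categories)
```

An attribute that takes the same value for all systems receives an entropy of zero and therefore contributes nothing to a categorization, whereas attributes with evenly spread values are candidates for the top levels of a taxonomy. Other measures, or a proper cluster analysis, may well be more appropriate for real data.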

IX. CLOSING REMARKS
Telemeetings have been and will remain important for our professional and private lives, and are likely to become even more important in the future. This is shown by the developments during the past decades, the current situation during the Covid-19 pandemic, and recent technological, economic, societal, and climate-protection trends. Given such a major role of telemeetings, system developers and service providers should enable an optimal experience for the user, while at the same time keeping technical and financial resources in check. For that reason, there is a need for understanding the detailed factors, that is, the ingredients, that contribute to the best possible Quality of Experience of telemeetings. And there is a need to be able to characterize those ingredients and their impact on QoE, both in a qualitative and a quantitative manner. Moreover, it is beneficial to understand in which directions the current technology developments are heading.
To address such needs, this paper analyzes current and near-future telemeeting services from a QoE perspective. In the first part of the paper, the authors provide an extensive survey of the numerous factors and processes that contribute to the QoE of telemeetings in order to achieve a holistic understanding of telemeeting QoE. As a next step, the paper introduces the current state of the art of QoE assessment of telemeetings as well as ongoing developments. Then, the authors provide a glance towards the near future, where immersive technologies are considered to enable a new form of Social XR telemeetings with an improved experience of co-presence. Social XR will bring about new interaction interfaces, modalities, and types, which will require QoE evaluation methods beyond the current standards. To conclude the survey and technology outlook, the authors present the Telemeeting Profile Template. It is a tool for practical guidance on telemeeting analysis and QoE assessment, intended to help with finding an appropriate QoE assessment method and creating a unified language for communicating about telemeeting QoE.
With the provision of a commentable, online version of the Telemeeting Profile Template [25], the authors wish to foster exchanges with other researchers and practitioners in the field, so as to expand the body of knowledge on telemeeting QoE assessment. In follow-up research and development work in their different institutions, the authors are currently investigating how to best evaluate the QoE of future, Social-XR-type telemeetings, as will be described in corresponding future publications.
To wrap up, telemeetings represent a highly multidisciplinary field; they play an important role in our lives; they undergo promising technological developments; and they have the potential to provide global access to communication with other people, education, knowledge, and culture. The authors look forward to the upcoming forms of telemeetings in terms of technology, Quality of Experience, and usage scenarios and hope that this paper helps the interested reader to dive into a field that affects people and technology at the same time. The story of telemeetings: to be continued.
JANTO SKOWRONEK is currently managing director of the research thrust Smart Technologies, Processes and Methods at the Hochschule für Technik Stuttgart - University of Applied Sciences in Stuttgart, Germany. In this role he is responsible for both strategic development and background operations of the research thrust in which about 30 professors and about 30 research staff members are active. Moreover, Janto is also coordinating a development team that is creating a transfer platform for the university and he is engaged in various activities for improving work processes in the university with digital tools. Especially for the latter role, Janto draws from his scientific expertise on Quality of Experience of Telemeetings, which he built up during his research assistant and post-doc time at TU Ilmenau, Germany, from 2015 to 2018 and his PhD research at TU Berlin, Germany, from 2010 to 2015. Since 2012, Janto has been co-rapporteur of Question 10 "Conferencing and telemeeting assessment" of ITU-T Study Group 12