A Metaverse: Taxonomy, Components, Applications, and Open Challenges

Unlike previous studies on the Metaverse, which were based on Second Life, the current Metaverse is based on the social value of Generation Z that online and offline selves are not different. With the development of deep learning-based high-precision recognition models and natural generation models, the Metaverse is being strengthened by various factors, from mobile-based always-on access to connectivity with reality through virtual currency. The integration of enhanced social activities and neural-network methods requires a new definition of the Metaverse suited to the present and distinct from the previous one. This paper divides the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents) and three approaches (i.e., user interaction, implementation, and application), rather than taking a marketing or hardware approach, to conduct a comprehensive analysis. Furthermore, we describe essential methods based on the three components and three approaches as applied to representative Metaverse cases (Ready Player One, Roblox, and Facebook Research) in the domains of films, games, and studies. Finally, we summarize the limitations and directions for implementing the immersive Metaverse as social influences, constraints, and open challenges.


I. INTRODUCTION
The Metaverse is expanding rapidly, as seen in ZEPETO serving 200 million subscribers and an election campaign being run in the virtual space of Animal Crossing. In particular, Roblox has 150 million monthly active users (MAU); it is used by 2/3 of children aged 9-12 in the US, and 1/3 of its users are under 16 [1]-[3]. Early studies of the Metaverse focused on Second Life in 2006 [4]-[6]. However, the current Metaverse is based on the social values of Generation Z, for whom the online self is no different from the offline one [7]. Because the proportion of social activities and content has grown, the current Metaverse differs from the previous one, and a new definition is needed for the present.
The novel Metaverse differs from the earlier Metaverse in three ways. First, the rapid development of deep learning has dramatically improved the accuracy of vision and language recognition, and the development of generative models enables a more immersive environment and more natural movement. Processing time and complexity have been reduced by using multimodal pre-trained models as end-to-end (E2E) solutions. Second, the Metaverse was previously served through PC access and had low consistency due to time and space constraints, but it can now be accessed easily anytime and anywhere through mobile devices that are always connected to the Internet. There are 50 million games in Roblox, and the accumulated monthly usage time is 3 billion hours. People spend more time there than on social network services (e.g., TikTok, YouTube). Roblox has a virtuous-cycle ecosystem: as users and usage time increase while it serves varied content, the inflow and income of producers increase, and thus sales of digital advertisements increase. Lastly, the current Metaverse differs from the previous one because program coding can be done inside the Metaverse world, and it is more closely bonded to real life through virtual currency. The Metaverse expands with various social meanings (e.g., fashion, events, games, education, and office work) based on immersive interaction. Cryptocurrencies (e.g., Diem) serve as an economic bridge between the Metaverse and the real world, giving people deeper social meaning.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudhakar Babu Thanikanti.
The Metaverse differs from augmented reality (AR) and virtual reality (VR) in three ways. First, while VR-related studies focus on a physical approach and rendering, the Metaverse has a strong aspect as a service with more sustainable content and social meaning. Second, the Metaverse does not necessarily use AR and VR technologies; even if a platform does not support VR and AR, it can be a Metaverse application. Lastly, a scalable environment that can accommodate many people is essential for the Metaverse to reinforce social meaning. Large-scale Metaverse implementation requires three components: (i) hardware improvements (e.g., GPU memory, 5G); (ii) the development of recognition and expression models that leverage the parallelism of the hardware; and (iii) the availability of content that people immerse themselves in and participate in.
Despite the considerable research relating to the Metaverse, existing studies primarily focus on social meaning, and little attention has been paid to technologies for the Metaverse. For example, a systematic approach is needed to identify what concepts and technologies are required to create an environment and content that users can enjoy, as in Ready Player One. Beyond simply creating a physical virtual space, the Metaverse should be able to provide an immersive experience with a story through user interaction. This research presents a comprehensive study of the applications and technologies that can give social meaning in the Metaverse, covering hardware, software, and content with three approaches (i.e., user interaction, implementation, and application).
Firstly, this study analyzes hardware, software, and contents at the component level to create an immersive experience in the Metaverse, as shown in Fig. 1. To give the user a sense of visual immersion, a lightweight head-mounted display (HMD) and physical auxiliary devices that can be used for a long time with high-resolution images are required [8]. In terms of software, as delay increases, dizziness and motion sickness occur due to sensory confusion, so low delay and fast rendering are important. In addition, since the Metaverse operates on a wide 360-degree field of view, large-capacity vision data processing and generative recognition of occluded objects are significant issues. There is a technical gap between hardware and software performance and users' expectations for the Metaverse. The natural movement of the graphical environment displayed in the Metaverse can give an immersive feeling. However, to provide a sustainable service, it is crucial to have immersive content that works even with limited hardware and low-resolution software (e.g., Minecraft). For sustainable Metaverse content, there must be a plot that considers various user interactions in the virtual environment. In other words, an approach in the form of a complete story (e.g., a movie or a drama) is needed rather than several dialogue turns. Because user-created content is not produced by large organized teams, methods that compensate for the lack of professional expertise (e.g., persona generation, cartoon generation) are needed. In addition, multimodal stories based on immersive interaction can be used effectively in the Metaverse to implement such interactive user scenarios.
Secondly, this study analyzes user interaction, implementation, and application at the approach level to provide a stable experience in the Metaverse, as shown in Fig. 2. Comprehensive recognition and interpretation through multimodal inference are required for interaction that can effectively utilize growing hardware and software performance. For example, human-robot interaction and visual-language interaction are similar in that both rely on egocentric views, and they can be used as element technologies for user interaction. For implementation, Metaverse environments are divided into service platforms (e.g., Roblox, Minecraft) and configurable environments (e.g., Unity). For the Metaverse to allow many people to live in the same space, infrastructure elements (e.g., wide-bandwidth network connections, fault management, and security) are also important. The details of each application and event also play an important role in composing the Metaverse. As applications and events (e.g., simulation, marketing, and education) become more concrete, people's activities will increase, and their playing time will gradually increase accordingly [9], [10].
Due to the wide scope of the Metaverse and the novelty of its components, we lack a clear understanding of how they work, why they are needed, and what they are capable of. To tackle these problems in depth, interdisciplinary collaboration and research with psychology and the social sciences are required.
This study has three main contributions as follows.
• A Metaverse taxonomy is proposed by summarizing technologies and is used to classify the studies of research institutes. We classify the Metaverse components and major approaches into hardware, software, contents, user interaction, implementation, and application. For each approach, we summarize the technologies that have recently attracted attention.
• We classify representative Metaverse cases (Ready Player One, Roblox, and Facebook Research) in films, games, and studies using the taxonomy defined above and describe the latest technologies and developments.
• Finally, problems and directions in implementing an immersive Metaverse are divided into social influences, restrictions, and open challenges.
The remainder of this study is organized as follows. In Section II, various definitions of the Metaverse and the avatar are arranged in chronological order. Section III describes the three components necessary to construct the Metaverse, and Section IV describes the high-level approaches to giving an immersive experience in the Metaverse. In Section V, we verify how the defined components and approaches are used through case studies on Ready Player One, Roblox, and Facebook Research. Influences, limitations, and open challenges are discussed in Section VI, and Section VII concludes the paper.

II. METAVERSE CONCEPTS
This section describes the concepts of the Metaverse, the avatar, and extended reality (XR) by contrasting them with similar concepts. The Metaverse refers to the virtual world in which the avatar acts, and the avatar is the user's alter ego and the active subject in the Metaverse. XR is the medium that connects avatars in the Metaverse with users in the real world.

A. METHODOLOGY
In this paper, we partially utilized systematic literature review (SLR) techniques to obtain reliable references [11]. The recursive procedure for selecting references is: 1) search by combining related keywords; 2) extract papers that contain the keywords in the title or body; 3) remove papers that contain the keywords but are not directly related to the Metaverse; 4) cluster related papers; and 5) configure the taxonomy. First, we extract keywords (i.e., Metaverse, Avatar, Extended Reality) for Metaverse concepts that are interpreted in various forms, as shown in Table 1. As depicted in Fig. 3, the Metaverse definitions and characteristics of each paper are analyzed in chronological order from a total of 260 papers, comprising 130 papers from Elsevier and 130 papers from Google Scholar, selected by relevance. Table 1 summarizes the definitions and main viewpoints of 54 papers that specifically describe the Metaverse [4], [7]-[10], [12]-[60].
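Steps 2) to 4) of the selection procedure can be illustrated with a minimal sketch. The data layout (`title`, `body`, `on_topic`), the `is_relevant` callback that stands in for the manual relevance judgment of step 3), and the rule of clustering by the first matching keyword are all assumptions made for illustration; they are not the tooling used in this study.

```python
from collections import defaultdict

# Keywords named in the text; lower-cased for matching.
KEYWORDS = ("metaverse", "avatar", "extended reality")


def paper_text(paper):
    """Concatenate title and body for case-insensitive keyword matching."""
    return (paper["title"] + " " + paper["body"]).lower()


def contains_keyword(paper, keywords=KEYWORDS):
    """Step 2: keep papers whose title or body mentions any keyword."""
    text = paper_text(paper)
    return any(k in text for k in keywords)


def select_references(papers, is_relevant):
    """Steps 2-4: keyword filter, relevance filter, then clustering.

    `is_relevant` is a hypothetical stand-in for the manual check of step 3.
    """
    kept = [p for p in papers if contains_keyword(p) and is_relevant(p)]
    clusters = defaultdict(list)
    for p in kept:
        text = paper_text(p)
        # Illustrative rule: cluster by the first keyword the paper mentions.
        label = next(k for k in KEYWORDS if k in text)
        clusters[label].append(p["title"])
    return dict(clusters)


# Toy records, invented for this example only.
papers = [
    {"title": "Metaverse for social good", "body": "virtual worlds", "on_topic": True},
    {"title": "Avatar (film) box office", "body": "revenue analysis", "on_topic": False},
    {"title": "Haptic gloves", "body": "an avatar hand model in extended reality", "on_topic": True},
    {"title": "Graph databases", "body": "indexing", "on_topic": True},
]
clusters = select_references(papers, is_relevant=lambda p: p["on_topic"])
print(clusters)
```

In this toy run, the second record is dropped at step 3) because it mentions a keyword without being related to the Metaverse, and the fourth is dropped at step 2) because it mentions no keyword.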
In Section III, we construct a component taxonomy (i.e., hardware, software, contents) necessary to build the Metaverse using the same procedure, with a total of 15 sub-categories. In a similar way, we construct an approach taxonomy (i.e., interaction, implementation, application) including 16 sub-categories for Metaverse approaches in Section IV. Finally, we choose representative services of the Metaverse and evaluate the taxonomy by mapping references to it. In the case of Facebook in particular, we review the papers published by Facebook Research from January to June 2021.
Duan et al. [7] presented representative applications in terms of infrastructure, interaction, and ecosystem. They also provided a three-layer Metaverse architecture and a brief timeline of Metaverse development. Messinger et al. [23] introduced a virtual world where thousands of people can interact simultaneously within the same simulated 3D space. Their work covered the perspectives of business, education, social sciences, technical sciences, and social computing that affect our society as a whole. Müller [33] defined the world as an electronic memory and the Internet as a virtual reality that users log in to every day, focusing on safely preserving information, the evaluation of data, and perception. Dionisio et al. [43] focused on immersive realism, the ubiquity of access and identity, interoperability, and scalability of the Metaverse. Nevelsteen [56] focused on an ontology that relates complementary terms and acronyms, and introduced the notion of pseudo-persistence to categorize technologies that only mimic persistence. Since the various Metaverse surveys mainly focus on applications and social meaning, comprehensive research on Metaverse technology is lacking. To compose the Metaverse, it is necessary to take a comprehensive view of the latest technology components, approaches, and services. We compare the Metaverse definitions of 54 other studies in Table 1 and deal with hardware, software, and contents in depth. In particular, we evaluate the proposed taxonomy through three different types of use cases.

B. METAVERSE
Metaverse is a compound of meta, meaning transcendence, and universe, and refers to a three-dimensional virtual world where avatars engage in political, economic, social, and cultural activities. It is widely used in the sense of a virtual world based on daily life where both the real and the unreal coexist [61]. The term Metaverse was first used in Neal Stephenson's science fiction novel Snow Crash in 1992 and referred to a world where the virtual and the real interact and create value through various social activities [62]. As the scope of the Metaverse is wide and continuously growing, various definitions and similar concepts exist. Lee et al. [63] divided it into life-logging, mirror world, augmented reality, and virtual world according to whether the implemented space is reality-oriented or virtual-centered, and whether the implemented information is centered on the external environment or on the individual. Previous studies of the Metaverse focused on the composition of the virtual world itself (e.g., games), but recently it is often described as a medium for exchanging interests and for social interaction centered on content.
A mirror world (e.g., Google Earth, Microsoft Virtual Earth) extends information into the virtual world by realistically reflecting the real world. The term originates from the book Mirror Worlds, written by David Gelernter in 1992 [64]. The real space where people live is reproduced in digital form, and additional simulation information is added. In other words, the mirror world replicates the appearance of buildings or objects in the real world but has its own properties and functions. Metaverse, multiverse, digital terraforming, and mirror world are conceptually similar and share some concepts, but have slightly different meanings depending on where they are used.

C. AVATAR
An avatar means an alter ego that has descended to the earth, and the term originated from the concept of a fundamental being (e.g., a god) changing its form to a human one. Previously, avatars in the virtual world took pre-defined, exaggerated forms rather than reflecting the real world. However, the avatar is gradually changing into an ideal form that projects the user's outward appearance and reflects the ego. An avatar performs a social role suited to its job and persona in the Metaverse. In particular, costumes and items in the Metaverse are used as a medium to express the social meaning of avatars, and various luxury clothing companies are paying attention to them and selling them. The younger generation considers the social meaning of the virtual world as important as that of the real world, because they regard their identity in virtual space and in reality as the same. The avatar, the subject of the Metaverse, has a meaning similar to the digital twin and the digital Me of the virtual world. A digital twin is a virtual model for predicting behavior [65]. Digital twins are used to create agents that resemble real objects in the virtual world and to predict outcomes in advance through simulations of situations that might occur in real life. Initially proposed by General Electric, the system combines data and information representing the contexts and processes of various physical entities to understand past and present operating states. It is used to maintain properties and states throughout the lifecycle of a digital twin and to predict what will happen in the future. It can optimize the physical world and is used in manufacturing and in various industrial and social issues to significantly improve operational performance and business processes. Digital Me is a symbolic expression of the ego in a digital world that differs from the actual self. Conceptually, the two differ in that the digital twin interprets the real self objectively, whereas digital Me interprets it subjectively.
VOLUME 10, 2022
FIGURE 1. Organization of the paper.
In terms of application, digital twins are used to solve current problems and simulate future outcomes. Digital Me, on the other hand, is a surrogate self that projects aspects of oneself that cannot be realized in real life.

D. EXTENDED REALITY (XR)
In terms of technology, XR is related to VR (virtual reality), AR (augmented reality), and MR (mixed reality). VR is used to act as an avatar in a digitally implemented three-dimensional world (e.g., ZEPETO). VR provides the experience of being in a specific place without physical limitations, helping users learn from the ideas gained by experiencing different places. While VR is a technology that allows a new reality to be constructed from 360-degree images, AR is a method of superimposing virtual objects on real space from a first-person perspective (e.g., Pokemon Go). AR overlays computer-generated images, sounds, 3D models, videos, graphics, animated sequences, games, and GPS information onto real-world environments [66], [67]. Users can visually search for objects and adjust interfaces through visually immersive content overlaid on the real world. In particular, AR has the advantage of clearly providing information and visualizing controllable devices without an additional screen.
MR integrates VR and AR. It is the concept of creating virtual objects that users can interact with in a 3D environment, combining the immersion of VR's virtual environment with AR's overlay of virtual content. AR provides a more realistic solution because its hardware is relatively simple, like glasses, and it reflects reality well, but it is suited to short content [68]. VR, on the other hand, covers the entire field of view, gives an immersive feeling, and is suited to long-term content, but entails physical fatigue. In some cases, MR, which mixes these advantages and disadvantages, is being considered as a solution that can switch between AR and VR on a single device. XR, or extended reality, is a term used to include VR, AR, and MR. XR is used in virtual commerce (v-commerce) to create computer-mediated indirect experiences [69].

III. METAVERSE COMPONENTS
The Metaverse provides an experience immersive enough to be used in psychotherapy for patients. People know that myths and novels are not real, but they are still moved by them. Similarly, the Metaverse is not the real world but can provide a tangible feeling, so services based on immersive, user-interactive stories can be provided. A representative example of such an approach is a game based on two-way interaction. To serve the Metaverse like the real world, users must be able to interact seamlessly and concurrently in an environment with presence. To maintain a sustainable Metaverse, economic activity between users based on these interactions must continue. In this section, we describe the Metaverse in terms of hardware, software, and contents from the component point of view.

A. HARDWARE COMPONENTS (PHYSICAL DEVICES AND SENSORS)
Hardware in the Metaverse not only plays an important role in the immersive experience but is also a technically limiting barrier. Metaverse hardware is improving quickly with technological advancement, but it still falls short of the experience of the real world. The essential hardware of the Metaverse is an HMD that blocks the view to enable immersive participation. For a more effective visual experience, Birnie et al. [70] proposed a foveated rendering method that keeps the central part of the view in high resolution, similar to human vision. Critical factors for physical devices and sensors are resolution, field-of-view size, and latency. Among them, the most important is latency, which plays a key role in multimodal interactions, so devices should be designed with the threshold for side effects and time gaps in mind.

1) HEAD-MOUNTED DISPLAYS (HMD)
The HMD shows images through its display and plays sound through its speakers [8]. The HMD is a basic input tool of the Metaverse and is divided into non-see-through HMDs, optical-see-through HMDs, and video-see-through HMDs [71]. The non-see-through type, which covers the user's view, provides a sense of immersion in a completely virtual world. The optical-see-through type (mainly used in AR) overlays the virtual world on the real view, and the overlaying process requires high hardware specifications. To complement this method, video-see-through HMDs are used. The main issues with HMD headsets are their bulk, cost, and short battery life. An HMD tracks position and orientation according to the movement of the head and moves the screen so that the view changes as it would in the virtual world. This is less accurate than estimating motion through external measurement, owing to accuracy and delay problems, but it is widely used because it saves space and cost.
FIGURE 4. The example of circular coordination and area for hand-based input device [72].

2) HAND-BASED INPUT DEVICE
Diverse circular coordinate layouts and input areas have been proposed for hand-based input devices, as shown in Fig. 4 [72]. Detailed user data modeling (e.g., mobile phone grip prediction) is required to convey the feel of a material through touch. Haptics are divided into passive haptics, which give the texture of real objects, and active haptics, which create virtual pressure. Passive haptics help the user understand the situation while giving a sense of presence, and active haptics enable more effective interaction by adjusting and delivering feedback according to the user's response. Using real props (e.g., physical degree and operational degree) in a virtual environment helps the user experience, while using a robotized interface allows for more diverse interactions [73]. Depending on how the device is installed, haptic devices are divided into those attached to the hand and those attached externally. Beyond conveying the feel of a material, haptics are used in various forms (e.g., inducing muscle tension).

3) NON-HAND-BASED INPUT DEVICE
Auxiliary input means include eye tracking, head tracking, and voice input devices [74]. Eye tracking changes the viewpoint by predicting eye movement when users move their eyes without turning their heads. It also reveals which object the user is paying attention to. It has the advantage of reducing the image-processing load by generating high-resolution images only in the region where the user is focused, using a foveated method. Overlaying the display on the arm is more stable than displaying it in the air, because the display is repeatedly provided at a location the user can predict [70]. Voice input is advantageous for processing long texts and conversations where only a virtual keyboard is available or input is otherwise limited.
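The gaze-dependent resolution idea can be sketched as follows. This is a minimal illustration and not the method of [70]: the three-tier layout, the threshold angles, and the small-angle conversion from pixels to degrees (`pixels_per_degree`) are assumptions chosen only to show how eye-tracking output can reduce rendering load.

```python
import math


def eccentricity_deg(pixel, gaze, pixels_per_degree):
    """Approximate angular distance (degrees) between a pixel and the gaze point."""
    dx = pixel[0] - gaze[0]
    dy = pixel[1] - gaze[1]
    return math.hypot(dx, dy) / pixels_per_degree


def resolution_scale(ecc_deg, fovea_deg=5.0, mid_deg=15.0):
    """Fraction of full resolution to render at a given eccentricity.

    The tiers and thresholds are illustrative assumptions.
    """
    if ecc_deg <= fovea_deg:
        return 1.0   # full resolution where the user is looking
    if ecc_deg <= mid_deg:
        return 0.5   # half resolution in the near periphery
    return 0.25      # quarter resolution in the far periphery


gaze = (960, 540)            # gaze point reported by the eye tracker, in pixels
ppd = 20.0                   # assumed display density in pixels per degree
for pixel in [(960, 540), (1160, 540), (0, 540)]:
    ecc = eccentricity_deg(pixel, gaze, ppd)
    print(pixel, round(ecc, 1), resolution_scale(ecc))
```

A production renderer would apply such a scale per tile or per shading-rate region rather than per pixel; the per-pixel form is used here only to keep the example short.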

4) MOTION INPUT DEVICE
To make effective use of the physical sense of space or gravity, auxiliary devices such as body trackers and treadmills are used to provide accurate motion information. Motion input devices are also divided into passive and active methods. The passive method delivers sensations to the user according to a fixed scenario, and the active method provides appropriate feedback based on the user's behavior. These devices are used in various forms to give realism, from simple walking to 360-degree rotation. Because there is a risk of injury to the user, treadmills are used together with a waist restraint.

B. SOFTWARE COMPONENTS (RECOGNITION AND RENDERING)
Cognitive illusion plays an essential role in immersion, bridging the objective reality of the physical space and the subjective reality that users feel. There are two types of cognition: static cognition and dynamic cognition. Static cognition concerns the proprioceptive senses (e.g., sight, hearing, and touch), while dynamic cognition concerns sensory balance and body movement [75]. In dynamic cognition, adaptation, attention, and behavior are important features.
According to its target, cognition can be divided into cognition of the environment and cognition of objects. In the Metaverse in particular, it is important to reduce distortion in detection and recognition. Methods for mitigating distortion include changing the shape of the kernel, changing the representation, and increasing the input. Targets of object recognition include faces, poses, gestures, and gazes related to the body. Such object recognition goes through the process of sensing, recording, recognizing, and tracking.
There are two types of stimulation: remote and proximity stimulation. There are bottom-up and top-down approaches to perceiving stimuli. A concept of perception that is distinct from this intuitive sense is also needed. The unconscious approach and the conscious approach are distinguished by whether movement differs under repeated recognition. There are instinctive, behavioral, contemplative, and emotional processing methods.
FIGURE 5. Scene rendering for visual language navigation with three-dimension [75].
The avatar is an important entity in the Metaverse; it is created, and its actions are imitated, using animation. Vision-based models estimate human poses, recognize hand gestures, and predict gaze. To predict gaze, iris detection, facial contours, and 3D gaze prediction are used.

1) SCENE AND OBJECT RECOGNITION
Object recognition is the process of recognizing the size, shape, position, brightness, and colors of objects according to distance. For scene recognition and object recognition, novel methods (e.g., modal alignment, cross-modal attention, point clouds, and scene graphs) are used, as shown in Fig. 5 [75]. Scene recognition identifies the state of the current scene and the components and configuration it has. In sub-graph-based scene graph generation, object pairs are clustered into sub-graphs that share representations [76]. Scene graphs are a good approach to supplementing explainability, which has emerged as a limitation of neural network models. Some studies use generative methods and scene graphs to distinguish bodies in overlapping situations and to predict human postures behind walls.
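To make the scene graph representation concrete, the following minimal sketch stores a scene as objects with attributes plus (subject, predicate, object) relation triples. The class name, method names, and the toy scene are hypothetical and are not taken from [75] or [76]; the sketch shows only the data structure, not graph generation from images.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)    # object id -> attribute dict
    relations: list = field(default_factory=list)  # (subject, predicate, object) triples

    def add_object(self, obj_id, **attrs):
        """Register an object node with arbitrary attributes."""
        self.objects[obj_id] = attrs

    def relate(self, subject, predicate, obj):
        """Add a directed relation between two registered objects."""
        if subject not in self.objects or obj not in self.objects:
            raise KeyError("both endpoints must be added as objects first")
        self.relations.append((subject, predicate, obj))

    def query(self, predicate):
        """Return all (subject, object) pairs linked by the given predicate."""
        return [(s, o) for s, p, o in self.relations if p == predicate]


graph = SceneGraph()
graph.add_object("person_1", category="person", pose="standing")
graph.add_object("wall_1", category="wall")
graph.add_object("chair_1", category="chair", color="red")
graph.relate("person_1", "behind", "wall_1")
graph.relate("chair_1", "next_to", "wall_1")
print(graph.query("behind"))
```

Because every relation is an explicit triple, a downstream decision (for example, inferring that a person is occluded) can be traced back to the triple that supports it, which is the explainability benefit noted above.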
Object recognition is also important along with scene recognition, and attention must be paid to human-centered scene analysis and non-contact interaction (e.g., gaze, gesture, pose). When many objects are recognized using individual object detection, the amount of computation increases in proportion to the number of objects, so abstraction is used to reduce the computational burden. In particular, some studies (e.g., world models and MONET) abstract multiple objects into representations for fast object recognition and efficient training [77].

2) SOUND AND SPEECH RECOGNITION
Recognizing sounds and processing speech help users understand their surroundings and communicate with other avatars. Conversation is a direct method of communicating with other avatars and giving instructions to NPCs in the Metaverse. Because users connect to the Metaverse from various environments, technology is needed that separates the user's own voice from surrounding noise. In addition, the loudness of a sound varies with distance. For a realistic environment in the Metaverse, voice recognition technology is needed that considers the surrounding environment while adjusting the volume according to distance.
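As a simple illustration of distance-dependent loudness, the sketch below applies the inverse-distance law for a point source in a free field, under which the level drops by about 6 dB each time the distance doubles. The function names and the 1 m reference distance are assumptions; real virtual environments typically add reverberation, occlusion, and rolloff tuning on top of this.

```python
import math


def distance_gain(distance_m, reference_m=1.0):
    """Linear amplitude gain under the inverse-distance law.

    Distances shorter than the reference are clamped so the gain never exceeds 1.
    """
    return reference_m / max(distance_m, reference_m)


def gain_to_db(gain):
    """Convert a linear amplitude gain to decibels."""
    return 20.0 * math.log10(gain)


for d in (0.5, 1.0, 2.0, 4.0):
    g = distance_gain(d)
    print(f"{d:>4} m  gain={g:.2f}  level={gain_to_db(g):.1f} dB")
```

The same gain can be applied to a remote avatar's voice before mixing, so that nearby speakers are heard more clearly than distant ones.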

3) SCENE AND OBJECT GENERATION
Methods of generating environments and objects in the Metaverse are divided into those that depict the real world and those that create a new imaginary environment. A realistic way to reflect the real-world environment is to reproduce famous places (e.g., museums, the Eiffel Tower) and places familiar to individuals (e.g., home, school). Alternatively, a hard-to-reach environment (e.g., underwater, Mars) can be created to provide a surreal experience. People and things are the main targets of object generation. Object generation modules create avatars and NPCs of any desired human shape (e.g., a celebrity, a family member) as conversation partners, focusing on facial expressions and natural joint movements for fluent multimodal conversation. They also generate realistic objects, expressed in enough detail that the texture of objects existing in reality can be felt. Another type of object is imaginary animals (e.g., unicorns, dragons) and anthropomorphic objects (e.g., talking chairs) that do not exist.

4) SOUND AND SPEECH SYNTHESIS
Sound synthesis gives the user a sense of immersion, but research on it is insufficient compared to vision. It creates sound in the space to give a feeling of presence and to increase the sense of immersion. In particular, a voice suited to each character is an important means of expressing the character's persona. Work on Tacotron, a speech synthesis model, focuses on the fact that speakers can use prosody to emphasize words or express uncertainty [78]. Prosody is the variation in the speech signal that remains after accounting for other sources of variation (e.g., phonetics and channel effects); this subtractive definition captures the meaningful aspects of an utterance so that they can be transferred [79].

5) MOTION RENDERING
CNNs and global context encoding are used to capture asymmetric dependencies and context patterns between objects in real-time multi-party 3D motion capture and pose estimation [80], [81]. A graph reflects the structural characteristics of the body so that the meaning of an action can be interpreted more accurately when human bodies overlap. Although it is possible to capture real-time 3D motion in difficult scenes with a single color camera and to isolate human body structures (e.g., shaking hands), capturing close interactions (e.g., hugs) is still limited.

C. CONTENTS (SCENARIO AND STORY)
Content is the fundamental component that maintains the Metaverse and is used to provide an immersive experience through well-organized stories and user-created events. In content, story reality, immersive experience, and conceptual completeness are important. There are two ways to create content: a paradigm-shift method and a method that reuses existing content. The areas that require environment design are scenes, color and lighting, audio, sampling and aliasing, environmental navigation, and real-world content. User motions, characters, and the personas of avatars affect behavioral modeling.
Wang et al. [82] introduced studies that process panoramic images and videos in virtual 3D scenes using CNNs and GANs to generate and explore VR content. Generated sentences and images have become more natural than before, but sentence patterns with similar meanings are sometimes repeated and content is evaluated superficially. Also, the longer the sentence, the lower the focus and consistency of the overall content composition. A structured approach (e.g., a graph network) has been proposed to keep the scenario cohesive and enrich the details of the story.

1) MULTIMODAL CONTENT REPRESENTATION
In the Metaverse, users create large amounts of multimedia content (e.g., images and videos) as well as text via an avatar. The multimedia data generated in this way expresses the user's thoughts and experiences more fully than simple dialog. To handle multimodal content effectively, there are two approaches: an alignment method that converts data from one modal type into another, and a method that integrates data of different modal types into one representation [83]. Multimodal content enriches the content by adding information from other modalities and supplementing the information that a single modality lacks. Learning these cross-modal features has the advantage that both intra-modal and inter-modal semantic relationships are utilized.

2) AGENT PERSONA MODELING
In the Metaverse, multiple agents need to have different personas, just as each person has a personality, and they must be able to interact in different ways at the same time. It is difficult to give the user a sense of immersion with a character who has a similar conversation every time.
Metaverse needs a persona model that expresses various multimodal expressions (e.g., gestures and facial expressions) as well as conversations (e.g., Persona chat). Because spoken language understanding (SLU) uses persona information without losing the pitch that is discarded when a voice signal is converted into text, it captures a more accurate meaning than cascaded conversion models can.
Although Metaverse users create a lot of user data, entity augmentation and persona generation are important because a relatively large amount of training data is required to avoid the cold start problem and the sparsity of various NPC (non-player character) personas. In particular, unevenly distributed user-generated data (e.g., conversation history and personal experience) is biased towards a particular subject until enough data has been gathered.
An entity is a uniquely identified unit (e.g., the name of a famous place and person) and is associated with other entities and relationships. Entity-based expansion is a way to enrich user personas by increasing the number of entities. Methods for increasing the number of entities include generative models, reinforcement learning, joint inference, ontology, and multiple entity extension methods using intermodal [84]- [93]. When tagged resources are scarce, entities can be further extracted through co-learning with data from other models. Using a pre-trained model is a good way to extend entities based on balanced data.
When creating Metaverse NPCs, novel approaches are needed to express persona and emotions that reflect the characteristic of worldview. Therefore, considering the scarcity of each modal, a data population is required to create a balanced persona used in various scenarios. Personas play an important role in giving users a sense of immersion by giving each character a personality in the Metaverse. In particular, it is necessary to implement characters with multi-personas like humans.
When constructing a conversational model, agents have traditionally responded based on training data for correct answers rather than personality. Because such monotony makes it difficult to maintain long conversations, some researchers proposed to maintain a consistent conversation pattern by introducing the persona concept. For the dialogue system to sustain longer and more human conversations, empathic dialogue systems consider personas [94], [95]. However, dialogue data is insufficient to create large persona representations. Personal Dialog is a large multi-turn dialogue dataset based on sequential conditional GANs containing different characteristics of different speakers (e.g., age, gender, location, interest tags) [96]. Textual story creation focusing on persona is also proposed [97]. The conditional language model generates various forms (e.g., wiki, horror, humor) of sentences from prefixes without retraining [98].

3) MULTIMODAL ENTITY LINKING AND EXPANSION
When defining characters and entities by transforming the modal of various data, it is necessary to redefine and extend the relationship between the contents through the connection between entities. In expressing the growth process of diverse events and characters in the Metaverse, causality is important in understanding the events and connecting them to the story. Entity linking is the process of linking related entities based on the similarity between entities and the probability distribution of related contents. Methods for connecting entities by structural learning include link prediction, nonlinear relationships, joint inference, and relationship classification methods [99]- [108]. In particular, the graph model shows how they are related to each other by using the relationship between entities, which are units of meaningful information. The connecting relationship is improved by explicitly modeling the interdependencies between object instances in the scene graph [109]. It also improves the performance of relational inference by encoding global context and geometric layout. Research on graph models and graph convolution networks is increasing to extend the links of stories [110]- [113].
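As a minimal sketch of similarity-based entity linking, the following Python snippet links two entities whenever the cosine similarity of their feature vectors exceeds a threshold. The entity names, the toy vectors, the threshold of 0.8, and the helper names `cosine` and `link_entities` are illustrative assumptions and are not taken from the cited studies, which typically learn both the embeddings and the link predictor.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_entities(embeddings, threshold=0.8):
    """Return (entity_a, entity_b, similarity) for every pair above threshold."""
    links = []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if sim >= threshold:
            links.append((a, b, round(sim, 3)))
    return links


# Hypothetical entity embeddings (e.g., produced by text or image encoders).
entities = {
    "castle": [0.9, 0.1, 0.0],
    "fortress": [0.8, 0.2, 0.1],
    "spaceship": [0.0, 0.1, 0.9],
}
links = link_entities(entities)
```

In this toy setting only "castle" and "fortress" are linked; a learned link-prediction model would replace the fixed threshold.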
Since the Metaverse contains various worldviews, simply connecting objects is not enough, so the process of expanding and inferring links between entities is required. Inferring information based on given data is an important issue in enriching content. In particular, facts are derived by connecting hierarchically connected things based on causal relationships. Inference methods used in existing studies include variational inference, various modalities, ontology, emotion, and knowledge [114]- [121].

4) SCENARIO GENERATION
Rather than listing events in the Metaverse, it is important to find hidden relationships based on causal relationships between events and themes and construct a scenario line based on them. Unlike the text-based scenario, the Metaverse is more complex because it has to be configured in multi-modal and embodied environments. Each entity and relationship are used to organize events, and events must be organically combined to form scenario lines. Scenario lines construct the overall structure and serve as an index linking each event. Because it is not just a list of events, the entities and relationships in each event are linked together based on long-term dependencies.
In order to compose a scenario line, it is necessary to connect events composed of entities and their relationships using a graph model. Events are divided into main events and subevents according to their importance in the progress of the scenario. Scenario construction methods include continuous sequences, hierarchical structures, and the attention-based method by focusing on noteworthy content [122]- [132].
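The idea of ordering events in a graph and separating main events from subevents can be sketched as follows. This is only an illustration under stated assumptions: the event names, the importance scores, and the threshold of 0.7 are hypothetical, and a plain topological sort stands in for the learned scenario construction methods cited above.

```python
from graphlib import TopologicalSorter

# Hypothetical events: name -> (importance, set of events that must occur first).
events = {
    "arrive_in_town": (0.9, set()),
    "meet_merchant": (0.4, {"arrive_in_town"}),
    "find_map": (0.8, {"arrive_in_town"}),
    "enter_dungeon": (0.95, {"find_map"}),
    "buy_torch": (0.3, {"meet_merchant"}),
}


def build_scenario_line(events, main_threshold=0.7):
    """Order events causally, then split them into main events and subevents."""
    dependencies = {name: deps for name, (_, deps) in events.items()}
    order = list(TopologicalSorter(dependencies).static_order())
    main = [e for e in order if events[e][0] >= main_threshold]
    sub = [e for e in order if events[e][0] < main_threshold]
    return main, sub


main_events, sub_events = build_scenario_line(events)
```

Here the main events form the scenario line (arrive, find the map, enter the dungeon), while the subevents remain attached to it through their dependencies.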
When user behavior data in a scenario graph is accumulated over the lifetime of the avatar, it is extended to the concept of life logging. Key scenario topics are extracted with topic modeling, and personalized multimodal user data is summarized with generative language models. Yu and Riedl [123] introduced a drama manager who personalizes user stories with plots and optimal sequences. Bolanos et al. [133] described the visual life logging of storytelling by time slicing, summarizing, and retrieving important information. Li et al. [134] proposed StoryGAN, a story-image sequence generation model that sequentially visualizes stories by generating one image sequence for each sentence.

5) SCENARIO POPULATION
Scenarios expand by adding entities and linking the added entities through relations. Scenario lines form a skeleton, and entities and links are added to events to create rich stories. Connections between events are formed by relationships and are linked within a scenario. Entity expansion methods include translation embedding, attention, bidirectional inference, and relational inference [135]- [140].
In the process of scenario graph population, modal conversion (e.g., text-to-video and video-to-text conversion) is used for multimodal integration. After pairing sentence nodes with images in a hierarchical approach, it adjusts the length of events through event summarization. The generated multi-mode scenario graph can be used to expand or collapse events. Each event is summarized as representative images with a multimodal language model.

6) SCENARIO EVALUATION
In an event-based extended scenario, inconsistencies between events occur as the scenario lengthens, so it is necessary to verify periodically that the scenario does not conflict in concept. By instantiating the scenario graph, it hierarchically enlarges and contracts to verify that each event is organically connected and there is no contradiction. Scenario verification is divided into a synthetic method based on grammar and a method directly verifying visualized graphs using human-defined metrics [141]- [146]. Human-defined metrics are divided into a structural approach and a search-based approach. The structural approach evaluates whether the overall composition of the scenario is balanced, while the search-based approach looks up specific facts with user queries to ensure that the scenario is well-formed without contradictions.

D. DISCUSSION AND OPEN CHALLENGE
Users can also suffer from simulated motion sickness (i.e., cyber motion sickness) due to the imbalance of visual information obtained from human organs and eyes. There are focusing-displacement collisions and binocular-occlusion collisions, which may have side effects (e.g., blinking). There are other issues (e.g., physical fatigue, headset weight, movement injuries, hygiene issues from prolonged wear) and some side effects (e.g., thin motion sickness, vector motion sickness, eye fatigue, and seizures). In order to reduce side effects, postural stability and physiological measurement methods are used to measure the degree of motion sickness. In addition, adaptive optimization based on the measured values and stabilizing using stable cues have been proposed. Beyond that, there are alternative methods to minimize leading indicators, visual acceleration, and rotation.
Cognitive stability and homeostasis are important for effective service in the Metaverse. Recently, in order to support a more realistic sense, the scope of Metaverse has been expanded to smell and taste perception. On the other hand, interest in recognizing a complex sense by combining these senses is also growing.
In order to process large amounts of real-time data, fast rendering and data analysis are required. The speed of image processing is essential because the 360-degree field of view is taken into account. Therefore, it is necessary to reduce the delay time through predictive tracking and measurement when rendering the object.
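One simple way to anticipate tracking data before rendering is to extrapolate the most recent pose over the expected latency. The sketch below assumes constant angular velocity between the last two samples; the function name `predict_yaw`, the sample values, and the 20 ms latency are hypothetical, and production systems generally use more elaborate filters.

```python
def predict_yaw(samples, latency_s):
    """Extrapolate head yaw (degrees) latency_s seconds past the last sample.

    samples: list of (timestamp_s, yaw_deg) pairs, oldest first.
    Assumes the angular velocity between the two most recent samples
    stays constant over the prediction horizon.
    """
    (t0, y0), (t1, y1) = samples[-2], samples[-1]
    velocity = (y1 - y0) / (t1 - t0)  # degrees per second
    return y1 + velocity * latency_s


# Hypothetical tracker readings taken 10 ms apart, rendered 20 ms later.
readings = [(0.00, 10.0), (0.01, 10.5)]
predicted = predict_yaw(readings, latency_s=0.02)
```

Rendering for the predicted yaw rather than the last measured one hides part of the motion-to-photon delay, at the cost of overshoot when the head changes direction abruptly.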
Users are able to decide whether they want to organize their scenarios in a simple, concise summary format or as events in long, complex plots. The depth and length of the scenario are determined using modal and density transforms. Metaverse scenarios use hierarchical and causal relationships to organize events and adjust their length using techniques that adequately summarize the content of a sentence or paragraph. The resolution of the text can be interpreted in terms of the summary, the summary of a scene can consist of a panorama with multiple scenes connected, and it can represent an important timeline among multiple scenes [147]. In terms of scenario construction, studies to find connections between scattered entities include clustering [148], planning-based conditional branching [149], time-dependent index [150], and visual analysis method [151].
The completeness of the content is also an important factor in the Metaverse. For example, in Ready Player One, many interesting characters that dominated popular culture appeared, creating the illusion of going back to that time. However, the story development and plausibility were weak compared to the splendor of the visuals.

IV. METAVERSE APPROACHES
A. USER INTERACTION
Natural interaction is an essential condition for increasing immersion in the Metaverse. It can reproduce the faces of friends and celebrities to enable realistic interactions and give users the illusion of being in familiar and famous places. Temporary dissociation, concentration, and heightened enjoyment are important factors in the interaction, and feelings of control, curiosity, and intrinsic motivation are used. The target of interaction is mainly human, and hands are an important feature. Input devices are broadly divided into hand-held devices and non-hand input devices. Fidelity, proprioception, and egocentric view are important for interactions on physical devices. Since a 360-degree field of view is used as the receptive field for spatial recognition, many images must be processed and distortion correction is required, so video processing efficiency matters. In order to reduce motion sickness and fatigue, methods that resolve conflicts between visual and bodily senses, as well as alternative sensory methods, are needed. It also requires multimodal sensory perception that handles speech, gestures, and dialog flows.

1) LANGUAGE INTERACTION
Conversation is a basic approach to deliver user intent via voice recognition. In other words, language is used in various places because it concisely describes complex situations in an implicit sense. It is necessary to create a Metaverse environment that supports understanding the situation through language, abstraction, QA, and translation. Miller et al. [152] proposed ParlAI, an integrated framework for training and testing conversational models using multitasking training, data collection, human assessment, and online RL. ParlAI performs various tasks in the same interface with dialog datasets (e.g., SQuAD, bAbI task, MCTest, WikiQA, QACNN, QADailyMail, CBT, bAbI dialog, Ubuntu, OpenSubtitles, VQA).
Languages are used in the RL domain as an effective way to define goals and abstract human-comprehensible tasks [153]- [155]. Some agents classify instructions into a single skill level by mimicking human behavior [156]. When the agent is faced with an ambiguous situation, the agent clarifies the instruction intention through a multi-turn conversation with the Oracle [157]. Jiang et al. [158] used a language that is flexibly applied to the generalization of various goals, rapid training, and combinations as an abstraction to solve the difficulty of generalized abstraction in hierarchical RL.
AQM (Answerer in Questioner's Mind) agents ask more consistent questions to maximize information acquisition in task-oriented conversational systems [159]. Knowledge Graph A2C (KG-A2C) is a scalable exploratory method for inferring game states in a template-based workspace using linguistic behaviors and dynamic knowledge graphs [160]. Translation is an essential method in the Metaverse environment where people of various languages are gathered. Domhan et al. [161] proposed joint training on a large number of unpaired languages and a small number of language pairs to improve neural machine translation (NMT) performance.
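The notion of asking the question that maximizes information acquisition can be illustrated with a small expected-information-gain computation. This is a simplified sketch, not the AQM implementation: it assumes a uniform prior over hypothetical candidate objects and a deterministic answerer, whereas the cited work approximates the answerer with a learned model.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(candidates, prior, answer_fn):
    """Expected entropy reduction over candidates after asking one question.

    answer_fn maps a candidate to the (deterministic) answer it would yield.
    """
    mass = Counter()
    for c in candidates:
        mass[answer_fn(c)] += prior[c]
    posterior_entropy = 0.0
    for answer, m in mass.items():
        posterior = [prior[c] / m for c in candidates if answer_fn(c) == answer]
        posterior_entropy += m * entropy(posterior)
    return entropy(prior.values()) - posterior_entropy


# Hypothetical guessing game: which object does the user have in mind?
objects = {
    "red_ball": {"color": "red", "shape": "ball"},
    "red_cube": {"color": "red", "shape": "cube"},
    "blue_ball": {"color": "blue", "shape": "ball"},
    "green_cube": {"color": "green", "shape": "cube"},
}
prior = {name: 1 / len(objects) for name in objects}
gains = {
    attr: expected_information_gain(objects, prior, lambda c, a=attr: objects[c][a])
    for attr in ("color", "shape")
}
best_question = max(gains, key=gains.get)
```

Asking about color is preferred here because it splits the candidates into more, smaller groups than asking about shape.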

2) MULTIMODAL INTERACTION
Humans facilitate efficient adaptation and reason more abstractly by transferring knowledge across tasks. People communicate not only through dialogue but also through multimodal information (e.g., facial expressions, gestures, and tone of voice). Handling each modality separately makes it difficult to capture multiple complex emotions, so multimodal interaction is required. In general, multimodal data carries more information than unimodal data and is advantageous for understanding the situation. Text and images in social media posts do not have the same meaning but instead have more complex meanings that intersect semantically [162]- [164]. In particular, multimodal learning is most effective when the meanings of images and text are different.
After the advent of Transformers, studies have been conducted to learn vision and language together and to avoid learning from scratch by using a pre-trained model. Zhou et al. [165] proposed a unified vision-language pre-training (VLP) model using a shared multi-layer transformer that is fine-tuned for vision-language generation.

3) MULTI-TASK INTERACTION
Since the Metaverse handles many things in the cyber world, a model that handles multiple tasks simultaneously is useful in the aspect of complexity. For such a model, knowledge distillation is used to make a small model that performs many functions and handles other modal types (e.g., Visual QA). Hessel et al. [166] argued that multitasking is more complex than single-tasking because the multitasking model balances various tasks in limited expression. It is relatively easy to use for similar tasks but easily overfits when target domain data is scarce and has a different distribution.
E2E methods are also used to perform various tasks effectively. Translatotron [167] translated from voice input to voice output through a sequential process. Compared to the cascaded model, the E2E model has the advantage that most of the inputs can be utilized without data loss in the process. Translatotron interprets a foreign language, including its unique pronunciation and emotional meaning. Also, it has the advantage of responding in a voice form that reflects the prosody of the actual speaker. Qian et al. [168] proposed an E2E SLU modeling method for a cloud-based modular spoken dialog system (SDS), showing that it is effective in situations with low ASR accuracy.

4) EMBODIED INTERACTION
The difference between the Metaverse and other general interactions is that the proportion of embodied interactions (e.g., embodied QA and visual language navigation) is relatively high. While the required skills are similar to EQA and VLN, there is a difference in whether the subject is active or passive. While the purpose of VQA is to answer text questions about a given image, EQA (embodied question answering) performs the task of analyzing sensor information obtained by an embodied agent through active exploration. For example, to answer a question about the color of a car at a distance, the agent actively moves, recognizes, and responds based on prior knowledge of the car's location and path [169]. These EQA tasks have recently been extended in the form of conversations, where agents compensate by querying oracles for insufficient information to perform the task [157].
The factor that differentiates embodied interaction from 2D-based methods is exophora resolution. Anaphora resolution is the task of identifying the word in a preceding sentence that a pronoun points to [170], [171]. Anaphora and co-reference resolutions are used to infer cross-references in questions and conversations [172], [173]. In short sentences and implied conversations, anaphora resolution is needed to understand the context of the conversation. Recently, anaphora resolution has been widely applied to multi-modal content (e.g., video) and SNS services (e.g., Twitter) beyond simple sentence-based analysis [174], [175]. Exophora resolution maps the meaning of the co-reference resolution and anaphora resolution used in language to 3D space.
People also communicate information in a non-verbal form by pointing to an object instead of using language. When a user points to a specific location with a finger, it becomes an intended instruction. In the case of exophora resolution, specific instructions are performed in terms of multimodal interaction, including motion and speech, whereas anaphora simply links meaning between texts. Heinrich et al. [176] proposed Embodied Multi-modal Interaction in Language learning (EMIL), a neurocognitive model that reflects in vivo-inspired mechanisms (e.g., an implicit adaptation of time scales).
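A pointing gesture combined with an utterance such as "that one" can be resolved geometrically by choosing the object that lies closest to the pointing ray. The following is a minimal sketch of that idea only; the scene layout, the 15-degree tolerance, and the function name `resolve_pointing` are hypothetical, and the snippet is not a reconstruction of any cited model.

```python
import math


def resolve_pointing(origin, direction, objects, max_angle_deg=15.0):
    """Return the object nearest to the pointing ray, or None if none is close.

    origin and direction are 3D tuples for the hand position and pointing
    direction; objects maps object names to 3D positions.
    """
    def angle_to(target):
        to_obj = [t - o for t, o in zip(target, origin)]
        dot = sum(a * b for a, b in zip(direction, to_obj))
        norm = math.hypot(*direction) * math.hypot(*to_obj)
        if norm == 0:
            return 180.0
        # Clamp to guard against rounding slightly outside [-1, 1].
        return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

    best = min(objects, key=lambda name: angle_to(objects[name]))
    return best if angle_to(objects[best]) <= max_angle_deg else None


# Hypothetical scene: the user stands at the origin and points along +x.
scene = {
    "lamp": (2.0, 0.1, 0.0),
    "door": (0.0, 3.0, 0.0),
    "chair": (-1.0, 0.0, 0.0),
}
referent = resolve_pointing((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), scene)
no_referent = resolve_pointing((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene)
```

The resolved referent can then replace the demonstrative in the spoken instruction, which is the step that text-only anaphora resolution cannot perform.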

B. METAVERSE IMPLEMENTATION
The process of Metaverse implementation is divided into a design phase, a model-training phase, an operation phase, and an evaluation phase. The design phase considers goals and concept design, development time and cost, risk estimates, constraints, user scenarios, scope and requirements, and feasibility of implementation and evaluation. In the model-training phase, data analysis, user modeling, scientific methodology, iterative learning, and parameter tuning are performed. The operation phase considers system considerations, simulations, job scheduling, network environments, and prototype demonstrations. The evaluation phase deals with content fidelity, the authenticity of interactions, implementation feasibility, and failover.
This survey covers three types of methods for Metaverse training models: multimodal inference, RL-based approaches, and lifelong learning. In addition, it is necessary to consider multi-agent optimization, integration optimization, and operational considerations from the perspective of Metaverse service operation.

1) MULTIMODAL INFERENCE
Humans do not only interpret the meaning of utterances when communicating with others. When information is given from the cognitive model, a person interprets its meaning, combines it with existing knowledge, and infers the speaker's intentions. Verbal ambiguity is compensated to determine the speaker's underlying intentions based on direct or indirect representations of the surrounding environment. For example, emotion recognition, the initiator of emotional interaction, uses multimodal fusion to compensate for the lack of context in textual information [177]. Multi-modal models do not always outperform single-mode models, so they should be utilized according to the situation. Zhang et al. [178] used late fusion to explicitly examine the impact of each feature by considering three types of features: visual, spatial, and semantic. Liang et al. [179] proposed Multimodal Local-Global Ranking Fusion (MLRF), a relative sentiment analysis method for complex combinations of visual and acoustic signals. Rather than simply classifying emotions as scalar values, the ranking was performed after measuring the degree of increase or decrease in emotional intensity for partial video segments.
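Late fusion, in its simplest form, combines the outputs of separately trained unimodal classifiers instead of their input features. The sketch below averages per-modality class probabilities with fixed weights; the modality names, probabilities, and equal weights are hypothetical, and the cited studies use learned and more elaborate fusion schemes.

```python
def late_fusion(modal_probs, weights):
    """Weighted average of per-modality class probabilities."""
    total = sum(weights[m] for m in modal_probs)
    labels = next(iter(modal_probs.values())).keys()
    return {
        label: sum(weights[m] * probs[label] for m, probs in modal_probs.items()) / total
        for label in labels
    }


# Hypothetical outputs of three unimodal emotion classifiers for one utterance.
predictions = {
    "text": {"happy": 0.5, "angry": 0.5},  # the text alone is ambiguous
    "visual": {"happy": 0.2, "angry": 0.8},
    "audio": {"happy": 0.3, "angry": 0.7},
}
weights = {"text": 1.0, "visual": 1.0, "audio": 1.0}
fused = late_fusion(predictions, weights)
label = max(fused, key=fused.get)
```

Because each modality contributes a separate term, the effect of removing or reweighting one modality is easy to inspect, which is one reason late fusion is convenient for analyzing the impact of individual features.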
The advantage of the pre-trained model is that it simplifies the task with E2E and does not have to learn from scratch.
Recently, DialoGPT and VL-BERT have been proposed to implement dialog and visual-language tasks more conveniently. Large-scale pre-trained language models (PLMs) (e.g., Bidirectional Encoder Representations from Transformers (BERT), GPT-3) are used for downstream tasks by applying fine-tuning and few-shot learning [180]. Sun et al. [181] proposed MobileBERT to compress and accelerate the BERT model. MobileBERT is a student model trained by knowledge transfer from the teacher BERT large model. Brown et al. [182] proposed GPT-3, an autoregressive language model with 175 billion parameters that is applied to downstream tasks in a few-shot setting without any gradient updates or fine-tuning.
Tan et al. [183] proposed contextual mapping of language tokens and associated images with vokenization and multimodal alignment. It is applied to a relatively small image caption dataset using the generated model. T5, which integrates text tasks into one model, is proposed to handle translation, question-and-answer (QA), etc. [184]. Video pre-trained models (VPMs) that contain multimodal data of vision and text are effectively used in low-complexity downstream tasks (e.g., VideoBERT [185], ViLBERT [186]). VPMs are used for answering visual questions, common sense reasoning, reference representation, and caption-based image retrieval. BERT-based VPM performs vector quantization of video data and trains bidirectional joint distributions for visual and verbal token sequences.
Some studies give a sense of presence in Metaverse. Domain knowledge understanding provides a more detailed response based on facts. Spoken language understanding deals with the user's tone and emotions. Acoustic signal understanding recognizes and generates sounds from the surrounding environment. Reasoning (e.g., multi-hop reasoning, relational reasoning, and graph reasoning) derives new facts from prior knowledge, background knowledge, and environmental factors given in the current situation.
Multi-hop inference in graph neural networks (GNN) has been used to generate new knowledge about vision and language [187]- [192]. Because graphs, along with KB, act as a repository of knowledge, it is important to effectively utilize encoding, sampling, and utilization in visual language interactions. GCN is a representative model for training representations of attribute graphs. Graph inference trains fixed representations of entities in multiple relational graphs, which are generalized to infer unseen entity relationships during inference. Various approaches are proposed to improve graph reasoning [193], [194].

2) RL-BASED APPROACHES
Multi-agent RL, Imagination-augmented RL, and Language-grounded RL are utilized in Metaverse because RL is suitable for action in a situation without prior learning. Multi-agent RL provides realistic NPCs by causing collaboration and disputes among various agents. Imagination-augmented RL has the feature of rapidly stabilizing without enormous training data, and language-based RL is used for conversation.
Technically, RL is a method to achieve an objective goal by determining the behavior that will receive the maximum reward based on the state received from the environment. It is divided into model-based RL and model-free RL according to the existence of a model for a task. It is also divided into a value-based method and a policy-based method according to the training method. The on-policy method trains on data generated by the same policy that is being improved, whereas the off-policy method trains the target policy on data generated by a different behavior policy (e.g., stored past experience). Compensation methods (e.g., episodic memory, world model, and language-based RL) have been proposed to solve the problems of sampling inefficiency and sparse rewards in RL. Furthermore, more efficient approaches (e.g., offline RL and control RL) are emerging to solve fundamental problems (e.g., sample inefficiency, unstable training). Unlike traditional off-policy RL and model-based RL, offline RL uses only pre-collected training data, not online results. Offline RL shows reliable learning with batch training and good performance in a closed-loop environment.
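To make the model-free, value-based, off-policy combination concrete, the following sketch runs tabular Q-learning on a hypothetical five-state corridor in which only reaching the rightmost state is rewarded. The environment, the hyperparameters, and the function name `q_learning` are illustrative assumptions and are not drawn from the cited works.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor; actions are 0 = left and 1 = right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Behavior policy: epsilon-greedy, random on ties.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Off-policy target: bootstrap from the greedy value of the next
            # state, regardless of which action the behavior policy takes next.
            bootstrap = 0.0 if next_state == goal else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + bootstrap - q[state][action])
            state = next_state
            if state == goal:
                break
    return q


q_table = q_learning()
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

The reward is sparse (a single rewarded transition), so value estimates propagate backward from the goal one state at a time, which is the sampling inefficiency that the compensation methods above aim to reduce.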
RL methods are steadily growing through knowledge sharing, memory, abstraction, and language bases. The Diversity Is All You Need (DIAYN) model learns useful skills without a reward function, just as humans navigate the environment without supervision [195]. DIAYN acquires skills by maximizing information-theoretic goals using a maximum entropy policy. Laskin et al. [196] proposed Contrastive Unsupervised Representations for Reinforcement Learning (CURL), which extracts high-level features from raw pixels using contrastive learning and performs off-policy control on top of them. Xavier et al. [197] proposed a watch-and-help (WAH) model that uses a single demonstration of an agent performing the same task to understand the task's goal and work with a human-like agent to solve a problem.

3) LIFE-LONG LEARNING
Life-long learning is meaningful because it builds experience points over a long period in a sustainable Metaverse. For such life-long learning, a method that effectively memorizes existing data and uses it at an appropriate time is required. Most solutions and services have a constant cycle. In order to apply lifelong learning to Metaverse, it is necessary to consider how to maintain long-term service. On the other hand, the key to life-long learning is how to handle the catastrophic forgetting that most neural net models have.
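One common family of remedies for catastrophic forgetting keeps a small memory of past data and replays it alongside new data. The sketch below shows such a memory built with reservoir sampling; it is offered as an illustration of the general idea rather than as a method named by this survey, and the class name `ReservoirMemory`, the capacity, and the task labels are hypothetical.

```python
import random


class ReservoirMemory:
    """Fixed-size memory holding a uniform sample of everything seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, item):
        """Store item, evicting a random stored item once the memory is full."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def replay(self, k):
        """Sample up to k stored items to mix into the current training batch."""
        return self._rng.sample(self.items, min(k, len(self.items)))


# Hypothetical stream: 1000 samples of task A followed by 1000 of task B.
memory = ReservoirMemory(capacity=50)
for i in range(1000):
    memory.add(("task_A", i))
for i in range(1000):
    memory.add(("task_B", i))
replayed = memory.replay(8)
tasks_in_memory = {task for task, _ in memory.items}
```

Because every item in the stream is retained with equal probability, samples from the earlier task stay available for rehearsal while the later task is being learned.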

4) MULTI-AGENT OPTIMIZATION
Relationships between multiple agents are divided into collaborative, competitive, and oracle relationships. In order to effectively utilize these relationships in multi-agents, it is necessary to introduce a mental model (e.g., the Theory of Mind (ToM), intrinsic motivation, and heterogeneous competition). Based on the concepts and experimental results of psychology and neuroscience, there have been attempts to solve the problem of neural networks. In particular, the theory of mind, inductive bias, and intrinsic motivation were effective methods in embodied visual language interaction [198]- [200].
Will et al. [201] used dopamine's reward prediction error theory to explain rich empirical phenomena. They provide an integrated framework for understanding the representation of rewards and values in the brain. They describe that the brain represents possible rewards as a probability distribution rather than a single scalar, and various future outcomes are spontaneously expressed in parallel. Episodic memory tracks functional and structural interactions between brain regions, particularly the hippocampus [202]. Episodic memory, unlike semantic memory, is a descriptive memory that contains information related to the time and place of acquisition. Gradient episodic memory reduces forgetting by transferring previous knowledge to evaluate model training on continuous data [203]. Oudeyer et al. [204] explained how psychology and neuroscience conceptualize curiosity and intrinsic motivation as intrinsic rewards for the brain's novelty, complexity, and information scale. Rabinowitz et al. [205] designed a Theory of Mind neural network, ToMnet, that uses meta-learning to build models of agents from observations of their behavior. Melhart et al. [206] investigated how the emotional mind theory of gameplay influences behavioral recognition, performance, and frustration behavior in facial emotion recognition tasks.
Cooperative multi-agent RL requires a distributed policy but has limitations in coordinating agent behavior in complex environments. Agents reconstruct each other's observations to generate common knowledge in distributed, collaborative multi-agent operations. It is also essential to study how to believe from the perspective of other agents and humans through collaboration [207]. Humans share potential minds (e.g., beliefs) and these social methods are important for recursive reasoning about the potential consequences of other avatar actions.
In order to effectively operate multiple agents, various optimization methods are proposed. Gated propagation networks improve training with attention and gating on graphs that propagate messages between prototypes of different classes and update them in memory of different classes [208]. Multimodal MAML modulates meta-trained prior parameters to enable fast adaptation and improved training on multimodal distributions [209].

5) INTEGRATION OPTIMIZATION
An integrated platform is needed to handle various modals and various events and interactions. Racanière et al. [210] proposed an I2A (Imagination-Augmented Agents) for deep RL combining model-free RL and model-based RL. ZEPETO is a platform that is completely provided in the form of a service, and Unity provides more freedom in which developers create the world they want.

6) OPERATION CONSIDERATION
Continuous service through human-centered design and multi-modal interaction is important from an artistic point of view and a scientific point of view based on design philosophy. Meta RL based on few-shot learning is used because real-time service analysis otherwise performs poorly. Graph RL using the structural characteristics of knowledge is also attracting attention. Because planning is essential to perform more complex scenarios, there are many studies on Planning RL. In order to provide stable service on the integrated platform, it is necessary to cope physically with network bandwidth limits and failures. In addition, measures against socially and politically sensitive issues (e.g., sanctions and hacking) are required.

C. METAVERSE APPLICATIONS
Most of the research on Metaverse is aimed at marketing and investment purposes, emphasizing social utility. The domains where Metaverse is popularly serviced are games and some office applications. Huggett [57] argued that there is a separation between the present reality and virtual reality of virtual heritage and conducted a study of existence and realism within virtual reality. Skarbez [211] introduced mixed reality and real-world modeling. For better Metaverse applications, an approach is needed to model and distinguish the differences and commonalities between virtual reality and reality.

1) SIMULATION
The Metaverse is being serviced in various forms of application. Simulation starts with games and is also used for social phenomenon research and marketing simulation. Because simulation has an educational effect, it is also used for education and museum visits. Simulations depicting real-world tasks are a universally available application in the Metaverse. General simulation is solution-dependent, but Metaverse simulation is performed inside the Metaverse, so it differs from general simulation. Maharg and Owen [212] conducted a study on application simulation for educational purposes. Siyaev and Jo [59] conducted a study on virtual assets and workflow control using aircraft engineer voice commands. In the case of a virtual environment based on the real environment, exaggeration and the intention of the creator can enter the process of describing the environment in the Metaverse. Shi et al. [213] studied the difference between the virtual and real environments by evaluating the agreement between a field survey and VR on the landscape.
Gordon et al. [214] proposed a hierarchical interactive memory network (HIMN) consisting of a factored set of controllers and operating at multiple levels of temporal abstraction. They also introduced IQUAD V1, which simulates realistic environments of indoor scenes that can be configured with interactive objects. Qiu et al. [215] proposed an object-driven visual search algorithm, MJOLNIR (memory-utilized co-hierarchical object learning for indoor room navigation), that learns how to associate objects with prior knowledge. Li et al. [216] proposed a MIND (Mental Imagery eNhanceD) module to model the dynamics of the environment and create objects for a better understanding of the implemented agent. Tamari et al. [217] described, from the viewpoint of embodied cognitive linguistics (ECL), that natural language is inherently executable and driven by metaphorical mappings and mental simulations to schemas learned through hierarchical organization and interaction.

2) GAME
Games are the most common platform in the popularization of the Metaverse. Beyond simply providing entertainment, games are also used as a way to simplify difficult tasks. Because payment and personal information are widely used in the Metaverse, a game based on blockchain technology has been proposed [10]. Hide and Seek is a simple yet effective simulation environment for multi-agent work that uses visual representations of objects and scenes from an egocentric perspective [218]- [221]. Baker et al. [218] found that agents create a self-supervised autocurriculum that drives multiple stages of new strategies in a multi-agent competitive environment (i.e., hide and seek). Stanica et al. [221] introduced Neurorehabilitation Exercises Using Virtual Reality (INREX-VR), an immersive neurorehabilitation system using virtual reality. The system captures real-time user movements in gamified environments and executes complex movements to encourage self-improvement and competition.

3) OFFICE
To supplement the sense of space that online B2B solutions and conferences lack, some companies have brought offline concepts into them. In this way, the sounds occurring in the office and its physical elements (e.g., desks and conference rooms) are given a sense of space. Representative office applications (e.g., Branch, Gather, and Teamflow) use spatial audio technology to render speech and footstep sounds according to distance. Branch adds a game element that offers virtual currency and experience. Teamflow has the advantage that work-related tools (e.g., file sharing) can be used in conjunction.
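The distance-dependent audio described for these office applications can be illustrated with a minimal sketch. The attenuation model actually used by Branch, Gather, or Teamflow is not stated in this survey, so the inverse-distance gain, the equal-power panning, and the names `distance_gain`, `stereo_gains`, `ref_dist`, and `max_dist` below are all assumptions chosen for illustration.

```python
import math


def distance_gain(listener, source, ref_dist=1.0, max_dist=20.0):
    """Volume multiplier in [0, 1] for a source on a 2D office map.

    Assumed model: full volume inside ref_dist, inverse-distance
    falloff beyond it, and silence at or beyond max_dist.
    """
    dist = math.hypot(source[0] - listener[0], source[1] - listener[1])
    if dist >= max_dist:
        return 0.0
    return ref_dist / max(dist, ref_dist)


def stereo_gains(listener, source, ref_dist=1.0, max_dist=20.0):
    """(left, right) gains, assuming the listener faces the +y direction."""
    dx = source[0] - listener[0]
    dist = math.hypot(dx, source[1] - listener[1])
    # pan is -1 for a source fully to the left, +1 fully to the right.
    pan = dx / dist if dist > 0 else 0.0
    # Equal-power panning keeps left^2 + right^2 constant.
    angle = (pan + 1.0) * math.pi / 4.0
    gain = distance_gain(listener, source, ref_dist, max_dist)
    return gain * math.cos(angle), gain * math.sin(angle)
```

With these assumed defaults, a colleague speaking two units away is heard at half volume, and one beyond twenty units is not heard at all, which mimics walking away from a conversation in a physical office.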

4) SOCIAL
Because avatars can change skin color and gender as desired, they have the advantage of reducing preconceived notions related to social discrimination in conversations. These embodied avatars are more advantageous for simulating social problems than surveys and role-play. Papagiannidis et al. [222] conducted a study on the impact of corporate social responsibility focusing on ethical and policy-related issues. De Decker et al. [58] studied a process for solving complex social problems using the Metaverse. Smart et al. [223] explained the important characteristics of social change in the Metaverse and future opportunities.
The online requirements for cultural life (e.g., museums and performances) are gradually increasing. Although the limited capacity and time constraints of an offline concert hall are solved, the texture and fine detail that can be felt offline are still lacking. Tang [224] evaluated an immersive service using the Metaverse for educational library orientation. Choi and Kim [45] studied how visitors experience museums by combining beacons and HMDs. Hazan [24] explored how museum social and cultural experiences are evolving.

5) MARKETING
Economic activity is an important type of content in the Metaverse. It creates an ecosystem that sustains economic activity: users consume clothes and goods provided by the production company and also produce and sell them to other users. The Metaverse is a virtual world that can be used to predict the future by realistically reflecting the characteristics of reality. Kaplan et al. [5] dealt with how companies see its differences from other social media and utilize its potential. Cagnina et al. [6] conducted a study on the business model of a company in Virtual Worlds and Second Life. Papagiannidis et al. [25] described Second Life's take on this retail theater experience. Noghabaei et al. [225] covered industry trends in AR and VR technology adoption.

6) EDUCATION
Audiovisual-based education is an important application of the Metaverse with a high potential for popularization. Experiential education is important because what you read in writing and what you feel while experiencing it are different. For example, radiation is difficult to experience, so you may preconceive that it is simply dangerous. Through the Metaverse, it is possible to obtain educational effects by analyzing and experiencing radioactivity technically and scientifically [226]. Sung et al. [227] used facial electromyography to compare the level of immersion and three learning outcomes (learning attitude, enjoyment, and performance) of marketing students against existing static video presentations and showed that the Metaverse-based method is effective in education.
Kemp and Livingstone [4] analyzed the advantages and disadvantages of a multi-user virtual environment for education, and Collins [17] studied how to access, interact with, and generate information in higher education. Templeton et al. [228] addressed practical and educational considerations for learning teachers, and Suzuki et al. [9] conducted a study on mutual collaboration in learning IoT. The Metaverse is used in problem-based learning (PBL) as an educational framework [229], [230]. Barry et al. [51] evaluated the quality of instruction in a PBL task based on the increase in the number of eye blinks associated with students' emotional instability and difficult questions. Khan et al. [231] proposed safety training for children in the outdoor environment with VR, a Kinect sensor, and the Unity game engine. Muhammad et al. [232] reported the effectiveness of handheld marker-based AR in terms of performance, motivation, attitude, and behavior for primary school students.

D. DISCUSSION AND OPEN CHALLENGE
Component models for modal conversion have developed into various forms, from text-to-image conversion to image-to-image translation and video-to-video synthesis. Although the technology for generating the elements of a textual scenario has matured, integrated research on the creation of multimodal applications is insufficient. Along with the study of these transformations, there is also a need for studies on E2E learning that simplifies the integration of modules to reduce the complexity of creating multimodal applications. In addition, values, beliefs, attitudes, memories, and decisions are valuable concepts for extending in-depth applications through psychological and neuro-linguistic programming.

V. METAVERSE CASE STUDIES
The utilization of the Metaverse in science fiction (SF) is important not only for CGI (computer-generated imagery) but also for UI design. The futuristic UIs shown in the Minority Report wearable, G-speak, the Iron Man HUD, Oblivion, and Ender's Game provide visual insight for the Metaverse UI. Based on the taxonomy proposed in Section 1, we discuss which technologies are used in the movie Ready Player One, which is always referred to when talking about the Metaverse. In addition, we present a case study of Roblox, a representative Metaverse game. Finally, by looking at the recent research results of Facebook Research, we summarize the technical possibilities and approaches of the Metaverse.

A. METAVERSE MOVIE: READY PLAYER ONE
1) PHYSICAL DEVICES AND SENSORS
For Head-Mounted Displays, holographic HMDs and goggles-type HMDs are used in the movie, as shown in Fig. 6. The holographic HMD is mounted on the neck and projects the display in front of the user. Gloves with sensors that wrap around the hand are used as hand-based input methods. Non-hand-based input methods appear as full-body suits that can differentiate and deliver the impact of a push, punch, or gunshot. In addition, a shock sensor attached to the chest delivers shocks from the virtual world to the human body. As indirect assistive devices, a translucent display tablet used by children at school and a tablet with an expandable screen appear. A treadmill appears as a motion input method; it distinguishes walking from running, and safety is considered by securing the user with a belt.

2) RECOGNITION AND RENDERING
For scene and object recognition, object recognition was used in that the name of the game character and the performance and status of the motorcycle was displayed on the scope. Sound and speech recognition and synthesis were not specifically addressed. Basically, it seems that the recognition and expression of dialogue is a domain with rapid technological development, and it is assumed that it will be in a free state in the near future. Likewise, since it is an animation, they did not deal with specific motion rendering. On the other hand, scene and object generation is used in many places. A holographic map that converts to the actual background while zooming in, the rotation of surrounding buildings before the racing starts, and the effects that look mysterious as a light effect in a dramatic situation are representative examples of scene generation. There were detailed object generation methods (e.g., hair that changes according to feeling), UI interfaces for avatars (e.g., Jarvis), musical instruments played without a player, and textured surface reflections.

3) SCENARIO AND STORY
In a museum, personal photographs, home video recordings, surveillance footage, and nanny cams are expressed in a single multimodal content representation. Persona data generation in the Metaverse covers tall, beautiful, scary, different-sex, different-species, live-action, and cartoon avatars without restriction. The protagonist has a virtual best friend whom he has never met in the real world. The unrealistic colors and shapes of the avatars (e.g., gray skin, machine body, light source clothing, and fish) express the strengths of the virtual world well. Another characteristic of the movie is the thorough anonymization between the avatar and the real self. What happens in the Metaverse is regarded as having no direct influence on the real world. Representative virtual NPCs are a simulation curator who helps with introductions and events and an avatar of the creator who advances the story. Multimodal entity linking is used to create 3D virtual experiences from personal photographs, home video recordings, surveillance footage, and nanny cams. In terms of scenario generation, an important theme runs through the entire film. There are many sub-quests (e.g., collecting coins in the event space, planet12) and a main quest that continues the story.
The overall scenario features the death of a respected game creator and massive rewards for finding Easter eggs as missions. The creator's avatar guards three magic gate keys in various places, and the avatars go on an adventure to find the keys to the Easter egg, which confers rule over the Metaverse. The avatars carry this out in the Metaverse through missions that require a reasonable level of common sense and reasoning. The film makes the viewer more immersed in the scenario by showing things that are difficult to experience in reality (e.g., announcing the start from the flames of the Statue of Liberty and flying subways). The scenario is populated with visual experiences in the virtual environment that resemble real ones while the viewpoint and speed change. Scenario evaluation checks whether there is any contradiction when combining earlier and later events concerning the relationship between the creator and his girlfriend.

4) USER INTERACTION
Although language and multimodal interaction are not specifically mentioned, the movie assumes an organic combination, free of synchronization problems and awkwardness, in processing 3D stereoscopic screens and conversations as multi-task interactions. Overlaying the number of kills and damage on the gun, passing money, and throwing objects in the air show the advantages of embodied interaction. A car is carried in a form as small as a car key and stored in virtual space as an inventory item. When taken out, it gradually expands from a small icon to a real-sized car, giving visual pleasure. User interaction with NPCs is also considered. The AI robot helps the protagonist by using a search system in the museum. The immersive OASIS connection video distinguishes the real from the virtual. The final motion to exit the Metaverse is the pose of taking off the goggles.

5) METAVERSE APPLICATIONS
Basic simulations (e.g., ballet, boxing, piano, dance, and tennis) appear at the beginning. The opening video shows that things that are difficult to experience in ordinary life (e.g., hang gliding, waves in Hawaii, skiing at the pyramids, climbing with Batman and Everest, a planet-sized casino, divorce, and marriage) are possible. In the Metaverse, Gregarious games, Minecraft, and 3D pinball are minigames, and avatars receive reward coins according to their level and risk. Avatars acquire coins when a car or a person is destroyed in the Metaverse, and visual effects and damage are applied realistically when vehicles collide. Through a scoreboard with rankings in the Metaverse, the intermediate progress and results of the game are shared.
Regarding the Office, we expect an organizational approach of a company when looking at the employees of IOI companies who work in an independent space in a company with the shape of an avatar. It also talks about the possibility of an organized group with a commercial goal as an interesting company appearing inside the Metaverse and going to a racing game as if it were a job. From a social point of view, class according to grade, disconnection from children due to game participation, and side effects of anger caused by immersive game participation are described.
There is a view of blocking off the real world, which is shown by anonymizing people with numbers instead of names. On the other hand, some examples show communication through an interface with the real world. Even in the Metaverse world, a marketing singularity sells offline items (e.g., suits), which are also used appropriately as game items. Although the Metaverse mimics the real world, unrealistic control items (e.g., the time-turning item) are also used in a balanced way that does not threaten the world view. The depiction of education suggests that the form of education will not be much different from what it is now, although a translucent display tablet is used.

6) DISCUSSION AND OPEN CHALLENGES
Ready Player One shows negative aspects of the Metaverse (e.g., surrogate exam, taste cheating, and mirroring). The problem of over-addiction to the virtual world and excessive immersion is illustrated by a character upgrading a suit with money meant for rent. The Metaverse is based on separation from reality, but the movie depicts virtual events causing damage in the real world. Finding the owner of an avatar in the real world and jumping out of a window in anger over a defeat are mental problems in the immersive Metaverse. Falling off a chair and falling backward are shown as examples in which the Metaverse inflicts real physical damage.
Metaverse implementation is not described in detail in the movie except for the concept of life-long learning for the museum scene because it is a technical detail. By visualizing the avatar as a hologram in the real world, they showed it is possible not only in the forward direction (i.e., from reality to the Metaverse) but also in the reverse direction (i.e., from the Metaverse to reality). Eating, sleeping, and bathroom breaks are seen as new expandable possibilities in the sense that they are not done in the Metaverse.

B. METAVERSE GAME: ROBLOX
Roblox is used by two-thirds of 9-12-year-olds in the United States and is a representative Metaverse game with an MAU of 150 million [1]. Roblox is also used to develop simulations of urban environments that realize virtual paths to a city's sculptural heritage in the classroom, as shown in Fig. 7. Students were able to understand and integrate Santa Cruz's sculptural heritage to create their own interactive world with Roblox [2]. Creatively interpreting heritage through both form and programming is a good approach. Although education and entertainment have often been regarded as two separate worlds, Roblox is used as an educational tool in the classroom from the perspectives of motivation, problem-solving, and STEM [3].

1) PHYSICAL DEVICES AND SENSORS
Roblox supports the Oculus Quest 2 and HTC Vive as 3D HMDs. These VR headsets have a gyroscope, display screen, and built-in audio system and provide a VR experience as independent devices, unlike the Google Cardboard method, which uses a smartphone screen. The Quest 2 is compatible with PlayStation VR and does not require a gaming PC, but a VR-capable PC is required to run Roblox. It is the cheapest VR headset for Roblox and has a high resolution of 1832 × 1920 pixels per eye, but the front of the device is heavy, which is inconvenient. With full VR support, users play games like Skyrim, Half-Life: Alyx, and No Man's Sky in Roblox. The HTC Vive Pro has a resolution of 1440 × 1600 pixels and has a built-in audio system. There is additional padding throughout the device to distribute weight and improve balance, but the HTC Vive Pro only works via DisplayPort. Roblox supports VR devices and supports VR for some games, but this support is still limited. In addition, most devices are limited to HMDs, so they lack the versatility of the tactile and pressure sensors that a normal Metaverse would consider supporting. Since the main customers are the young generation, Roblox seems to focus on simple forms over complicated ones.

2) RECOGNITION AND RENDERING
The software used in Roblox is Lua and Roblox Studio. Lua is a small interpreted language with a footprint of a few hundred KB. Its scripts are executed line by line and are used to create in-game events, physics engine behavior, text output, and screen effects. It was developed with the goal of being a lightweight scripting language with a clean syntax that is easy to embed into C/C++ programs.
Roblox Studio is the official free utility software for creating custom games for Roblox. Users configure various game worlds and servers (e.g., mini-games, obstacle courses, and role-playing stories). Its characteristic is that even low-quality games are made by the users themselves and enjoyed with friends. Because low-end devices (e.g., mobiles) cannot run high-end games properly, their users experience lag, skin color errors, dialogue control errors, airplane malfunctions, etc. UnityML has the advantage of being able to link and use various recognition methods in a 3D environment, but Roblox is often composed of simple and lightweight forms, so it is insufficient in that respect.

3) SCENARIO AND STORY
Because there are many young users, the game effects and scenario complexity are low, but the content is diverse and novel. Roblox is an online game creator system in which most of the content is produced by amateur game creators. User-generated content (UGC) here refers to avatar accessories created by users; such items can be created by a user who has a reputation within the community and is skilled with modeling programs. When users satisfy the conditions and obtain Roblox's certification, they get permission to create items using meshes, and the items have to pass Roblox's review. UGC multiplies the elements of the game with limited items, but it also leads to counterfeit UGC and copies of unique items, which bring about a deflationary effect. Items that can disguise a character in Roblox games are also misused.
On the other hand, the strength of Roblox is that various users can easily create new games. Although it is relatively simple, it presents a new perspective with various and novel approaches. However, each game is centered on a single story and lacks the depth of the story because it does not have an elaborate plot. Sometimes, the story is similar, and the story development is not stable. When there is an authoring tool (e.g., multimodal story generator) to make and evaluate plots, users easily create more in-depth content and games.

4) USER INTERACTION
The Metaverse trading system supports exchanges between players for dollars, so a connection with the real world is also considered. The Premium service is differentiated in that subscribers can make shirts and pants and either give them away for free or sell them at a price. Roblox also reinforces interaction by providing a separate online hangout space called a party place. There are various auxiliary methods (e.g., facial expressions, clothes, motions, and words) for users to express their feelings in Roblox. However, there is room for improvement in real-time and tactile interactions. Special attention to interaction is required because children spend a lot of time on the platform. Roblox supports multiple languages, but the translation quality is low.

5) METAVERSE IMPLEMENTATIONS
One of the biggest problems for Metaverse commercialization is stable operation, especially 3D rendering, for many concurrent users. From an operational point of view, there are problems with hacking, extortion, and server outages. Management efforts are in place to ensure user safety (e.g., profanity prevention, review of image uploads, parental chat restrictions, and more than 1,600 administrators), but as the scale grows, the number of improper user behaviors increases. Game administrators build their own reporting systems to address these shortcomings and sanction offenders. On the other hand, excessive restrictions and privacy authority are also a problem. There is a privacy issue in that the management can censor personal messages and know a user's current location.

6) METAVERSE APPLICATIONS
Roblox is a game playground. Since there are not many games that elementary school students can play easily and comfortably, it can be seen as the kind of imaginative play found in playgrounds. Since game items and passes can break the game balance, maintaining balance is important for commercialization. A concert called One World: Together at Home was also held as an application. It is used as a tangible connection medium to generate revenue through the production of ZEPETO items and to carry value from the real world to the virtual world using Roblox currency. Phenomena seen in general society also occur, such as hyperinflation following the abolition of Tix (tickets).

7) DISCUSSION AND OPEN CHALLENGES
Roblox supports VR, but non-VR games account for a significant portion, and there is a possibility that it will develop into a more advanced form based on its large number of subscribers. Roblox is well known to the younger generation, so children can learn to code and make friends by taking Roblox coding classes and camps. However, despite the overall user acceptance level, it is difficult to check all the content because there are 50 million games. This management problem is serious in that the primary users are relatively young.

C. METAVERSE RESEARCH: FACEBOOK RESEARCH
Based on the papers published in Facebook Research from January to June 2021, we classified each paper into the taxonomy defined in Section 4 and summarized our approach in terms of Metaverse utilization, as shown in Fig. 8.

1) PHYSICAL DEVICES AND SENSORS
a: HEAD-MOUNTED DISPLAYS
One of the hallmarks of a Metaverse using a head-mounted display is that it sees the world from an egocentric perspective. Most video processing uses third-person video data sets, so egocentric video data is not enough. Third-person view data is not directly available in the Metaverse due to the inconsistency of the viewpoints, so an approach that transforms it into an egocentric video model is required. Li et al. [233] generated a model that exploited knowledge distillation loss during pre-training to obtain both the scale and diversity of third-person video data, as well as representations with prominent egocentric properties.
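As a rough illustration of what a knowledge distillation loss computes, the following sketch implements the generic temperature-scaled formulation (KL divergence between softened teacher and student distributions). It is not the specific loss of Li et al. [233], whose details are not given in this survey; the function names and the default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is the usual rescaling that keeps the
    gradient magnitude comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = 0.0
    for pt, ps in zip(p_teacher, p_student):
        if pt > 0.0:
            # Clamp the student probability to avoid log(0) on underflow.
            kl += pt * (math.log(pt) - math.log(max(ps, 1e-12)))
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, which is what allows an auxiliary signal (here, hypothetically, egocentric cues) to shape a model pre-trained on third-person video.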
Xian et al. [234] presented a method for learning a spatiotemporal neural irradiance field for a dynamic scene that enables preview rendering of the input video. Using the scene depth estimated in the video, they constrained the time-varying geometry of the dynamic scene representation and presented a single global representation of the contents of individual frames. Generating expressive camera motion for autonomous flight technology is difficult because it requires editing of several control parameters that are not intuitive for users. Bonatti et al. [235] developed a data-driven framework for editing complex camera position parameters in semantic space. They constructed a semantic control space by analyzing the correlation between technicians based on the study of filming guidelines and human perception.

b: HAND-BASED INPUT METHODS
On a physical keyboard, the resistance of the keys prevents erroneous input, but in the Metaverse, spurious input events must be isolated when typing with a virtual keyboard. Foy et al. [74] showed three alternative co-activation detection strategies with high accuracy. They developed StickyPie, a marking menu technology that enables scale-independent marking input by estimating intermittent landing positions. They identified issues inherent in eye movement control and current eye-tracking hardware, including erroneous selection activation, while reducing workload and eyestrain.
Natural hand manipulation is a task that requires complex finger manipulation to adapt to the shape and task of an object. Zhang et al. [236] proposed a generalizable hand-object space representation combining voxel occupancy and global object shape with local geometric details to the nearest sample. Social touch by hand is essential for social interaction and communication and reduces anxiety and loneliness. Rognon et al. [237] introduced mediated social contact that conveys indescribable emotions (e.g., love, empathy, reassurance), allowing devices to transmit haptic signals and physically interact at any distance.

c: NON-HAND-BASED INPUT METHODS
Research on input devices that use wrist motion without being attached directly to the hand is also increasing. With the growing interest in vibrotactile feedback in wearable wristband devices, Chase et al. [238] used information transfer as a metric to explore the signal variation space within a single vibrotactile actuator (e.g., frequency, amplitude, and modulation). Typical control systems rely on digital on/off control, which limits the degrees of freedom available when designing haptic experiences by allowing only inflation/deflation at a set rate. Stephens-Fripp et al. [239] presented an alternative system in which analog control of the pneumatic wave profile can be used to determine the optimal wave profile. The attack and release profiles were altered to create a more pleasant pulsating sensation at the wrist and a more lasting sensation of movement being transmitted around the wrist.

d: MOTION INPUT METHODS
To accurately estimate 3D human movement, both kinematics (i.e., body movement without physical force) and dynamics (i.e., movement with physical force) must be modeled. Yuan et al. [240] presented SimPoE, a simulation-based approach for 3D human pose estimation that integrates image-based kinematic inference with physics-based dynamic modeling. To obtain accurate pose estimates, a meta-control mechanism was used that dynamically adjusts the character's dynamic parameters according to the character's state. Neverova et al. [241] jointly learned the geometry of several categories of deformable objects to learn integrated dense pose predictors for several categories of related objects. Their method uses a symmetric inter-category cycle consistency and a new asymmetric image-category cycle consistency and improves performance over methods for 3D shape matching without manual annotation of inter-category correspondences.

2) RECOGNITION AND RENDERING
Lucas and Kozary [242] focused on the basics of teaching computers to think like humans when deciding which visual content is most interesting and important to human viewers. Computers see colors as numbers rather than as meaningful parts of an image and see textures as numbers rather than as meaningful hard and soft parts of an image. Some aspects are similar to human perception, but others differ, so the difference between humans and computers is an important research field.

a: SCENE AND OBJECT RECOGNITION
The hard inductive bias of CNNs allows for sample-efficient learning, but at the expense of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on more flexible self-attention layers and perform better than CNNs in image classification. However, they require expensive pre-training on large external data sets or distillation from pre-trained convolutional networks. d'Ascoli et al. [243] introduced gated positional self-attention (GPSA), a form of positional self-attention equipped with a soft convolutional inductive bias. Because cropping can cause large objects to be clipped or omitted, Lorenzo et al. [244] proposed a crop-aware bounding box regression loss (CABB loss) that encourages predictions to match the visible part of the cropped object. In response to the disproportionate distribution of object sizes, they also introduced a data sampling and augmentation strategy that improves generalization across scales. Cheng et al. [245] updated the standard evaluation protocols for instance and panoptic segmentation tasks by proposing the Boundary AP (Average Precision) and Boundary PQ (Panoptic Quality) metrics, respectively, based on Boundary IoU for image segmentation evaluation.
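The gating idea behind gated positional self-attention (GPSA) can be sketched as a learned blend between content-based attention and position-based attention. The sketch below assumes a single scalar gate passed through a sigmoid, which is our reading of the approach and not a reproduction of the implementation in [243]; the names `gated_attention` and `gate_param` and the toy score matrices are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(content_scores, positional_scores, gate_param):
    """Blend two attention maps with a sigmoid gate.

    content_scores[i][j]: query-key similarity of token i to token j.
    positional_scores[i][j]: score from the relative position of i and j.
    A gate near 1 behaves like a position-only (convolution-like) layer;
    a gate near 0 behaves like ordinary content-based self-attention.
    """
    g = sigmoid(gate_param)
    blended = []
    for c_row, p_row in zip(content_scores, positional_scores):
        c_attn = softmax(c_row)
        p_attn = softmax(p_row)
        blended.append([(1.0 - g) * c + g * p for c, p in zip(c_attn, p_attn)])
    return blended
```

Because each output row is a convex combination of two probability distributions, it remains a valid attention distribution, and training the gate lets each layer decide how strongly to keep the convolution-like locality prior.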
The spatially deformed spatial resolution of the retina is utilized for foveated video compression for immersive video requiring large bandwidths of high spatial and temporal resolution. Yize et al. [246] proposed FED (Foveated Entropic Differencing), a Full Reference (FR) centric image quality evaluation algorithm for centric video compression. Xiong et al. [247] presented a multi-view pseudo-labeling approach using complementary views in the form of shape and motion information for semi-supervised learning in the video. By acquiring pseudo-labels from unlabeled videos, more robust video representations were learned than purely supervised data.
Text information is needed as well as video. Huang et al. [248] proposed Multiplexed Multilingual Mask TextSpotter, an end-to-end (E2E) trainable and scalable multilingual, multi-purpose OCR system. It performs script identification at the word level and processes different scripts with different recognition heads, while maintaining a unified loss that simultaneously optimizes script identification and the multiple recognition heads. In the past, studying scene-text-based inference separately from OCR systems was difficult due to the lack of ground-truth text annotations or scene text detection and recognition datasets for real images. Singh et al. [249] introduced TextOCR for detecting and recognizing scene text of arbitrary shape, with 900k annotated words collected from real images in the TextVQA data set.
Since the Metaverse assumes a 3D environment, many 3D-related techniques (e.g., fast rendering and few-shot learning) are required. Sodhani et al. [250] proposed the open-source OpenNEED, a large-scale, high-frame-rate data set of non-eye (head, hand, and scene) and eye (3D gaze vector) signals. They proposed a robust eye tracker design that takes non-eye sensors into account in order to study the relationship between head, hand, scene, and gaze and to apply spatiotemporal statistics to gaze estimation. Henzler et al. [251] proposed warp-conditioned ray embedding (WCR), a neural network that is trained on multiple views of a large collection of object instances and learns to reconstruct an object in 3D from a small number of images. Ren et al. [75] introduced WyPR, a weakly supervised framework for point cloud recognition that requires only scene-level class tags as supervision. They jointly addressed point-level semantic segmentation, 3D proposal generation, and 3D object detection, coupling the predictions through self- and cross-task consistency losses.
Liu et al. [252] proposed Unbiased Teacher to address the pseudo-labeling bias problem of SS-OD (Semi-Supervised Object Detection). It is a simple but effective approach that jointly trains a student and a gradually progressing teacher in a mutually beneficial manner. Chen and He [253] showed that simple Siamese networks can learn meaningful representations without negative sample pairs, large batches, or momentum encoders, and that the stop-gradient operation plays an essential role in preventing collapse. Tian et al. [254] studied the nonlinear learning dynamics of non-contrastive SSL in a simple linear network, where SSL with only positive pairs avoids representational collapse. They provide conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how several factors (e.g., predictor networks, stop-gradient, exponential moving averages, and weight decay) come into play.
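The teacher-student scheme described above can be illustrated with a minimal sketch. It assumes two ingredients that are common in such methods: the teacher's weights are an exponential moving average (EMA) of the student's, and only confident teacher predictions are kept as pseudo-labels. The function names, the flat weight lists, and the values of `alpha` and `threshold` are hypothetical choices for illustration, not the configuration of [252].

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights as an exponential moving average of the student.

    `teacher` and `student` are flat lists of floats standing in for model weights.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def filter_pseudo_labels(predictions, threshold=0.7):
    """Keep only (label, confidence) pairs whose confidence reaches the threshold."""
    return [(label, score) for label, score in predictions if score >= threshold]


# Toy usage: the teacher drifts slowly toward the student,
# and low-confidence detections are discarded before training the student.
teacher_weights = ema_update([0.0, 1.0], [1.0, 0.0], alpha=0.9)
pseudo_labels = filter_pseudo_labels([("cat", 0.95), ("dog", 0.40)])
```

Because the teacher changes slowly, its pseudo-labels are more stable than the student's own predictions, which is the usual motivation for this design.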

b: SOUND AND SPEECH RECOGNITION
Metaverse runs in a variety of places, from a relatively quiet house to a space where a variety of people gather. Donley et al. [255] proposed an automated multi-channel signal enhancement solution based on the Linearly Constrained Minimum Variance (LCMV) beamformer to improve voice communication in noisy environments. A beamformer is used to estimate the relative contribution of each source in the mixture, and these estimates are then used to weight statistical estimates of the spatial properties of each source for the final separation. This allows instant selection of desired and undesired sources. Multi-channel speech enhancement for dialogue aims to extract clear speech from a noisy mixture using signals captured by multiple microphones. Panagiotis et al. [256] applied a graph neural network (GNN) to find the spatial correlation between the channels and integrated a graph convolution network (GCN) into the embedding space of a U-Net architecture. Helmholz et al. [257] introduced the Real-Time Spherical Array Renderer (ReTiSAR) to analyze how the sensors' self-noise propagates through the processing pipeline. The instrumental evaluation confirmed the strong global impact of various array and rendering parameters on the spectral balance and the overall level of the rendered noise. In a perceptual user study, they determined the audible threshold of coloration artifacts during head rotation for various array configurations. They also showed that binaural rendering of spherical microphone array signals increases the SNR of the rendered signal by up to 9 dB for array configurations with larger radii and spherical harmonic orders of four or higher. Chazan et al. [258] presented an integrated network for speech separation of an unknown number of speakers and presented a noisy and reverberant dataset with up to five speakers.
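For reference, the textbook form of the LCMV beamformer can be written as follows. This is the standard formulation rather than the specific estimator of [255]; the symbols are assumptions introduced here: $\mathbf{R}$ is the spatial covariance matrix of the microphone signals, the columns of $\mathbf{C}$ are the steering vectors of the constrained sources, and $\mathbf{f}$ holds the desired response for each source (e.g., 1 for a desired talker and 0 for an undesired one).

```latex
\mathbf{w}_{\mathrm{LCMV}}
  = \arg\min_{\mathbf{w}} \; \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}
  \quad \text{subject to} \quad \mathbf{C}^{H}\mathbf{w} = \mathbf{f},
\qquad
\mathbf{w}_{\mathrm{LCMV}}
  = \mathbf{R}^{-1}\mathbf{C}\left(\mathbf{C}^{H}\mathbf{R}^{-1}\mathbf{C}\right)^{-1}\mathbf{f}.
```

Changing the entries of $\mathbf{f}$ switches which sources are passed and which are suppressed without re-estimating $\mathbf{R}$, which is consistent with the instant selection of desired and undesired sources mentioned above.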
Research on high-quality surround sound audio generally assumes a fixed position of the recording microphone, such as in a movie theater. However, in the Metaverse, users change their listening position as they run, spin, or otherwise move their bodies. Birnie et al. [70] proposed a method for binaural playback of microphone recordings in virtual applications in which the listener moves freely beyond the recording location. They integrate near and far sources in an extended virtual environment and better reproduce the intensity and binaural room impulse response spectrum of the near environment.
Early room reflection estimation is an important task in audio signal processing, with applications in beamforming, source separation, room geometry inference, and spatial audio. Shlomo and Boaz [259] proposed a solution for blind estimation of reflection amplitudes using iterative estimators based on maximum likelihood and alternating least squares. Blind estimation of the direction of arrival (DOA) and delay of indoor reflections is useful for a wide range of applications, but conventional methods detect only a few reflections. Shlomo and Boaz [260] proposed PHase ALigned CORrelation (PHALCOR) for blind estimation of the delays and DOAs of early reflections.
Head-Related Transfer Function (HRTF) is used to simulate external sound by measuring the sound source's spectrum in three-dimensional space. HRTF individualization enables realistic and immersive spatial audio rendering in Metaverse. Zhou et al. [261] identified the lowest spectral distance error by exploring the range of HRTF predictability using a deep neural network with a 3D ear shape as input. In practice, binaural reproduction is also affected by HRTF, along with truncation errors that detrimentally affect the perception of the reproduced signal. Because pretreatment of HRTF by ear alignment prevents effective recognition, Ben-Hur et al. [262] presented a method for integrating preprocessed ear-aligned HRTFs into the binaural regeneration process. HRTF is the key to audio spatialization. However, it cannot produce sufficient sound output levels at low frequencies (below 300 Hz) while maintaining an omnidirectional pattern. To address this problem, Chojnacki et al. [263] proposed a new design to overcome the limitations of this low-frequency range at higher frequencies. Gari et al. [264] analyzed and rendered multi-channel RIR (Room Impulse Response) by parameterizing the sound field as a series of plane waves for the Spatial Decomposition Method (SDM). They reduced the unnatural arrival direction diffusion of late reflections by spatial clustering of reflections in the post-processing and solved the whitening problem of late reverberations with a binaural RIR corrected equalization method, RTMod+AP.
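To make the role of the HRTF in spatial rendering concrete, the sketch below renders a mono signal to two ears by convolving it with a left and a right head-related impulse response (HRIR, the time-domain counterpart of an HRTF). The three-tap HRIRs are toy values invented for illustration; measured or individualized HRIRs, as discussed above, are much longer and direction dependent.

```python
def convolve(signal, kernel):
    """Plain linear convolution of two lists of floats."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono source with per-ear impulse responses."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


mono = [1.0, 0.5, 0.25]
hrir_left = [1.0, 0.0, 0.0]   # toy value: direct, unattenuated path
hrir_right = [0.0, 0.6, 0.0]  # toy value: one sample later and quieter
left, right = render_binaural(mono, hrir_left, hrir_right)
```

In this toy case the right-ear signal is a delayed and attenuated copy of the left-ear signal, which mimics the interaural time and level differences produced by a source on the listener's left.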

c: SCENE AND OBJECT GENERATION
Ge et al. [265] proposed DoodlerGAN, a part-based GAN (Generative Adversarial Network) that generates creative, high-quality sketches with unseen compositions of novel part appearances. They also introduced two creative sketch datasets: Creative Birds and Creative Creatures. Aiming to increase the resolution and level of detail of super-resolved images, Roziere et al. [266] utilized an evolutionary method to improve NESRGAN+ by optimizing noise injection at inference time. They used Diagonal CMA to optimize the injected noise according to a new criterion that combines quality assessment and realism. Lassner and Zollhofer [267] proposed Pulsar, an efficient sphere-based differentiable rendering module that is fast, modular, and easy to use. It avoids topological problems by using spheres for scene representation. It uses an efficient differentiable projection operation and neural shading to alleviate topology inconsistency problems, high memory footprint, and slow rendering speed.

d: SOUND AND SPEECH SYNTHESIS
The computational complexity of the transformer increases quadratically with sequence length, making it impractical for many real-time applications. Wu et al. [268] proposed an efficient transformer-based acoustic model with constant speed regardless of input sequence length for streaming speech synthesis applications. They used the Emformer network to predict frame-rate spectral features in a streaming manner and a WaveRNN neural vocoder to generate the final audio from the predicted spectral features. They demonstrated consistent performance, low latency, and a low real-time factor over various utterance lengths. Richard et al. [269] presented a neural rendering approach for binaural sound synthesis that generates spatially accurate binaural sound in real-time. They proposed end-to-end neural binaural sound synthesis that outperforms DSP-based methods in a perceptual study and a qualitative evaluation.
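The difference between quadratic and length-independent per-frame cost can be seen with a simple cost model that counts query-key score computations. This is only a back-of-the-envelope sketch, not the Emformer implementation: the parameters `segment`, `left_context`, and `memory` are hypothetical stand-ins for a streaming model that attends to the current segment, a bounded window of past frames, and a fixed number of summary vectors.

```python
def full_attention_scores(n):
    """Full self-attention: every frame attends to every frame."""
    return n * n


def streaming_attention_scores(n, segment=4, left_context=8, memory=2):
    """Segment-wise attention with a bounded left context and a fixed memory bank."""
    total = 0
    for start in range(0, n, segment):
        queries = min(segment, n - start)
        # Keys: bounded history, the current segment itself, and memory slots.
        keys = min(start, left_context) + queries + memory
        total += queries * keys
    return total


short_cost = streaming_attention_scores(8)
long_cost = streaming_attention_scores(1000)
```

Under these assumptions each frame attends to at most `left_context + segment + memory` keys, so the total cost grows linearly with the utterance length and the per-frame cost stays constant.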

e: MOTION RENDERING
Control strategies for physically simulated characters performing two-person competitive sports (e.g., boxing and fencing) serve as a reference for effective motion rendering in the Metaverse. Won et al. [270] developed a learning framework for generating control policies for physically simulated athletes with many degrees of freedom. The learned control policies generate both tactical and natural-looking behavior. Ye et al. [271] proposed a learning-based approach that infers an object's 3D shape and pose from a single image and learns from unstructured image collections supervised only by the segmentation outputs of an off-the-shelf recognition system (i.e., shelf supervision). They first inferred a volumetric representation in a canonical frame together with the camera pose. They then performed shape-pose decomposition and a more detailed instance-by-instance reconstruction of the image collections. Yuan et al. [272] proposed STAR, which performs self-supervised tracking and reconstruction of dynamic scenes with rigid motion from multi-view RGB video without manual annotation. A dynamic scene containing a single rigid object in motion is reconstructed by decomposing it into two constituent parts and encoding each with its own neural representation. They jointly optimized the parameters of the two neural radiance fields and a set of rigid poses that align the two fields in each frame. Ng et al. [273] studied 3D hand shape synthesis and estimation from body motion in the domain of conversational gestures, based on the assumption that body movements and hand gestures are strongly correlated in non-verbal communication. Their hand prediction model generates 3D hand gestures with only the 3D motion of the speaker's arms as input. Eisenberger et al. [274] proposed NeuroMorph, a neural network architecture that takes two 3D shapes as input and, in a single end-to-end pass, produces an interpolation between them together with their correspondences. It is trained in a fully unsupervised manner without manual correspondence annotation. It combines graph convolution with global feature pooling to extract local features, and geodesics in the resulting shape-space manifold are approximated to produce realistic deformations.

3) SCENARIO AND STORY

a: MULTIMODAL CONTENT REPRESENTATION
In the task of retrieving, from a database, the images linked to a spliced query image, Chen et al. [275] proposed to represent an image as a set of constituent objects, based on the intuition that the finest granularity of manipulation is often at the object level. They introduced OE-SIR, an object-embedding framework for Spliced Image Retrieval that uses object detectors and a teacher-student model to localize object regions.

b: PERSONA DATA GENERATION
Kiela et al. [276] introduced Dynabench, an open-source platform that runs in a web browser and supports human-and-model-in-the-loop dataset creation for dynamic model benchmarking. With Dynabench, data set creation, model development, and model evaluation inform each other directly, resulting in more robust and informative benchmarks. Current models for Word Sense Disambiguation (WSD) achieve human-level performance on global WSD metrics, but data for modeling and evaluating rare senses is lacking. Blevins et al. [277] established baselines on the FEWS dataset using knowledge-based and neural WSD approaches and showed that a model further trained with FEWS better captures rare senses in existing WSD datasets.

c: MULTIMODAL ENTITY LINKING AND EXPANSION
Context and entity affinity is mainly captured through a vector dot product, which potentially misses fine-grained interactions between the two, and storing dense entity representations requires a large memory footprint. De Cao et al. [278] proposed GENRE, which retrieves entities by generating their unique names token by token in a left-to-right autoregressive fashion, conditioned on the context. It directly captures the relationship between context and entity name, effectively cross-encoding both, and greatly reduces the memory footprint because the model scales with the vocabulary size rather than the number of entities.

d: SCENARIO GENERATION
For the motion transfer task between a source dancer and a target person, Gafni et al. [279] proposed a model that reanimates a single image with an arbitrary video sequence. They combine three networks: a segmentation mapping network, a realistic frame-rendering network, and a face enhancement network.

e: SCENARIO POPULATION
Data augmentation methods introduce distribution shifts and consequently degrade performance on non-augmented data during inference. Gong et al. [280] used a saliency map to detect important regions in the original image and preserved these informative regions during augmentation. Because moments extracted by instance normalization and positional normalization roughly capture the style and shape information of an image, Li et al. [281] proposed Moment Exchange, an implicit data augmentation method that encourages recognition models to utilize this moment information as well. Knowledge Distillation (KD) is needed to deploy models on low-resource (both memory and computation) platforms, but distilled models tend to make inconsistent predictions when the data distribution changes slightly. Liang et al. [282] proposed MixKD, a data-agnostic distillation framework that utilizes a simple but efficient data augmentation approach to give the resulting model stronger generalization capabilities.
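The combination of augmentation and distillation can be sketched as follows, assuming mixup-style interpolation of two inputs and a temperature-softened KL divergence between teacher and student outputs. This is a generic illustration of the idea under those assumptions rather than the training recipe of [282]; the function names, the temperature, and the toy vectors are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixup(x_a, x_b, lam):
    """Linear interpolation of two input (or embedding) vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# The student is asked to match the teacher on an interpolated example.
mixed_input = mixup([1.0, 0.0], [0.0, 1.0], lam=0.25)
loss_same = distillation_loss([2.0, -1.0], [2.0, -1.0])
loss_diff = distillation_loss([-1.0, 2.0], [2.0, -1.0])
```

Querying the teacher on interpolated inputs yields additional supervision without additional labeled data, which is one way such a framework can remain data-agnostic.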

f: SCENARIO EVALUATION
Jia et al. [283] systematically studied the extent to which visual information (i.e., objects and contexts) contributes to understanding human motives, analyzing how well the intentions behind social media images can be recognized from visual information. They introduced Intentonomy, an intent dataset consisting of 14K images covering a wide range of everyday scenes. When training intent classifiers, they performed additional studies to quantify the effects of attending to object and context classes and of textual information in the form of hashtags. Huang et al. [284] implemented a post-processing step that applies simple modifications of the standard label propagation technique from early graph-based semi-supervised learning methods. A cyber-physical digital twin is a simulation of a non-software (physical) system, which has recently received much attention, but its cyber-cyber counterpart has been relatively overlooked. Ahlgren et al. [285] measured the practical impact on the design, implementation, and deployment of digital twins that are true twins in the sense that they simulate other software systems.

4) USER INTERACTION
Speech recognition-based natural language dialog is the basic medium of user interaction. Recently, BlenderBot 2.0 [286] was proposed based on two studies: Internet search engine-based generation and long-term memory integration, as shown in Fig. 9. LM-based dialog generation models suffer from the hallucination problem of generating plausible sentences that are not factual. To prevent this problem, the model searches the Internet and generates its final response based on the retrieved information. In addition, in most conversational studies, a single conversation session consists of a short conversation (typically 2-15 turns), because the dialogue engine gives scenario-specific answers rather than responses based on long-term memory. The proposed model provides improved search capabilities with the ability to summarize and recall previous conversations. However, since it is an open-dialogue model that can expose personal information (e.g., long-term memories and the speaker's personal interests), careful attention is required in its management.

a: LANGUAGE INTERACTION
Reference games illustrate the functional use of language for communication and provide a basic learning environment for neural agents. Languages are inherently biased by the underlying capabilities of agents. Dagan et al. [287] introduced the Language Transmission Simulator to model the cultural and architectural evolution of agent populations. They emphasize the importance of studying the underlying agent architectures and propose the co-evolution of languages and agents in the study of language emergence. With their recent development, LMs are widely used for various natural language tasks and across modalities. The Recurrent Neural Network Transducer (RNN-T) is a popular method in automatic speech recognition due to its simplicity, conciseness, and general transcription quality, but it lacks an external language model and is more vulnerable to rare long-tail words (e.g., entity names). Le et al. [288] presented techniques that improve the ability of RNN-T to model rare WordPieces, inject additional information into the encoder, enable the use of alternative graphemic pronunciations, and perform deep fusion with personalized language models for stronger biasing. Weber et al. [289] considered language modeling as a multi-task problem, combining three lines of study (multi-task learning, linguistics, and interpretability) to analyze the generalization behavior of language models on Negative Polarity Items (NPIs).
QA is the most basic solution for communicating with NPCs in the Metaverse. Annotated data sets are difficult and expensive to collect and rarely exist in languages other than English, which makes it hard to build QA systems that work well in other languages. Lewis et al. [290] proposed MLQA, a multi-way aligned extractive QA evaluation benchmark. Xiong et al. [291] proposed a simple and efficient multi-hop dense retrieval approach to answer complex open-domain questions, achieving state-of-the-art performance on two multi-hop data sets, HotpotQA and multi-evidence FEVER. Min et al. [292] addressed building systems that predict correct answers in open QA, receiving natural language questions as input and returning natural language answers, while meeting strict on-disk memory budgets. The memory budgets encourage exploring the trade-off between storing large, redundant retrieval corpora and storing the parameters of large trained models.
Multilingual support is required to provide a natural interface that covers the wide reach of the Metaverse. Because a common language (e.g., English) has limitations for fluent communication, multilingual translation is required to provide a natural interface in other languages. Schwenk et al. [293] presented a multilingual sentence embedding-based approach to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages. In speech translation, multi-task learning tends to generate similar decoder representations from different modalities and to preserve more information from the pre-trained text translation modules. Tang et al. [294] proposed a parameter sharing and initialization strategy to enhance information sharing between tasks, together with a new attention-based regularization for encoders and an online knowledge distillation method to improve knowledge transfer. Quality estimation aims to measure the quality of translated content without access to reference translations. Tuan et al. [295] proposed a method that does not rely on examples from human annotators but instead uses synthetic training data.

b: MULTIMODAL INTERACTION
Noise contrastive learning for video-text representation learning increases the similarity of the representations of known sample pairs and repels all other representations. This is too strict, because it enforces dissimilar representations even for semantically related samples. Patrick et al. [296] proposed a method that mitigates this by using a generative model to naturally push these related samples together, as depicted in Fig. 10. The caption of each sample is reconstructed as a weighted combination of the visual representations of other supporting samples. It is difficult to learn the grounding of each word due to noise and the presence of words that cannot be meaningfully grounded visually. Meng et al. [297] presented a jointly trained model architecture for controlled trace generation and controlled caption generation. They proposed a local bipartite matching (LBM) distance metric that compares two traces of different lengths to evaluate the quality of the generated trace. Because audio and video signals are not always informative of each other, audiovisual correspondences often result in false positives. Morgado et al. [298] optimized a weighted contrastive learning loss that lowers the contribution of such pairs to the overall loss, and optimized the instance discrimination loss with a soft target distribution that estimates the relationships between instances. Morgado et al. [299] optimized visual similarity rather than simple cross-modal similarity using self-supervised (SS) contrastive learning with cross-modal audio and visual recognition.
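The attract-and-repel behavior of noise contrastive learning discussed above can be sketched with an InfoNCE-style loss over a small batch, where the i-th video is paired with the i-th caption and every other caption in the batch acts as a negative. This is a generic sketch under those assumptions; the embeddings, the `temperature` value, and the function names are illustrative and not taken from [296]-[299].

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(video_embs, text_embs, temperature=0.1):
    """Average contrastive loss where pair (i, i) is positive and all (i, j != i) are negatives."""
    losses = []
    for i, v in enumerate(video_embs):
        logits = [dot(v, t) / temperature for t in text_embs]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(videos, [[1.0, 0.0], [0.0, 1.0]])     # captions match their videos
misaligned_loss = info_nce(videos, [[0.0, 1.0], [1.0, 0.0]])  # captions swapped
```

Note that the loss treats every off-diagonal pair as a negative, even when two captions describe nearly the same content, which is the strictness that the methods above try to relax.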

c: MULTITASK INTERACTION
Szot et al. [300] proposed a simulation platform for training virtual robots in interactive 3D environments and complex physics-based scenarios. Experimental results on HAB (Home Assistant Benchmark) showed that flat RL policies struggle compared to hierarchical policies, and that a hierarchical structure with independent skills suffers from hand-off problems. In audiovisual navigation, agents use both sight and sound to move intelligently through complex, unmapped 3D environments. Existing models act at a fixed granularity of agent motion and rely on simple recurrent aggregation of audio observations. Chen et al. [301] instead introduced waypoints that are dynamically set and learned end-to-end within the navigation policy, as shown in Fig. 11, together with an acoustic memory that provides a structured, spatially grounded record of what the agent hears as it moves. Recent work on audiovisual navigation assumes a continuously sounding target and limits the role of audio to signaling the target's location. Chen et al. [302] introduced semantic audiovisual navigation, in which objects in the environment make sounds consistent with their semantic meaning (e.g., flushing toilets, creaking doors) and acoustic events are sporadic or short in duration. They proposed a transformer-based model for this new semantic AudioGoal task that incorporates an inferred goal descriptor capturing the object's spatial and semantic properties. Persistent multimodal memory allows the target to be reached even after the acoustic event has stopped. ObjectGoal Navigation (OBJECTNAV) is the task of an agent navigating to object instances in an unseen environment, where end-to-end learned agents suffer degraded performance due to overfitting and sample inefficiency. This has motivated methods that integrate learned components with explicit spatial maps of the environment. Ye et al. [303] instead re-enabled a generic learned agent by adding auxiliary learning tasks and an exploration reward.

d: EMBODIED INTERACTION
It is unclear how to optimize the layout of 3D UI controls for on-body and mid-air interactions. Li et al. [72] evaluated the performance and limitations of a 3D UI anchored to the non-dominant arm in a VR environment through a two-handed pointing study. They demonstrated that targets that appear closer to the skin (i.e., located around the wrist and placed on the inside of the forearm) can be selected faster than targets that are further away from the skin (i.e., around the elbow on the side of the arm). Bagautdinov et al. [304] presented a learning-based method for constructing driving-signal-aware full-body avatars. They generate high-quality representations of human geometry and view-dependent appearance using a conditional variational auto-encoder that can be animated with imperfect driving signals (e.g., human poses and facial keypoints). Better drivability and generalization were achieved by disentangling the driving signals from the remaining generative factors, which are not available during animation.
Modeling thin structures (e.g., hair) suffers from low resolution and slow rendering. Lombardi et al. [305] presented a representation for rendering dynamic 3D content that combines the completeness of a volumetric representation with the efficiency of primitive-based rendering. It utilizes spatially shared computation with a convolutional architecture and uses volumetric primitives that are moved to cover only the occupied portions of space. Sun et al. [306] introduced a hair inverse rendering framework for reconstructing high-fidelity 3D geometry and reflectance of hair that can readily be used for realistic hair rendering. They proposed a new solution for line-based multi-view stereo that calculates accurate hair geometry from multi-view photometric data, and they estimated the hair reflectance properties from the same data.
In the Metaverse, avatar clothing is not just for decoration but a means of providing immersion and emphasizing social roles. To create high-definition animations, Xiang et al. [307] proposed a method to create an animatable clothed body avatar that explicitly represents upper-body clothing, learned from multi-view captured video, as shown in Fig. 12. They use a two-layer mesh representation to register the 3D scans separately with the templates and, to improve photometric correspondence, perform texture alignment through inverse rendering of the garment geometry and texture predicted by a variational autoencoder. Chaudhuri et al. [308] proposed ReAVAE (Region-adaptive Adversarial Variational AutoEncoder), which learns the probability distribution of each region individually and generates diverse high-fidelity texture maps for 3D human meshes by sampling from the distribution of each region. They present a data generation technique that augments the training set with data taken from a single-view RGB input.

FIGURE 11. Learning to set waypoints for audio-visual navigation in the indoor environment [301].

FIGURE 12. The process of cloth rendering which includes single-layer surface tracking and inner-layer shape estimation [307].

Face models can be generalized to natural lighting conditions, but doing so is computationally expensive to render. Bi et al. [309] presented a method to build animatable high-definition 3D face models that can be posed and rendered in real-time in novel lighting environments. They train a generalizable model and use it to generate a training set of high-quality synthetic face images under natural lighting conditions. In the DNR pipeline, the neural shading phase must account for deformations that are not captured in the mesh, as well as alignment inaccuracies and dynamics, all of which can confound the pipeline. Raj et al. [207] proposed Articulated Neural Rendering (ANR), a DNR-based framework that explicitly addresses these limitations for virtual human avatars. Ma et al. [310] proposed Pixel Codec Avatars (PiCA), a deep generative model of the 3D human face that is computationally efficient and adaptable to runtime rendering conditions while achieving state-of-the-art reconstruction performance, as depicted in Fig. 13. They use a fully convolutional architecture for decoding spatially varying features and a rendering-adaptive per-pixel decoder, integrated through a dense surface representation that is learned in a weakly supervised manner from low-topology mesh tracking on training images. It performs well on test expressions and views for people of different genders and skin tones.

5) METAVERSE IMPLEMENTATIONS

a: MULTIMODAL INFERENCE
Self-supervised pre-training can outperform fully supervised training and is useful in preventing overfitting to smaller data sets. Shukla et al. [311] showed the potential of visual self-supervision for learning audio features. They proposed that joint visual and audio self-supervision leads to more informative audio representations for speech and emotion recognition. The proposed multi-task combination of visual and auditory self-supervision is useful for learning more powerful and richer features in noisy conditions.

b: RL-BASED APPROACHES
In procedurally generated environments, each environment instance is generated algorithmically from a unique configuration of the underlying factors of variation, which makes them important benchmarks for testing systematic generalization in deep reinforcement learning. Jiang et al. [312] proposed Prioritized Level Replay (PLR), a general framework for selectively sampling the next training level by prioritizing levels that are expected to have higher learning potential upon future revisit. They showed that TD errors effectively estimate the future learning potential of a level and, when used to guide the sampling procedure, induce a curriculum of increasingly difficult levels. Modhe et al. [313] proposed a novel framework for identifying sub-goals that are useful for exploration in sequential decision-making tasks under partial observability, improving exploration and sample complexity. They utilized the variational intrinsic control framework, which maximizes empowerment, i.e., the ability to reach a diverse set of states reliably. It identifies sub-goals as states with high necessary option information through information-theoretic regularization.
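The level-sampling step of such a prioritized replay scheme can be sketched as follows, assuming rank-based prioritization: levels are ranked by a learning-potential score (for example, an average TD-error magnitude) and sampled with probability proportional to (1 / rank)^(1 / beta). The temperature `beta`, the function name, and the scores are illustrative assumptions, and other components that a full system may use (such as a staleness term for levels that have not been visited recently) are omitted.

```python
def replay_probabilities(scores, beta=1.0):
    """Turn per-level scores into a rank-based sampling distribution.

    A smaller `beta` concentrates probability on the top-ranked levels.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    weights = [0.0] * len(scores)
    for rank, level in enumerate(order, start=1):
        weights[level] = (1.0 / rank) ** (1.0 / beta)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores for three levels; level 1 has the highest learning potential.
level_scores = [0.2, 0.9, 0.5]
probs = replay_probabilities(level_scores, beta=1.0)
sharp_probs = replay_probabilities(level_scores, beta=0.1)
```

Using ranks instead of raw scores makes the distribution insensitive to the scale of the scores, which can change considerably as the value function improves during training.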
To control dynamic systems from high-dimensional sensory observations efficiently, learning controllable embedding (LCE) methods embed observations in a low-dimensional latent space and estimate the latent dynamics in order to perform control in that latent space. Cui et al. [314] proposed value-guided CARL, a modification that optimizes a weighted version of the CARL loss function whose weights depend on the TD error of the current policy. In the offline implementation, the locally linear control algorithm (e.g., iLQR) used in existing LCE methods was replaced by an RL algorithm (i.e., a model-based soft actor-critic). Model-based reinforcement learning utilizes control-based domain knowledge to improve the sample efficiency of reinforcement learning agents. In practice, however, model-based agents tend to lag behind model-free agents in terms of final reward, especially in environments where a model is not critical. Amos et al. [315] found an effective combination of model-free soft value estimation for policy evaluation and a model-based stochastic value gradient for policy improvement on high-dimensional humanoid control tasks.

c: LIFE-LONG LEARNING
Sukhbaatar et al. [316] proposed Expire-Span, which learns to retain the most important information and to expire irrelevant information, since not all past content needs to be remembered equally. To evaluate models on lifelong learning tasks, Abdelsalam et al. [317] developed a standardized benchmark that enables model evaluation in the IIRC setting. Methods based on network expansion naturally add model capacity for learning new tasks while avoiding catastrophic forgetting, but the growing number of additional parameters is computationally expensive at larger scales. Verma et al. [318] proposed a simple task-specific feature map transformation strategy for continual learning called Efficient Feature Transformations (EFT). It adds a minimal number of parameters to the underlying architecture while providing strong flexibility to learn new tasks. To solve the catastrophic forgetting problem in sequential task settings where data from previous tasks are not available, Mehta et al. [319] proposed a principled Bayesian nonparametric approach based on the Indian Buffet Process (IBP), which lets the data determine how much to expand the model complexity. The IBP prior promotes positive knowledge transfer between tasks by encouraging sparse weight factor selection and factor reuse. The goal of continual learning (CL) is to learn a series of tasks without experiencing catastrophic forgetting. Ebrahimi et al. [320] proposed a simple training paradigm, Remembering for the Right Reasons (RRR), which encourages models to keep the right explanations for their predictions.
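The expiration idea at the start of this paragraph can be illustrated with a toy memory in which every stored item carries a span, i.e., the number of time steps for which it should be retained. In a learned system the span would be predicted by the model from its hidden state; here it is passed in by hand, and the class and method names are hypothetical.

```python
class ExpiringMemory:
    """Toy memory whose entries are dropped once their span has elapsed."""

    def __init__(self):
        self.items = []  # list of (time_written, span, content)

    def write(self, t, content, span):
        """Store `content` at time `t` and keep it for `span` further steps."""
        self.items.append((t, span, content))

    def read(self, t):
        """Drop expired entries, then return the contents still remembered at time `t`."""
        self.items = [(t0, span, c) for (t0, span, c) in self.items if t - t0 <= span]
        return [c for _, _, c in self.items]


memory = ExpiringMemory()
memory.write(0, "small talk", span=1)   # judged unimportant: short span
memory.write(0, "user name", span=10)   # judged important: long span
remembered_at_5 = memory.read(5)
remembered_at_11 = memory.read(11)
```

Because expired entries are removed rather than merely down-weighted, the memory size, and hence the attention cost, depends on how many items are still relevant rather than on the total length of the history.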

d: MULTI-AGENT OPTIMIZATION
The benefit of multi-task learning is that it uses relationships across tasks to improve the performance on each single task. Metadata is useful for improving multi-task learning performance, but integrating it effectively is an additional challenge. Sodhani et al. [321] showed state-of-the-art results on Meta-World, a challenging multi-task benchmark. Their method learns interpretable representations and uses metadata as context to decide which representations to compose and how to compose them. Fu et al. [322] proposed LeTS, a framework that utilizes multi-task computation and parameter sharing for efficient fine-tuning. It decouples the computational dependencies of existing fine-tuning models with a neural architecture that reuses intermediate results, and it reduces computational demands by leveraging the sparsity of weight differences. Zhang et al. [323] used Hidden-Parameter Markov Decision Processes (HiP-MDPs) to explicitly model the structure shared across tasks and thereby improve sample efficiency in multi-task settings. In the HiP-MDP setting, they exploited the idea of a common structure and extended it to enable state abstraction inspired by block MDPs.
Dollar et al. [324] proposed a simple and fast compound scaling strategy that scales up an underlying convolutional network to give it greater computational complexity and, consequently, expressive power. It provides a framework for analyzing scaling strategies under various computational constraints. Ruiz and Verbeek [325] proposed Hierarchical Neural Ensembles (HNE) to handle scenarios where the available computation and the input data vary over time. HNE arranges an ensemble of multiple networks in a hierarchical tree structure that shares intermediate layers. Hierarchical distillation, which increases the prediction accuracy of small ensembles, uses the nested structure of the ensembles to allocate accuracy and diversity across individual models optimally.

e: INTEGRATION OPTIMIZATION
GPU performance and efficiency of recommendation models are affected by model architecture configurations (e.g., dense and sparse features and MLP dimensions). Acun et al. [326] described the complexity of using GPUs for training recommendation models, factors influencing hardware efficiency at scale, and a new scale-up GPU server design from Zion. Silent Data Corruption (SDC) has a negative impact on large-scale infrastructure services. Dixit et al. [327] used case studies to describe a debug flow for root-causing and triaging this class of errors within the CPU.
Vanilla NAS provides reliable performance estimates, as each architecture is evaluated by training from scratch, but it is time-consuming. Zhao et al. [328] noted that one-shot NAS significantly reduces the computational cost by training only one supernetwork that approximates the performance of all architectures in the search space through weight sharing. To mitigate unwanted co-adaptation, they proposed few-shot NAS, which uses multiple supernetworks, called sub-supernets, each covering a different region of the search space. Two-stage NAS needs to sample from the search space during training, which directly affects the accuracy of the final searched model. Uniform sampling has been widely used for simplicity but is agnostic to the model performance Pareto front, which is the primary focus of the search process, thus missing the opportunity to improve model accuracy further. Wang et al. [329] proposed AttentiveNAS, which focuses on improving the sampling strategy.
MBRL algorithms are complex because they combine separate dynamics modeling with subsequent planning algorithms. Consequently, with dozens of hyperparameters and architectural choices, significant human expertise is required before applying them to new problems and domains. Zhang et al. [330] used automatic hyperparameter optimization (HPO) and showed that adjusting hyperparameters during training itself improves performance compared to static hyperparameters fixed for the entire training. Zhang et al. [331] studied how representation learning can accelerate reinforcement learning from rich observations (e.g., images) without relying on domain knowledge. The bisimulation metric quantifies the behavioral similarity between states in a continuous MDP, and the encoder is trained so that distances in the latent space equal bisimulation distances in the state space. For robust, fast, and scalable binary optimization, Panchenko et al. [332] proposed Lightning BOLT, an improved version of the BOLT binary optimizer that significantly reduces processing time and memory requirements while preserving BOLT's effectiveness in improving the performance of the final binary.
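The bisimulation-based encoder objective mentioned above for [331] can be sketched as follows. The sketch assumes that the learned latent dynamics predict a diagonal Gaussian over the next latent state, in which case the 2-Wasserstein distance has a closed form; that assumption, the L1 latent distance, and all function names are illustrative choices and are not taken from the text.

```python
import math


def w2_diag_gaussians(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form 2-Wasserstein distance between two diagonal Gaussians."""
    sq = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    sq += sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return math.sqrt(sq)


def bisim_target(r_i, r_j, next_i, next_j, gamma=0.99):
    """Behavioral distance: reward gap plus discounted transition gap.

    next_i and next_j are (mean, std) pairs predicted by an assumed
    latent dynamics model for the two states being compared.
    """
    (mu_i, sigma_i), (mu_j, sigma_j) = next_i, next_j
    return abs(r_i - r_j) + gamma * w2_diag_gaussians(mu_i, sigma_i, mu_j, sigma_j)


def encoder_loss(z_i, z_j, target):
    """Squared gap between the latent L1 distance and the behavioral target."""
    latent_dist = sum(abs(a - b) for a, b in zip(z_i, z_j))
    return (latent_dist - target) ** 2


target = bisim_target(1.0, 0.0, ([0.0], [1.0]), ([3.0], [5.0]), gamma=0.5)
loss = encoder_loss([0.0, 0.0], [1.5, 2.0], target)
```

Minimizing this loss over pairs of sampled states pushes the encoder to place behaviorally similar states close together, regardless of how different their raw observations look.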

f: OPERATION CONSIDERATION
Empirical risk minimization (ERM) is generally designed to perform well on the mean loss, so the resulting estimator can be sensitive to outliers, generalize poorly, and treat subgroups unfairly. Li et al. [333] addressed these problems through a unified framework called TERM (Tilted Empirical Risk Minimization), which can be tuned to increase or decrease the impact of outliers. They show that TERM can be used in a variety of applications (e.g., enhancing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance).
Fairness and robustness are two important concerns of federated learning systems. Robustness to data and model poisoning attacks and fairness, measured by the uniformity of performance across devices, are competing constraints in statistically heterogeneous networks. Li et al. [334] proposed Ditto, a simple and general framework for personalized federated learning, and developed a scalable solver for it. To understand and improve fault-tolerant training of deep learning recommendation models with partial recovery, Maeng et al. [335] proposed CPR, a partial recovery training system for recommendation models. It relaxes consistency requirements and reduces failure-related overhead.
Unexpected reboots disrupt services running on the hardware and reduce fleet availability. A server reboot is also an important signal that indicates an underlying problem (e.g., a memory leak in a service, a catastrophic hardware failure, a power outage) in a data center. Lin et al. [336] provided a large-scale, near-real-time reboot-monitoring framework that supports machine learning-based anomaly detection and automated root cause analysis for hundreds of server attribute combinations. Xia et al. [337] presented Facebook's risk-focused backbone management strategy to ensure high service performance during the COVID-19 pandemic. It has been shown to achieve high service availability and low path scalability while resiliently withstanding stress tests and handling traffic spikes efficiently.
Oughton et al. [338] anticipated that 5G will remain the preferred technology for wide-area coverage, while Wi-Fi 6 will remain the preferred technology indoors thanks to its much lower deployment costs. To address the problem of packet loss affecting a wide range of applications using Voice over IP (VoIP), Lin et al. [339] proposed prediction and mask training to improve the performance of the CRN framework. The CRN consists of a convolutional encoder-decoder structure and an LSTM (long short-term memory) layer and is suitable for real-time speech enhancement applications. It outperforms a reference system using only LSTM layers in terms of two objective metrics: perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).
Applying homomorphic encryption (HE) to the client-cloud model allows the cloud service to perform inference directly on the client's encrypted data. Although HE satisfies privacy constraints, it introduces enormous computational overhead in current systems. Reagen et al. [340] introduced Cheetah, a set of algorithms and hardware optimizations for server-side HE DNN inference that approach real-time speed. Automatically compiling efficient kernels for vectorized homomorphic encryption is relatively unexplored. Cowan et al. [341] proposed an optimizing compiler, Porcupine, which uses program synthesis to generate vectorized HE code. Porcupine captures the underlying HE operator behavior and automatically reasons about the complex trade-offs imposed by these issues to produce an optimized and validated HE kernel.
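As a minimal sketch of the tilting idea behind TERM, the objective replaces the mean of the per-sample losses with (1/t) log(mean(exp(t * loss))), where t is the tilt. The function name and the example losses below are illustrative; only the form of the tilted objective follows [333]. For t close to 0 the value approaches the ordinary mean (ERM), positive t emphasizes the largest losses, and negative t suppresses them.

```python
import math


def tilted_loss(losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * loss)))."""
    if t == 0:
        return sum(losses) / len(losses)  # ordinary ERM as the limiting case
    # Subtract the largest exponent first for numerical stability.
    shift = max(t * loss for loss in losses)
    total = sum(math.exp(t * loss - shift) for loss in losses)
    return (shift + math.log(total / len(losses))) / t


losses = [0.1, 0.2, 0.3, 5.0]          # the last sample acts as an outlier
mean_like = tilted_loss(losses, 1e-6)   # close to the plain mean, 1.4
max_like = tilted_loss(losses, 50.0)    # dominated by the outlier
min_like = tilted_loss(losses, -50.0)   # outlier almost ignored
```

Choosing the sign of t is therefore the design decision: a negative tilt suits data with corrupted samples, while a positive tilt suits fairness or class-imbalance settings where the worst-off samples should drive training.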
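To make concrete the idea of a server computing on data it cannot read, the following toy sketch evaluates a linear layer (a dot product with plaintext weights) on encrypted inputs. It uses the additively homomorphic Paillier scheme with tiny hard-coded primes purely for illustration; this is an assumption made for brevity, since systems such as Cheetah and Porcupine target lattice-based, vectorized HE schemes, and the key size here offers no security. All function names are hypothetical.

```python
import math
import random


def keygen(p=61, q=53):
    """Toy Paillier keys from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (requires Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """Client side: encrypt an integer m with 0 <= m < n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    """Client side: recover the plaintext from ciphertext c."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def encrypted_dot(n, enc_inputs, weights):
    """Server side: weighted sum of encrypted inputs, never decrypting them.

    Multiplying ciphertexts adds plaintexts; raising a ciphertext to a
    non-negative integer power w multiplies its plaintext by w.
    """
    n2 = n * n
    acc = 1  # a trivial encryption of 0
    for c, w in zip(enc_inputs, weights):
        acc = (acc * pow(c, w, n2)) % n2
    return acc


n, private_key = keygen()
enc_inputs = [encrypt(n, x) for x in [5, 7, 11]]    # client encrypts
enc_result = encrypted_dot(n, enc_inputs, [2, 3, 1])  # server computes
result = decrypt(n, private_key, enc_result)          # client decrypts
```

Even in this toy form, each encrypted multiply-accumulate costs several modular exponentiations, which hints at why HE inference needs the algorithmic and hardware optimizations discussed above.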

6) METAVERSE APPLICATIONS
a: SIMULATION
Carrying suspended payloads is difficult for autonomous aircraft, and rapid in-flight adaptation to payloads with physical properties unknown a priori remains an open question. Belkhale et al. [342] proposed a meta-learning approach that learns to adapt the dynamics model within seconds of post-connection flight data. One way to reason about the safety of a robot is to build a safe set through Hamilton-Jacobi reachability analysis, but because of the long computation time, perfect knowledge of the dynamics is often assumed and the safe set is calculated offline. Shih et al. [343] proposed a new framework for learning safety control policies from simulation and using them to generate online safe sets under uncertain dynamics. As climate change increases the frequency and severity of natural disasters, response organizations need improved data to better understand the dynamics of disaster impacts. Giraudy et al. [344] described how Facebook leverages Location History (LH) data, which enables location-based services (e.g., Nearby Friends, location-based advertising) and social value products (e.g., disaster maps), as part of its disaster mapping initiative.

b: GAME
Diplomacy is a game of shifting alliances that involves both cooperation and competition, a setting in which AI for large-scale games has not yet been successful. Gray et al. [345] described a no-press Diplomacy agent that combines supervised learning on human data with one-step lookahead search via external regret minimization.

c: OFFICE
Ha-Thuc et al. [346] discussed how ranking systems evolve from traditional formulations by incorporating producer value into their goals. Jointly optimizing the ranking function for both consumer and producer value is a new direction and raises many technical challenges. They lay out an end-to-end solution and describe the results of applying it to Facebook Marketplace. Blackshear et al. [347] proposed a method for blockchain asset owners to recover their funds if their private key is accidentally lost or the funds are sent to the wrong address. They achieve this with a Commit, Reveal, Claim, and Challenge smart contract that allows access to funds at addresses where the spend key is unavailable. In auction markets, the concept of pacing equilibrium formalizes the process of applying a coefficient between 0 and 1 that scales down bids in all auctions on behalf of each buyer. Conitzer et al. [348] showed that computing a social welfare-maximizing or revenue-maximizing pacing equilibrium is NP-hard, but presented a mixed-integer program (MIP) that can be used to find equilibria optimizing several related objectives. They use static MIP solutions to improve the results achieved with dynamic pacing algorithms on instances based on real auction markets.

d: SOCIAL
Online social network (OSN) accounts exhibit many demographic attributes (e.g., age, gender, location, and occupation). Onaolapo et al. [349] devised a method to instrument and monitor stolen social accounts to understand the impact of demographic characteristics on attacker behavior. Cybercriminals accessing teen accounts create more messages and posts than cybercriminals accessing adult accounts, while attackers compromising male accounts engage in vandalism, including changing some of the profile information. Cybercriminals accessing female accounts appeared to be engaged in hostile activity. Bailey et al. [350] explored the spatial structure of social networks in the New York metropolitan area, where a significant proportion of city dwellers' connections are with nearby individuals. By examining the importance of transport infrastructure, they document significant heterogeneity in the geographic extent of social networks and show that this heterogeneity is correlated with public transport use. Given that both temporary and permanent content is currently shared on social media platforms, Luria and Foulds [351] discussed their findings on short-term and long-term transitivity as part of social media experiences and the evolving identities of teens and young adults. As long as proportionality is not violated, there are greedy algorithms that involve volunteers and non-adaptive methods that include volunteers with trait-only probabilities, assuming that the distribution of common traits in the volunteer pool is known. For the case where this distribution is not known a priori, Do et al. [352] proposed a reinforcement learning-based approach for online learning.

e: MARKETING
Fernanda [353] proposed a knowledge framework that uses a mix of quantitative and qualitative methods to explore the current state of diversity and representation in online advertising and people's attitudes toward the impact of diversity on digital campaign performance. More frequent and positive portrayals of underrepresented and diverse groups have a significant positive impact on business outcomes. It is important to optimize advertisers' budgets for campaigns across platforms without knowing the value of serving ads to users on multiple platforms. Avadhanula et al. [354] provided a regret algorithm for individual bid spaces. Generalizations of existing MAB algorithms (e.g., Upper Confidence Bound and Thompson Sampling) do not perform well in two applications that many businesses (especially online platforms) face: the intelligent SMS routing problem and the advertising audience optimization problem. Sinha et al. [355] presented a simple variant of explore-then-commit and established near-optimal regret bounds for this algorithm.
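For reference, the textbook explore-then-commit strategy can be sketched as follows. This is the plain baseline, not the variant of [355]: every arm (e.g., an ad audience or an SMS route) is tried a fixed number of times, and the arm with the best average observed reward is then used for all remaining rounds. The function name, the pull callback, and the example reward values are hypothetical.

```python
def explore_then_commit(pull, n_arms, rounds_per_arm, horizon):
    """Plain explore-then-commit for a multi-armed bandit.

    pull(arm)      -- callback returning the observed reward of one pull
    rounds_per_arm -- number of exploration pulls spent on each arm
    horizon        -- total number of pulls available
    """
    if horizon < n_arms * rounds_per_arm:
        raise ValueError("horizon too short for the exploration phase")

    totals = [0.0] * n_arms
    history = []

    # Exploration phase: sample every arm the same number of times.
    for arm in range(n_arms):
        for _ in range(rounds_per_arm):
            totals[arm] += pull(arm)
            history.append(arm)

    # Commit phase: play the empirically best arm until the horizon ends.
    best = max(range(n_arms), key=lambda a: totals[a] / rounds_per_arm)
    total_reward = sum(totals)
    for _ in range(horizon - len(history)):
        total_reward += pull(best)
        history.append(best)

    return best, history, total_reward


# Deterministic stand-in for noisy click-through feedback (assumed values).
true_means = [0.1, 0.9, 0.5]
best, history, total_reward = explore_then_commit(
    pull=lambda arm: true_means[arm], n_arms=3, rounds_per_arm=2, horizon=10
)
```

The length of the exploration phase is the key trade-off: too few pulls risk committing to the wrong arm, while too many waste budget on arms already known to be poor.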

f: EDUCATION
Because simulation makes it possible to train a large number of robots in parallel and provides rich data, Truong et al. [356] trained robots in simulation before deploying them. They proposed bidirectional domain adaptation (BDA), an approach that bridges the sim-vs-real gap in both directions for point goal navigation. They use real2sim to bridge the visual domain gap and sim2real to bridge the dynamics domain gap.

7) DISCUSSION AND OPEN CHALLENGES
As mentioned above, Facebook research is a research group with a lot of interest in the Metaverse, as shown in Table 2. It has broad elemental technologies for natural language, vision, dialogue, and embodiment. It also has a foundation and experience that can be expanded into a Metaverse platform through the Facebook social network service. Essential models for the Metaverse are Blenderbot [286], based on ParlAI; Detectron 2 [357], capable of fast visual recognition; and Habitat [300], which operates an agent from an egocentric point of view. Facebook provides services by launching its own Metaverse platforms, Horizon and Infinite Office. The virtual currency Diem not only serves as a bridge between reality and the Metaverse but also leads to a sustainable ecosystem.

VI. DISCUSSION AND OPEN CHALLENGES
In this section, we discuss current problems and technologies needed in the future for Metaverse in the aspect of influence, limitations, and open challenges.

A. METAVERSE INFLUENCE FOR USER AND SOCIETY
1) SENTIMENT AND SOCIAL INFLUENCE
People can lead a stable cyber life in the Metaverse because they can distinguish between real life and virtual life, just as viewers did not feel confused while watching the movie Avatar. However, because avatar design has emotional barriers, users may feel a sense of rejection towards an avatar if it cannot overcome the Uncanny Valley, as in the movie Alita.
Memories are beautiful not because they are exquisitely crafted but because they are traces of time that cannot be returned to. However, the Metaverse recreates the past, and users can make different choices, giving people psychological stability and emotional recovery.
While the social impact depends on the ecosystem, it is important to consider many aspects of social impact, including potential exacerbation of social inequities, computation demands, economics, legality, and ethics. In addition, limited resources in the real world bring excessive competition and social side effects. The Metaverse, however, has an advantage over the real-world system in terms of item production and resources. It can draw on resources that can be created indefinitely online rather than on compensation deducted from the limited resources of the real world. This differs from the reward system in the real world, where one must give up one thing to get another, so it can reduce competition between users and is an opportunity to develop for the common good.

2) USER PARTICIPATION AND BENEFIT
In the Metaverse, users are less limited by time and space and can exist in multiple places through avatars, so the communication style changes from 1:N broadcasting to 1:1 interaction. Avatars in the Metaverse provide a way to replace and complement users. The Metaverse is most effective in places where experiential education is difficult (e.g., underdeveloped areas in Africa). In addition, high-contrast vision, long-distance vision, and volume augmentation for people with visual difficulties enable people with disabilities to live the same lives as others in the Metaverse.
Mask effects (i.e., hiding shape, color, and race) are also noteworthy, providing a better-than-realistic user engagement experience. Origin, gender, skin color, and appearance can be prejudicial when it comes to debates, psychiatric group therapy, and jury attendance in court. In this case, the avatar's neutral appearance is a good example of the Metaverse's social influence, as it allows for fairer opinions and participation in social consensus without prejudice.
In order to maintain a sustainable social ecosystem, user participation is important, so it is essential to provide fashion, games, and events on a regular and long-term basis. For example, in ZEPETO, users take selfies, solve quizzes, create dramas, and have fun designing costumes. In Metaverse sports and gun games, it is possible to induce and increase user participation by providing a third-person view rather than a first-person spectator mode.

3) MORE APPLICATIONS
Metaverse can significantly contribute to a multitude of applications and domains. However, for sustainable Metaverse applications, we must consider interpretability, security, privacy, societal function, and ethics. More Metaverse applications help people work smart (e.g., telemedicine, layer separation and tagging for complex organ surgery, commute simulation, remote problem identification, mapping to real environments without the need to find manuals). Metaverse applications make living easier (e.g., public transportation simulation for seniors, intuitive simplification of digital input interfaces for the elderly, immersive education that is more effective than video, counseling on personal issues with a masked avatar). Metaverse applications also reduce the need for physical objects and space (e.g., non-shared private messaging in the desired place, providing information through overlay display on offline objects, virtual screens, store inventory trends, sales volume display, virtual displays for IoT devices, and smart home applications).

B. METAVERSE LIMITATION
1) SUSTAINABILITY
Many advantages and applications have been described, but the sustainability of Metaverse is important. When the world's population is maintained at a certain level, it can grow and fix problems, but when the number of users accessing decreases, the world cannot be maintained. In the concept of life logging, the sustainability of various social relationships is more important than each event and task (e.g., games and simulations). In order to maintain continuity, a connection relationship (e.g., Metaverse access, messenger) must be maintained continuously in a relatively low-spec mobile device that can always be accessed. Using an episodic memory that effectively manages the user's log allows the user to feel the comfort and advantage of accessing Metaverse for a long time. Storing all experiences in memory storage has limitations in utilization and capacity, so memory research on effectively finding and reusing important episodes is needed. In addition, latecomer platforms should consider import/export methods that bring the existing user experience and provide continual usability.
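The need to keep only important episodes in limited storage and to find them again later can be illustrated with a minimal sketch. It assumes that each logged episode comes with an importance score (how such a score is obtained is left open here) and uses naive word overlap for retrieval; the class name, methods, scores, and example episodes are all hypothetical.

```python
import heapq


class EpisodicStore:
    """Bounded episode log that evicts the least important episode first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (importance, insertion order, text)
        self._order = 0

    def __len__(self):
        return len(self._heap)

    def add(self, text, importance):
        entry = (importance, self._order, text)
        self._order += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            # Inserts the new entry and drops whichever entry is least important.
            heapq.heappushpop(self._heap, entry)

    def recall(self, query, k=1):
        """Return up to k stored episodes sharing words with the query."""
        words = set(query.lower().split())
        scored = [
            (len(words & set(text.lower().split())), importance, text)
            for importance, _, text in self._heap
        ]
        scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
        return [text for overlap, _, text in scored[:k] if overlap > 0]


store = EpisodicStore(capacity=2)
store.add("visited the virtual museum with Mina", importance=0.9)
store.add("adjusted avatar hat color", importance=0.1)
store.add("won the racing tournament", importance=0.8)
recalled = store.recall("museum trip")
```

A fixed capacity keeps the store usable on the low-spec, always-connected mobile devices discussed above, at the cost of permanently forgetting low-scoring episodes.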

2) HARDWARE AND SOFTWARE LIMITATIONS
In terms of hardware sensors, while the Metaverse closely resembles the real world, some sensations are better felt in real life (e.g., daylight, smell, stickiness, slipperiness, wind). In terms of software, programs developed in the Metaverse without coding are used as a basis for high compatibility in the Metaverse world. However, as a program becomes more sophisticated, this approach reaches its limits in complex applications.
In terms of content, dialogue is developing into a longer and more natural form of conversation based on persona, but it is still limited as a sustainable lifelong-learning conversation solution with various perspectives and philosophies beyond exciting conversations. Humans basically have multiple personas, which are expressed differently depending on time and place. Therefore, it is necessary to study more complex persona modeling that takes the situation into account. From this point of view, environments and events are important to show the various personas of users and NPCs. For example, in the drama Westworld, avatars perform various actions in the Metaverse, freed from the constraints and conditions of reality. Therefore, NPCs in the role of residents of the Metaverse must be able to cope with various unexpected situations because the allowable range of scenarios is wide. In addition, the persona's design is important for the NPC to appear as if it makes choices with its own persona and will. NPCs can take the form of humans, other living beings (e.g., horse, dog, cat), and non-living objects (e.g., desk, clock).

3) DEVELOPMENT HURDLE
In Metaverse, since it is a comprehensive solution in which various tasks occur simultaneously in a complex form (e.g., multi-mode and multi-task), there is a lot of work to study for individuals to start development without experience. From the perspective of Metaverse development, there are few online resources to learn, especially for novice developers. There is not enough information for practical details to make complex and realistic implementations (e.g., object selection, conditional actions, user storyboards with scene flow, teleportation between scenes, movement, and dialogue). For this reason, a collaborative system (i.e., a platform and developer community) for an individual developer is important to co-develop without designing the entire system. As for the platform, a commercial platform (e.g., Roblox) with favorable maintenance and an open source-based platform (e.g., Unity) with various possibilities are considered. Since the scope of the technology target of Metaverse is wide, it is necessary for the developer community to separate threads based on well-organized taxonomy and maintain a group of experts who lead in each technical domain.

C. OPEN CHALLENGES
1) MEDIUM SELECTION
AR uses lightweight devices suitable for short experiences, whereas VR needs relatively heavy and expensive devices for long experiences. Some approaches switch between AR and VR in one piece of hardware to combine the advantages of both. Although this method has the advantage of using AR and VR alternately, the device becomes expensive and heavy compared to a single-mode device. Alternatively, holograms are not a popular technology in the Metaverse, but they have potential.
Eye-worn lenses are another input method utilized in the Metaverse (e.g., Maya Lenz, Mirage, Mojo Lens). The lens analyzes the user's information by tracking the direction of eye movements, focus, blinks, and winks. For example, Maya Lenz is a wearable device in the form of a contact lens, and Mirage is a way of handling disliked content by replacing it with positive alternatives. Mojo Lens is used in conjunction with an assistive device worn around the neck to seamlessly bring a variety of visual information (e.g., data feeds, people's profiles, video calls, translations, notifications) into the wearer's vision.

2) ETHICS AND SECURITY
Privacy and security are critical issues because the Metaverse collects behavioral data that is more detailed than user conversations and internet history. Avatar two-factor authentication and protection of transmitted data are essential, and we need to be more vigilant with regard to crimes that may occur in the Metaverse. In addition, the surveillance actions (e.g., monitoring of inappropriate chat rooms, censorship, and follow-up review) required by the surge in users suggest that the Metaverse needs organizations that play the same role as police and government in the real world. There are some instances where people who are exemplary in the real world commit crimes in the Metaverse under the cover of online anonymity. The norms and restrictions of the Metaverse may differ from those in the real world because of its post-national character and degrees of freedom. Most users familiar with the Metaverse belong to the young generation, which holds relatively diverse social ideas. It is necessary to build a Metaverse with a worldview and ethical consciousness in which various avatars can live, rather than a Metaverse as a mere physical space.

3) INTERDISCIPLINARY RESEARCHES
Since the Metaverse consists of a world that changes in real time for a large number of users and NPCs, cross-disciplinary research is necessary. As an example of cross-disciplinary research, the Metaverse leverages knowledge widely used in cognitive science (e.g., episodic memory, intrinsic motivation, and theory of mind) to provide more immersive and sustainable services. Episodic memory brings events that occurred long ago into the present conversation and thereby induces natural conversation. Intrinsic motivation allows an agent to perform multiple tasks rather than a single task consistently. Theory of mind has the advantage of deepening conversation by understanding things from the other person's point of view.
Other examples are the social sciences, psychology, and economics. An environment in which a certain number of members live using masked avatars differs from how society currently operates. Neuroscience and psychological approaches from psychotherapy are used to understand humans deeply and maintain a Metaverse. The virtual currency of the Metaverse differs from virtual currency in the real world in that it is used as a real product in a virtual environment, so it can become a new variable from the point of view of economics and develop into a fused form.

VII. CONCLUSION
In this study, we analyzed research on concepts related to the Metaverse, namely Metaverse, avatar, and XR. After that, we comprehensively dealt with the three components necessary for the Metaverse (i.e., hardware, software, and contents). We also reviewed the latest trends in Metaverse approaches (i.e., user interaction, implementation, and application) that are currently available and will be necessary in the future. Interacting as part of the story is more important than seeing well-formed storytelling and immersive visual effects. We applied the taxonomy to three famous Metaverse examples, Ready Player One, Roblox, and Facebook Research, from three domains (i.e., movie, game, and research). Finally, we discussed the aspects of social influence, limitations, and open challenges.
From a future-oriented perspective, Facebook research tries to input text using the output of the peripheral nervous system and a brain-computer interface. As a direct connection method, Neuralink is a way to enhance communication with devices by implanting a chip in the human brain. At the current stage of development, it is possible to directly stimulate a specific part of the brain and to read simple types of EEG. However, the continued development of brain-computer interfaces and Neuralink may lead to a form that gives an experience in the Metaverse that is difficult to distinguish from reality (e.g., connecting through the spine as in The Matrix).

APPENDIX
TABLE 3. List of main acronyms.
VOLUME 10, 2022