ILLIXR: An Open Testbed to Enable Extended Reality Systems Research

We present the Illinois Extended Reality testbed (ILLIXR), the first fully open-source XR system and research testbed. ILLIXR enables system innovations with end-to-end co-designed hardware, compiler, OS, and algorithms, driven by end-user perceived Quality-of-Experience (QoE) metrics. Using ILLIXR, we provide the first comprehensive quantitative analysis of performance, power, and QoE for a complete XR system and its individual components. We describe several implications of our results that propel new directions in architecture, systems, and algorithms research for domain-specific systems in general, and XR in particular.

With the end of conventional CMOS scaling, domain-specific acceleration has emerged as a key architectural technique to meet the requirements of emerging applications. A parallel trend is the rise of new application domains increasingly deployed on resource-constrained edge devices, where they interface directly with the end-user and the physical world (e.g., robotics, virtual reality, and autonomous vehicles). In response to these trends, our research conferences have seen an explosion of papers on efficient accelerators.
Truly achieving the promise of efficient domain-specific edge computing, however, will require architects to broaden their portfolio from individual domain-specific accelerators to domain-specific systems. Such systems may consist of multiple interacting subdomains (or applications), requiring multiple accelerators that interact with each other to collectively meet end-user demands. Architects must also be cognizant of the programming stack that must grapple with this heterogeneity and the runtime that must manage the heterogeneous architectural resources. Meeting the end-user quality demands of such systems will likely require co-designing the hardware, compiler, and runtime along with the application.
The principled design of such systems requires a new approach to architecture research, based on a generalizable and scalable science for designing end-to-end quality-driven and end-to-end hardware-software-application co-designed domain-specific systems. The foundation for such a science rests upon the availability of system testbeds that can enable an understanding of the end-to-end requirements of such application domains, and enable prototyping and experimentation of innovative techniques to meet these requirements.
We argue that virtual, augmented, and mixed reality, collectively referred to as extended reality (XR), is a domain of increasing societal importance that needs architects to embrace such research, and provides the foundation to enable such research.

CASE FOR XR AS A DRIVING DOMAIN
1) Pervasive: XR is envisioned to be the next interface for most of computing and to transform many aspects of our lives; e.g., teaching, medicine, science, and entertainment.
2) [...] researchers a potentially rich space to innovate. Table 1 summarizes various system-level quality-related metrics for state-of-the-art XR devices and the aspiration for ideal future devices.
3) Multiple and diverse components: XR involves several diverse subdomains (e.g., vision, robotics, graphics, machine learning, optics, audio, and video), making it challenging to design a system that executes each one well within available resources.
4) Full-stack implications: The combination of real-time constraints, complex interacting pipelines, and ever-changing algorithms creates a need for full-stack optimizations involving the hardware, compiler, operating system, and algorithm.
5) Flexible accuracy for end-to-end user experience: The end user, being a human with limited perception, enables a rich space of accuracy-aware resource tradeoffs, but requires the ability to quantify impact on end-to-end experience.

CASE FOR AN XR SYSTEM TESTBED
A key obstacle to architecture research for XR is that there are no open-source benchmarks providing the entire XR workflow. As we move from the era of general-purpose, homogeneous cores-on-chip to domain-specific, heterogeneous system-on-chip architectures, benchmarks need to follow the same trajectory. While previous benchmarks comprising suites of independent applications (e.g., PARSEC, Rodinia, SPEC, and SPLASH) sufficed to evaluate general-purpose architectures, there is now a need for a full-system benchmark methodology, better viewed as a full-system testbed, to design and evaluate system-on-chip architectures. Such a methodology must bring together the diversity of components that will interact with each other in the domain-specific system and also be extensible to accept future new components. An XR full-system benchmark or testbed will continue to enable traditional research for accelerating a given XR component with conventional PPA metrics, but will additionally allow evaluations of the end-to-end system impact. More importantly, the integrated system will enable new research that co-designs acceleration of its multiple, diverse, and demanding components with the full-system stack, driven by end-to-end user experience.
We present 1) the Illinois Extended Reality testbed (ILLIXR), 1 the first fully open-source XR system and testbed, consisting of state-of-the-art XR components orchestrated by a modular and extensible communication interface and runtime; 2) the first detailed quantitative characterization of performance, power, and Quality-of-Experience (QoE) metrics for a complete XR system on desktop and embedded class machines; and 3) several resulting future directions for architecture and systems research that are enabled by ILLIXR.
Figure 1(a) presents ILLIXR. It contains three interacting pipelines (perception, visual, and audio), each with state-of-the-art components that belong to a modern XR runtime (e.g., Meta VR) and are shipped with an XR headset. The ILLIXR components interact with each other through the ILLIXR communication interface and runtime illustrated in Figure 1(b). ILLIXR supports the emerging and increasingly popular Khronos OpenXR API for XR applications (e.g., games running on a game engine, which in turn runs on

Pipelines
The perception pipeline translates the user's physical motion and surrounding world into information understandable to the system. It uses input from sensors [e.g., cameras and inertial measurement units (IMUs)] and consists of components such as visual inertial odometry (VIO) for obtaining low-frequency but precise estimates of the user's pose (the position and orientation of their head), IMU integration for high-frequency pose estimates, eye tracking for the user's gaze, and scene reconstruction to build a 3-D model of the user's surroundings. The visual pipeline obtains the user's new pose from the perception pipeline and the frame from the rendered application and produces the final display, after compensating for rendering latency and optical distortions. ILLIXR supports computational holography, but until holographic displays are more widely available, ILLIXR displays its frames on a standard LCD monitor or the North Star AR headset (an open-source headset display). The audio pipeline generates 3-D spatial audio using the pose from the perception pipeline.
[Figure caption fragment: ... before its next invocation. The right-hand side illustrates the components' interactions through their synchronous (solid) and asynchronous (dashed) dependencies.]

Runtime and Communication Framework
An XR system is unlikely to follow the idealized schedule due to shared and constrained resources and variable running times. Thus, an explicit runtime is needed for effective resource scheduling while maintaining intercomponent dependencies, resource constraints, and QoE. The ILLIXR runtime schedules resources while enforcing dependencies among components, in part deferring to the native (Linux) kernel and GPU driver. For extensibility and modularity, ILLIXR provides a well-defined communication framework, with components implemented as plugins. The framework is structured around event streams, each supporting writes and asynchronous and synchronous reads, implemented via copy-free shared memory for efficiency. Plugins are distributed as shared-object files and, for modularity, are given access to other plugins only through event streams. A plugin is interchangeable with another as long as it complies with the event-stream interface. Researchers can test alternative implementations of a plugin without needing to reinvent the rest of the system. Development can iterate quickly because plugins are compiled independently.
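To make the event-stream idea concrete, the following is a minimal sketch in Python of a single stream that supports writes, asynchronous reads (return the latest event immediately), and synchronous reads (block until a newer event arrives). It is not ILLIXR's actual interface: the class name `EventStream`, its methods, and the sequence-number scheme are all hypothetical, and Python object references merely stand in for copy-free shared memory.

```python
import threading
from typing import Any, Optional, Tuple


class EventStream:
    """Hypothetical single-topic event stream (illustrative; not ILLIXR's API)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._latest: Optional[Any] = None
        self._seq = 0  # number of events written so far

    def write(self, event: Any) -> None:
        # Publish a reference to the event; readers see the same object (no copy).
        with self._cond:
            self._latest = event
            self._seq += 1
            self._cond.notify_all()

    def read_async(self) -> Optional[Any]:
        # Asynchronous read: return the latest event right away, or None if empty.
        with self._cond:
            return self._latest

    def read_sync(self, last_seen: int,
                  timeout: Optional[float] = None) -> Tuple[int, Any]:
        # Synchronous read: block until an event newer than `last_seen` exists.
        with self._cond:
            if not self._cond.wait_for(lambda: self._seq > last_seen, timeout):
                raise TimeoutError("no new event arrived before the timeout")
            return self._seq, self._latest


# Usage: one "plugin" thread writes a pose while another waits for it.
pose_stream = EventStream()
writer = threading.Timer(0.05, pose_stream.write, args=({"position": (0.0, 1.6, 0.0)},))
writer.start()
seq, pose = pose_stream.read_sync(last_seen=0, timeout=2.0)
writer.join()
```

In this sketch a consumer that must process every update would use `read_sync`, while a consumer that only needs the freshest value would poll `read_async`; because plugins touch only the stream, either side can be swapped out independently.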

Metrics
ILLIXR provides several metrics for evaluation. In addition to conventional performance metrics, such as per-component frame rate and execution time, ILLIXR reports several QoE metrics, such as motion-to-photon latency (a standard measure of lag between user motion and image updates), and SSIM and FLIP for image quality. While ILLIXR currently implements SSIM and FLIP, its pose and image collection infrastructure is generic and extensible, enabling evaluation of other (evolving) metrics for image or video quality. This is important as such metrics for XR are still an active area of research.
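As an illustration of the kind of image-quality metric involved, the sketch below computes SSIM over a single window covering the whole image, using the standard SSIM formula with the usual constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for dynamic range L. This is a simplification under stated assumptions: SSIM is normally computed over local windows and averaged, and nothing here reflects how ILLIXR itself implements the metric. The function name `global_ssim` and the flattened grayscale input format are hypothetical.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                dynamic_range: float = 255.0) -> float:
    """SSIM over one window spanning the whole (flattened, grayscale) image."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("images must be non-empty and have the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variances and covariance of the pixel intensities.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Usage: compare a reference frame against a degraded one (toy 2x2 images).
reference = [0.0, 64.0, 128.0, 255.0]
degraded = [10.0, 60.0, 120.0, 240.0]
score = global_ssim(reference, degraded)
```

A score of 1 means the two images are identical under this measure; lower values indicate larger structural differences.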

EXPERIMENTAL METHODOLOGY
We chose to run ILLIXR on two hardware platforms, with three total configurations: 1) a high-end desktop platform; 2) an NVIDIA Jetson AGX Xavier in high-performance mode; and 3) the same Jetson in low-power mode. Together, these represent a broad spectrum of power-performance tradeoffs and current XR devices. Table 2 summarizes the key parameters for ILLIXR that require system-level tuning, the range available in our system for these parameters, and the final value we chose for our experiments.
For the perception pipeline, we connect a ZED Mini camera to the abovementioned platforms via a USB-C cable. For the visual pipeline, we run representative VR and AR applications (Sponza, Materials, Platformer, and a custom AR demo application with sparse graphics) on the Godot game engine on ILLIXR; these applications interact with the perception pipeline to provide the visual pipeline with the image frames to display. ILLIXR can display the (corrected and reprojected) images on both a desktop LCD monitor and a North Star AR headset connected to the abovementioned hardware platforms. For the audio pipeline, we use prerecorded input. Figure 2(a) shows the setup with a North Star AR headset (used only as a display) and a ZED Mini camera attached to a backpack PC running ILLIXR. Figure 2(b) shows a frame seen by the user. Figure 3 shows a development setup with a desktop LCD monitor and a simple application; it shows the stereoscopic images that ILLIXR presents to the user as well as a third-person debug view showing the virtual and physical environment.
The end-to-end integrated ILLIXR configuration in our experiments uses the components shown in Figure 1(a) except for scene reconstruction, eye tracking, and hologram. The OpenXR standard only recently added an interface for an application to use the results of scene reconstruction and eye tracking. We, therefore, do not have any applications available to use these components in an integrated setting. Although we can generate holograms, we do not yet have a holographic display.

RESULTS AND IMPLICATIONS FOR ARCHITECTURE AND SYSTEMS RESEARCH
Architects have embraced specialization, but most research focuses on accelerators for single programs. ILLIXR is motivated by research for specializing an entire domain-specific system. Our results characterize the end-to-end performance, power, and QoE of an XR device, exposing new systems research opportunities and demonstrating ILLIXR as a unique testbed to enable exploration of these domain-specific systems, as follows. (The results are described in more detail in Huzaifa et al.'s work. 1)

Performance, Power, and QoE Gaps
Figures 4-6 quantitatively show that, collectively, there is a gap of several orders of magnitude in performance, power, and QoE between current representative desktop and embedded class systems and the goals in Table 1. The gap will be further exacerbated with higher fidelity displays and more components for a more feature-rich XR experience (e.g., scene reconstruction, eye tracking, hand tracking, and holography).
While the presence of these gaps itself is not a surprise, we provide the first such quantification and analysis. This provides insights into directions for architecture and systems research and demonstrates ILLIXR as a one-of-a-kind testbed that can enable such research.
To close the aforementioned gaps, all components have to be considered together, even those that may appear relatively inexpensive in terms of one metric at first glance. Moreover, there is a diversity of tasks within and across components, and no single task dominates. It is likely impractical to build a unique accelerator for every task given the large number of tasks and the severe power and area constraints for XR devices: leakage power will be additive across accelerators, and interfaces between these accelerators and other peripheral logic will further add to this power and area (we identified 27 tasks across all components, and expect more tasks with new components). At the same time, our results show that a number of common primitives exist across components, making the case for shared hardware across components.
Figure 5 shows that addressing the power gap requires considering system-level hardware components, such as display and other input/output (I/O), including numerous sensors, as Sys power constitutes more than 30% of total power on Jetson-LP. While we do not measure individual sensor power, it is included in Sys power on the Jetson. This motivates research in unconventional architecture paradigms, such as on-sensor computing to save I/O power; e.g., the image processing tasks of VIO can be moved to the sensor so that only detected features and not entire camera frames are sent from the camera to the System-on-Chip (SoC).
Figure 4(e) shows large variability in per-frame processing times in many cases, due either to the inherent input-dependent nature of the component (e.g., VIO) or to resource contention from other components. This variability poses challenges to, and motivates research directions in, scheduling and the partitioning and allocation of shared hardware resources. The components exhibit a variety of memory access patterns, which further complicates the design of shared resources. For instance, several components are memory bandwidth bound while others are more sensitive to memory latency, making it challenging to design a memory system that supports both types of components efficiently.

RESEARCH ENABLED BY ILLIXR
This work provides the research community with a novel, one-of-a-kind infrastructure and foundational quantitative analyses to enable a new era of research in domain-specific systems in general and XR systems in particular. We expect this work to shape a broad and long-term research agenda for the science of designing end-to-end quality-driven and end-to-end hardware-software-application co-designed domain-specific systems. ILLIXR and the research it can enable have the potential to transform XR in particular, an emerging domain of critical importance that is likely to transform all endeavors of human activity. We highlight several research projects already using or considering using ILLIXR.

Representing Heterogeneous Parallelism in Software
The compiler intermediate representation (IR) is critical for performance and portability. Traditional IRs, such as LLVM, do not capture the parallelism or heterogeneity that is prevalent in current hardware architectures. HPVM 6 is a compiler IR that uses a hierarchical dataflow graph (with side effects) to capture several types of parallelism: task, streaming, nested, data, and fine-grained vector parallelism. The hierarchical dataflow graph nodes naturally and flexibly map to potentially heterogeneous compute elements, and the edges represent communication between the elements. There is an ongoing effort to compile all of ILLIXR to HPVM, providing a rich representation to develop techniques for a compiler and runtime to perform automated accelerator selection, software and hardware approximations, and local and distributed resource mapping for XR and other similar complete domain-specific systems (e.g., Zacharopoulos et al.'s work 7 ).

Automated Selection and Generation of Accelerators
ILLIXR is being used to develop techniques for automated accelerator selection for complex domain-specific system workloads, with tight performance, power, and area budgets. In initial work, the Trireme tool uses the HPVM IR representation to guide a design space exploration that considers the exposed loop, task, and streaming parallelism to select accelerators for a subset of ILLIXR components. 7 Moving forward, we plan on developing compiler analyses and transformations to determine common compute patterns across components to enable accelerator reuse, and automatically generate accelerator hardware and software.

Accelerator Communication Interface
Future SoCs will require multiple heterogeneous accelerators running in parallel to meet QoE of domains, such as XR. A shared memory programming model would be able to alleviate the need for programming complex DMA engines to explicitly orchestrate data movement between these accelerators. However, how to design cache coherence protocols, memory consistency models, and the memory and communication fabric of the SoC are open questions regarding the implementation of shared memory in a heterogeneous environment. To answer these questions, we are building upon the Spandex 8 heterogeneous coherence interface for coherence specialization, and using ILLIXR's diverse range of communication patterns to drive the design.

QoE-Driven Automated Approximation Selection
Current approximation techniques typically look at component-level or subsystem metrics, whereas it is the end-to-end QoE that ultimately matters in many emerging domains. Determining whether and how a certain approximation will impact the end-to-end QoE, how errors will compose across components, and how to trade off accuracy among components are all unanswered questions.

QoE-Driven Scheduling
In QoE-driven emerging domains, tasks have to be scheduled and resources managed to meet one or more QoE metrics. ILLIXR is well-suited for studying this phenomenon, as ILLIXR's task graph is a directed acyclic graph (DAG) with multiple critical paths and QoE constraints. ILLIXR is being used to develop a scheduler that automatically determines an optimal frame rate of each component and schedules components to meet QoE for a given hardware mapping. In the future, we aim to study the effects of approximations on the schedule, and determine minimum acceptable component frequencies in order to reduce resource utilization and power consumption.
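One basic quantity such a scheduler has to reason about is the critical path of the task DAG, i.e., the dependency chain with the largest total execution time. The sketch below computes it with a topological traversal, using Python's standard `graphlib` module (Python 3.9+). It is only an illustration of the DAG analysis, not the scheduler being developed: the task names are loosely borrowed from components mentioned in the text, while the edges and the timings are invented for the example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple


def critical_path(exec_ms: Dict[str, float],
                  deps: Dict[str, List[str]]) -> Tuple[float, List[str]]:
    """Return (length, path) of the longest execution-time path in a DAG.

    `deps` maps each task to the tasks it depends on (its predecessors).
    """
    graph = {task: deps.get(task, []) for task in exec_ms}
    finish: Dict[str, float] = {}  # earliest finish time of each task
    prev: Dict[str, str] = {}      # predecessor on the longest path
    for task in TopologicalSorter(graph).static_order():
        # The task can start once its slowest predecessor has finished.
        slowest = max(graph[task], key=lambda p: finish[p], default=None)
        start = finish[slowest] if slowest is not None else 0.0
        finish[task] = start + exec_ms[task]
        if slowest is not None:
            prev[task] = slowest
    end = max(finish, key=lambda t: finish[t])
    path = [end]
    while path[-1] in prev:  # walk back from the last task to a source
        path.append(prev[path[-1]])
    return finish[end], path[::-1]


# Hypothetical task graph; all timings (in ms) are made up for illustration.
exec_ms = {"camera": 2.0, "imu": 0.5, "vio": 15.0,
           "integrator": 1.0, "app": 8.0, "reprojection": 1.5}
deps = {"vio": ["camera", "imu"], "integrator": ["imu", "vio"],
        "app": ["integrator"], "reprojection": ["app", "integrator"]}
length, path = critical_path(exec_ms, deps)
print(length, path)
# prints: 27.5 ['camera', 'vio', 'integrator', 'app', 'reprojection']
```

A real XR task graph has several such paths, each tied to a different QoE constraint, so a scheduler would evaluate them per constraint rather than as a single global bound.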

Edge-Cloud Resource Partitioning
At the moment, all ILLIXR components run on the edge device. However, running all components on the edge device is neither feasible, due to the limited power budget, nor required, since some components are inherently latency tolerant and can be offloaded to an edge server or the cloud. We are evaluating which components can be offloaded, and how, in a collaboration integrating ILLIXR with FleXR, an edge-assistance framework for XR. 10 Our end goal is to develop a methodology for offloading-driven hardware-software-algorithm co-design and to perform real-time offloading decisions based on device resource usage and network capabilities.
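The latency-tolerance argument can be sketched as a simple feasibility check: a component is a candidate for offloading only if the network round trip, the data transfer, and the remote execution together fit within the staleness the component can tolerate. The Python sketch below is a deliberately simplified model under that assumption; the `Component` fields, the function `can_offload`, and all numbers are hypothetical and do not describe FleXR or the methodology under development.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    latency_budget_ms: float  # assumed tolerable delay before QoE suffers
    remote_exec_ms: float     # assumed execution time on the server
    payload_kb: float         # kilobytes transferred per invocation


def can_offload(c: Component, rtt_ms: float, bandwidth_mbps: float) -> bool:
    """True if the offloaded round trip fits within the component's budget."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    # kilobytes * 8 = kilobits; 1 Mb/s moves 1 kilobit per millisecond.
    transfer_ms = c.payload_kb * 8.0 / bandwidth_mbps
    return rtt_ms + transfer_ms + c.remote_exec_ms <= c.latency_budget_ms


# Two made-up components evaluated on a made-up 30 ms RTT, 100 Mb/s link.
tolerant = Component("latency_tolerant_task", 500.0, 40.0, 1000.0)
critical = Component("latency_critical_task", 5.0, 1.0, 10.0)
decisions = {c.name: can_offload(c, rtt_ms=30.0, bandwidth_mbps=100.0)
             for c in (tolerant, critical)}
```

A real-time policy of the kind described above would additionally account for device resource usage and for changing network conditions, which this static check ignores.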

Multiparty XR
Currently, ILLIXR supports a single end-user device. The full potential of XR is in multiuser applications, such as telepresence. We are expanding ILLIXR to support networked multiparty applications in collaborations integrating ILLIXR with FleXR and with ARENA, a distributed XR system that enables multiple users to have shared XR experiences. 11 The end result will be a full-stack multiparty XR system that will be fully open source, enabling the study of distributed XR applications and allowing for cross-stack optimizations.

Other Research Directions
There are several other projects using or considering ILLIXR, including for AR security and privacy, low-latency networks for XR, use of integrated sensing and compute through 2.5-D and 3-D packaging techniques, QoE metrics for XR, cross-component co-design driving novel XR algorithms, and simulation techniques to use ILLIXR to drive novel architectures. An independent group has already published a paper in a top conference using ILLIXR for hologram acceleration. 12

ILLIXR CONSORTIUM
We have worked with many companies and academics to develop ILLIXR. The culmination of these interactions was the recent launch of the industry-backed ILLIXR consortium. b The consortium aims to democratize XR systems research, development, and benchmarking by: 1) evolving ILLIXR into a consensus reference open-source end-to-end XR system testbed backed by industry and the academic community; 2) providing a reference benchmarking methodology for XR systems, including reference system configurations, applications, datasets, and metrics; and 3) creating a community where the multidisciplinary XR systems R&D stakeholders come together.
The consortium already has several industry members (Arm, Meta Reality Labs, Micron, NVIDIA, and Project North Star), and we are in discussions with others. The advisory board also consists of several academics spanning the cross-disciplinary boundaries required for such work (in addition to architects).

a A rendering technique that renders the center (fovea) of the image at full fidelity and the periphery at lower fidelity to save time and energy.
b [Online]. Available: https://illixr.org/