Beyond Von Neumann in the Computing Continuum: Architectures, Applications, and Future Directions

The article discusses emerging non-von Neumann computer architectures and their integration in the computing continuum to support modern distributed applications, including artificial intelligence, big data, and scientific computing. It provides a detailed summary of available and emerging non-von Neumann architectures, which range from power-efficient single-board accelerators to quantum and neuromorphic computers, and explores their potential for revolutionizing data processing and analysis in various societal, scientific, and industrial fields. The article further analyzes the most widely used classes of distributed applications and discusses the difficulties in executing them over the computing continuum, including communication, interoperability, orchestration, and sustainability issues.


Introduction
Computing technologies have profoundly impacted society, fundamentally transforming how people communicate and interact with their environment. With the rise of the Internet of Things (IoT) and the continuous evolution of modern communication technologies, digital integration has become an essential aspect of our lives. Despite these advancements, the fundamental principles of computer architecture have not changed since the introduction of John von Neumann's stored-program concept for the IAS machine in 1952.
Therefore, traditional computing architectures based primarily on the stored-program concept are colloquially referred to as the von Neumann architecture. In the von Neumann architecture, programs and data are stored in a single operating memory and treated uniformly. When introduced, this novel idea allowed simple hardware abstraction, making computer programming flexible and straightforward.
However, the physical separation of the processing units from memory through limited shared communication buses leads to increased communication latency and reduced throughput. The single communication bus between the processing unit and the shared instruction and data memory causes the so-called memory wall problem that limits processing efficiency. The memory wall appears due to the limited data transfer capacity and low transfer scalability between the processing unit and memory. As the communication bus can access either the shared data or the instruction memory at any point in time, but not both, for most application classes the data transfer rate is inherently lower than the rate at which the processing units work. Therefore, the processing units constantly wait for data to be read from or stored in memory.
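To make the memory wall concrete, the following Python sketch compares the time a kernel would spend moving data over the bus with the time it would spend computing; the peak compute and bandwidth figures are illustrative assumptions, not measured values of any particular machine.

```python
def bound(flops, bytes_moved, peak_flops=1e12, mem_bandwidth=1e11):
    """Classify a kernel as memory- or compute-bound: the memory wall bites
    whenever moving the data over the bus takes longer than computing on it."""
    compute_time = flops / peak_flops            # seconds spent computing
    transfer_time = bytes_moved / mem_bandwidth  # seconds spent on the bus
    return "memory-bound" if transfer_time > compute_time else "compute-bound"

# Element-wise vector addition: roughly 1 FLOP per 12 bytes moved (two float32
# reads and one write), so the bus dominates.
print(bound(flops=1e9, bytes_moved=12e9))   # -> memory-bound
# A large dense matrix multiplication reuses each loaded byte many times.
print(bound(flops=2e12, bytes_moved=24e9))  # -> compute-bound
```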
To mitigate this problem, most modern computers deviate to some degree from the von Neumann architecture by separating the data and instruction memory and introducing parallel data processing concepts, such as instruction pipelining and separate caches for data and instructions. Therefore, today's most common computer architectures are not strictly based on the von Neumann model. They provide the illusion of a von Neumann programmer's model implemented over a separate execution pipeline and a modified memory management system.
As modern applications increasingly rely on data-intensive artificial intelligence (AI) algorithms, the limitations of traditional von Neumann architectures have become a critical barrier to further improvements in computation time, memory performance, and energy efficiency. Specifically, these AI algorithms require a large amount of fast memory, further aggravating the memory wall bottleneck. To illustrate this issue, consider the rapid evolution of the Generative Pre-trained Transformer (GPT) models. In just four years, the state of the art has grown from the GPT-2 model with 1.5 billion parameters to over 175 billion parameters for GPT-3 and, in March 2023, to a reported 100 trillion parameters for GPT-4 [1]. This highlights the urgent need for new computing architectures to support modern AI applications' demands for memory, low-latency computation, and sustainable execution.
To address the issues of memory, high latency, and energy usage, many novel computing architectures beyond the traditional von Neumann model have been proposed in recent years. These new architectures, known as non-von Neumann architectures, range from power-efficient single-board AI accelerators to quantum and neuromorphic computers. These novel architectures have great potential for revolutionizing how we process and analyze data, leading to significant advancements in healthcare, transportation, and entertainment.
Despite their potential benefits, it is difficult to integrate non-von Neumann architectures with the current well-established distributed computing paradigms, including cloud and edge computing and their amalgamation in the computing continuum [2]. Their integration in the computing continuum remains challenging due to significant heterogeneity in architecture, data representation, communication, and instruction execution.
Therefore, in this work, we provide (i) a summary of the modern non-von Neumann architectures, (ii) a classification of the most established application classes and their suitability for deployment on non-von Neumann architectures, and (iii) a discussion of the main research directions towards overcoming the limiting factors for integrating non-von Neumann architectures in the computing continuum.

Summary of non-von Neumann Architectures
This section provides a detailed summary of the non-von Neumann architectures based on their architectural specifics. The differences between the presented architectures are summarized in Table 1, considering their computational models, processing paradigms, data representation and precision, and scalability.
We first define a non-von Neumann architecture as a computer architecture that differs from the traditional von Neumann model, which relies on a single shared memory for instructions and data. In non-von Neumann architectures, the memory model is usually distributed, divided into hierarchies, or even merged with the processing components.

Classical non-von Neumann architectures
The classical non-von Neumann architectures encompass various digital computers that rely on binary data representation. These architectures provide a simple programming and memory model that mimics the classical von Neumann shared memory abstraction. It is important to note that the following architectures can be implemented both as von Neumann and non-von Neumann; however, in the following text we refer to the non-von Neumann implementations. We can generally divide these architectures based on their field of application, namely (i) general purpose and (ii) application specific.
The most common architectures are general-purpose, forming the core of everyday computing devices. We therefore highlight the following architectures, classified by their Instruction Set Architecture (ISA) and memory organization:

• Artificial Intelligence (AI) accelerators, such as the Tensor Processing Units (TPUs), are domain-specific computer architectures designed to train deep neural networks with lower precision [7]. AI accelerators are usually modular and utilize a systolic array to interconnect multiple devices in complex three-dimensional topologies. TPUs, for example, include large 28 MB to 144 MB register files, specialized scratchpad memory, and a hierarchical set of cache memories. A sketch of the systolic data flow follows this list.

• Stream multiprocessor architectures are implemented in general-purpose graphics processors. While originally designed for processing graphics, as their name implies, today they are used almost as general-purpose processors [8]. These architectures rely on the single instruction, multiple threads (SIMT) approach, which limits the instruction-fetching overhead. The most common architectures using this paradigm are Nvidia Ampere and AMD RDNA3. These architectures use dedicated caches, usually three levels (L1 to L3), per streaming multiprocessor (Ampere) or compute unit (RDNA3) to overcome the memory wall and the limited processing performance.
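To illustrate the systolic-array data flow mentioned for AI accelerators, the following Python/NumPy sketch simulates an output-stationary array in which the operands of one matrix stream rightwards and those of the other stream downwards, with every processing element multiplying and accumulating in lockstep. It is a conceptual toy model, not a description of any specific TPU generation.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy simulation of an output-stationary systolic array computing A @ B.
    Rows of A stream in from the left and columns of B from the top, skewed
    in time; every processing element (PE) multiplies the operands it holds
    and accumulates the product locally."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    acc = np.zeros((M, N))     # one accumulator per PE
    a_reg = np.zeros((M, N))   # A-operand currently held by each PE
    b_reg = np.zeros((M, N))   # B-operand currently held by each PE
    for t in range(M + N + K - 2):          # cycles until all operands drain
        a_reg[:, 1:] = a_reg[:, :-1]        # operands move one PE to the right
        b_reg[1:, :] = b_reg[:-1, :]        # ... and one PE downwards
        for i in range(M):                  # skewed feed on the left edge
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):                  # skewed feed on the top edge
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        acc += a_reg * b_reg                # all PEs multiply-accumulate at once
    return acc

A, B = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```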

Barriers to integration in the computing continuum:
Currently, the von Neumann and the classical non-von Neumann architectures represent the backbone of the computing continuum. This is due to their significant proliferation in the market, high availability, simple memory and programming models, and low purchasing price. However, due to differences in architecture, network technologies and protocols, and operating systems, there are still multiple barriers to their integration, including interoperability issues, application orchestration difficulties, and performance prediction instability, which we discuss further in the section on future research directions.

Non-classical non-von Neumann architectures
Several factors limit classical computing architectures. These factors include power limitations, memory bandwidth, high heat dissipation, and non-sustainable manufacturing processes. Due to these limitations, academia and industry recently introduced numerous novel architectures based on technologies other than traditional semiconductors. These novel architectures explore significantly different approaches, such as using analogue data representation or even concepts from quantum mechanics, to encode and process information.
• Neuromorphic computing [9] refers to architectures that imitate human neurobiological processes through massively parallelized electronic circuits. Neuromorphic hardware is not based on the von Neumann memory model. Instead, it comprises neurons and synapses responsible for both processing and memory. Input neurons are charged with incoming analogue inputs (spikes) and eventually fire further spikes through the outgoing synapses, which, in turn, charge other neurons. The timing and strength of the spikes can be modulated via synaptic weights. Neuromorphic computing facilitates massively parallel event-driven processing since each neuron and synapse is independent, and spikes are asynchronous. Moreover, due to the event-driven design, a part of the hardware remains inactive when the corresponding input signal is absent. A minimal model of such a neuron is sketched below.
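As a rough illustration of the spiking behaviour described above, the following Python sketch models a single leaky integrate-and-fire neuron; the threshold, leak factor, and synaptic weights are illustrative assumptions rather than parameters of any particular neuromorphic chip.

```python
import numpy as np

def lif_neuron(input_spikes, weights, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron: weighted input spikes are integrated
    into a membrane potential that leaks over time; when the potential crosses
    the threshold, the neuron emits an output spike and resets."""
    potential = 0.0
    output_spikes = []
    for spikes_t in input_spikes:                  # one 0/1 vector per time step
        potential = leak * potential + float(np.dot(weights, spikes_t))
        if potential >= threshold:
            output_spikes.append(1)
            potential = 0.0                        # reset after firing
        else:
            output_spikes.append(0)
    return output_spikes

train = [np.array([1, 0]), np.array([1, 1]), np.array([0, 0])]
print(lif_neuron(train, weights=np.array([0.6, 0.5])))   # -> [0, 1, 0]
```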

Summary of modern application classes
Non-von Neumann architectures in the computing continuum provide a suitable foundation for handling the computational requirements of data-intensive tasks, enabling efficient and effective execution of data-driven applications. This section explores the three most widespread classes of modern data-driven applications and discusses their relation to non-von Neumann architectures in terms of computational performance and fields of application. We also summarize the relationship between the application classes and their characteristic requirements with classical non-von Neumann architectures in Table 2 and non-classical ones in Table 3.

Machine learning
In Machine Learning (ML), mathematical models are trained to recognize features of input data, e.g., for image classification and object recognition. These models are trained on large quantities of data. They have been a key application for specialized architectures, such as stream multiprocessors, for years due to their increased processing power over general-purpose processing units [12]. To increase the processing speed of machine learning applications on stream processors and AI accelerators, general 32-bit and 64-bit floating-point units are swapped for lower-precision 16-bit and 8-bit units, as in TPUs. These compute optimizations have shifted the performance bottleneck further to the data path, a traditional problem for von Neumann architectures. More radical non-von Neumann architectures, such as neuromorphic computing and quantum computing [9], can revolutionize the way we process and store information for machine learning. Due to their brain-inspired design, the former class of architectures can process artificial neural networks highly efficiently. Notably, deep learning in the form of spiking neural networks [13] is particularly suitable for neuromorphic hardware. The latter class, i.e., quantum computing, can also offer significant speedup thanks to quantum parallelism [14].
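The following Python sketch illustrates the reduced-precision idea in its simplest form: it quantizes float32 matrices to signed 8-bit integers, multiplies them in integer arithmetic as an 8-bit unit would, and rescales the result. The symmetric scaling scheme is a textbook choice, not the calibration used by any specific accelerator.

```python
import numpy as np

def quantized_matmul(A, B, bits=8):
    """Reduced-precision matrix multiply: symmetrically quantize float32
    operands to signed integers, multiply and accumulate in int32 (as an
    8-bit accelerator would), then rescale the result to floating point."""
    qmax = 2 ** (bits - 1) - 1
    scale_a = np.max(np.abs(A)) / qmax
    scale_b = np.max(np.abs(B)) / qmax
    qa = np.round(A / scale_a).astype(np.int32)
    qb = np.round(B / scale_b).astype(np.int32)
    return (qa @ qb).astype(np.float32) * (scale_a * scale_b)

A = np.random.randn(64, 64).astype(np.float32)
B = np.random.randn(64, 64).astype(np.float32)
error = np.max(np.abs(quantized_matmul(A, B) - A @ B))
print(f"max absolute deviation from the float32 result: {error:.4f}")
```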

Scientific computing
Scientific computing is applied in various domains, including linear algebra, molecular dynamics, material sciences, and drug design. Emerging non-von Neumann architectures can accelerate such scientific applications with high-end computational and performance requirements. SPARC and RISC-V processors can enable optimized data movement and computation, minimizing latency and maximizing throughput for complex numerical simulations. Similarly, TPUs can be used in scientific applications to accelerate matrix operations in deep learning and to train large neural networks.
Non-classical quantum-based non-von Neumann architectures also find use in scientific applications, thanks to the proven theoretical speedup for different scientific problems and the native modeling of many scientific phenomena in quantum programs [15]. Quantum techniques such as quantum matrix inversion [16] can be used to solve linear systems of equations and perform linear algebra operations much faster than classical computers. Quantum computing has also shown potential utility as an accelerator and for modeling complex processes in molecular dynamics and material science. Examples include accelerating eigenvalue and Euclidean distance matrix calculations in molecular dynamics workflows, modeling the quantum properties of microscopic particles, molecular simulation, and speeding up the discovery process in drug design [17].
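To give a flavour of quantum parallelism without relying on any quantum SDK, the following plain NumPy sketch builds the state vector of a small qubit register and applies one Hadamard gate per qubit, placing the register in a superposition of all 2^n basis states in a single step, so that a subsequent unitary acts on every amplitude simultaneously.

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)   # single-qubit Hadamard gate

def uniform_superposition(n_qubits):
    """Apply one Hadamard gate per qubit to the |00...0> state vector; the
    register ends up in an equal superposition of all 2**n basis states."""
    state = np.zeros(2 ** n_qubits)
    state[0] = 1.0                          # start in |00...0>
    gate = np.eye(1)
    for _ in range(n_qubits):
        gate = np.kron(gate, H)             # build H applied to each qubit: H (x) ... (x) H
    return gate @ state

print(uniform_superposition(3))             # eight equal amplitudes, each 1/sqrt(8)
```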

Big Data Analytics
Big data analytics involves collecting, storing, processing, and analyzing large volumes of data to extract valuable insights, identify patterns, and make data-driven decisions in various industries.
Big Data applications can benefit from non-von Neumann architectures for accelerated data analytics. Recently, novel concepts have been proposed for supporting Big Data analytics in robotic systems that utilize neuromorphic processors to support the robots' decision-making while performing the data-heavy analysis in the cloud. In general, neuromorphic computing is being explored to enhance Big Data analysis for time-sensitive operations, particularly in medicine and industry. Specifically, analogue-based analytics significantly reduces the computational complexity of big data applications by reducing the algorithms' space and time dimensions. Furthermore, stream multiprocessors remain a widely used platform for accelerating large-scale data analytics.

Future research directions for adopting non-von Neumann architectures in the computing continuum
This section discusses the barriers to adopting non-von Neumann architectures and possible future research directions.

Device heterogeneity
The heterogeneity of non-von Neumann architectures hinders their transparent integration in the computing continuum. As described in the summary of non-von Neumann architectures above, due to the specific implementation of the most common non-von Neumann architectures, the computational performance is highly affected by the exact technical implementation and the nature of the utilized algorithms. For example, neuromorphic computing is well suited for a class of machine learning applications. However, classical non-von Neumann architectures, such as general-purpose ARM and x86, can support a much larger set of applications, albeit with lower computational performance. Therefore, managing complex distributed applications over computing continuum infrastructures containing various non-von Neumann computing nodes requires rewriting the source code and compiling it for the set of suitable architectures. In practice, this could be difficult to achieve for a vast set of applications. Therefore, it is relevant to explore novel research approaches for analyzing the similarities between the architectures in the computing continuum based on their programming and data models.

Communication and data movement
The unprecedented and sudden improvement in computational power at the network's edge due to the integration of non-von Neumann architectures might require significant data transfer between the different systems. This can result in high latency and data transfer costs, particularly for real-time applications. Consequently, application providers might favor edge-only deployments to mitigate these issues. Furthermore, non-von Neumann architectures require adaptive communication protocols that can accommodate drastically different systems. This is particularly true for architectures capable of processing analogue or mixed signals internally, including neuromorphic and quantum computers. The currently available protocols are static and use limited rules to enable communication. They are inefficient or even incapable of transmitting data in the form of spikes or qubits. The fragility of such data, especially quantum information that environmental factors can easily disrupt, would also limit the communication range, at least in early implementations. Therefore, exploring novel adaptive protocols that utilize AI approaches to accommodate highly heterogeneous systems is an essential research track.

Interoperability
With von Neumann and non-von Neumann architectures offering highly variable performance for different application classes, different applications can be better optimized for a specific architecture, reducing the need for strict interoperability between architectures. However, with the variety of modern architectures increasing, application developers might still want to support multiple architectures in case users cannot access specific, sparsely available resources. For example, although quantum computing supports machine learning very efficiently, the current scarcity of quantum computing machinery might push application developers to maintain an x86 version of the source to ensure that the application can be executed when no quantum computer is available or the waiting times are long. Moreover, the optimal target architecture might differ depending on conditions; for example, for machine learning, deployment on general-purpose ISAs (ARM, x86, SPARC, RISC-V) might be preferable to stream processors (RDNA, Ampere) because of their significant memory capacity, even though stream processors' computing capacity is generally superior for this type of application.
Furthermore, when fitting the computing continuum with various architectures, these devices need to be managed by shared software components, such as resource managers and operating services, adding a need for interoperability. This shared management requires applications, their data, and possible isolation mechanisms (containers, virtual machines) to be uniform, so that management services can operate each application's lifecycle similarly, independent of the target architecture. The alternative would be to deploy management software for each specific architecture. However, this would remove any opportunity for interoperability, which is especially important for workloads consisting of multiple services deployed across multiple architectures.

Application orchestration and systems adaptation
Application scheduling and orchestration across the computing continuum is a process that requires identifying proper resources and considering the requirements of each application in terms of computational performance, available memory, and network bandwidth, among others. Many non-von Neumann computers have fundamentally different computational models, memory structures, and communication patterns. This creates challenges in developing efficient and effective scheduling algorithms that handle these systems' constraints and limitations. For instance, in a neuromorphic computer, data processing occurs in a massively parallel fashion, with many neurons simultaneously computing on a large dataset. This makes it difficult to schedule tasks and allocate resources, as there is no clear separation between computation and communication. In quantum computers, the scheduling problem is exacerbated by the probabilistic nature of quantum states and quantum gates, which requires careful consideration of the order and timing of operations to ensure that the computation is correct.
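As a minimal illustration of architecture-aware placement, the following Python sketch greedily assigns tasks to continuum nodes based on the architectures each task has been compiled for and the memory it requires; the node and task fields are hypothetical and do not correspond to any real orchestrator API.

```python
def schedule(tasks, nodes):
    """Greedy architecture-aware placement: for each task, pick the compatible
    node (matching architecture, enough free memory) with the most headroom."""
    placement = {}
    for task in tasks:
        candidates = [n for n in nodes
                      if n["arch"] in task["supported_archs"]
                      and n["free_mem_gb"] >= task["mem_gb"]]
        if not candidates:
            placement[task["name"]] = None          # no compatible resource found
            continue
        best = max(candidates, key=lambda n: n["free_mem_gb"])
        best["free_mem_gb"] -= task["mem_gb"]       # reserve memory on the chosen node
        placement[task["name"]] = best["name"]
    return placement

nodes = [{"name": "edge-tpu", "arch": "tpu", "free_mem_gb": 8},
         {"name": "cloud-x86", "arch": "x86", "free_mem_gb": 64},
         {"name": "neuro-1", "arch": "neuromorphic", "free_mem_gb": 1}]
tasks = [{"name": "train-cnn", "supported_archs": {"tpu", "x86"}, "mem_gb": 16},
         {"name": "snn-inference", "supported_archs": {"neuromorphic"}, "mem_gb": 1}]
print(schedule(tasks, nodes))   # -> {'train-cnn': 'cloud-x86', 'snn-inference': 'neuro-1'}
```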
Furthermore, the reliance of orchestration approaches on real-time monitoring data aggravates the problem further. A multitude of monitoring platforms is available; however, none of them supports the monitoring of quantum or neuromorphic computers. Besides, defining unified monitoring metrics can be difficult. For example, it is challenging to cross-quantify the resource utilization rate for TPU devices, ARM devices, and digital annealers.
Related to monitoring, it is also essential to discuss how the computing continuum infrastructure can be adapted if the application's performance is insufficient. Multiple approaches for adaptation, such as the Free Energy Principle and the Markov Blanket approach [20], can be used to adapt the system based on the monitoring data.
In conclusion, scheduling applications on non-von Neumann computers is a challenging task that requires specialized algorithms and monitoring approaches that can handle the unique characteristics of these systems. As these emerging computing paradigms become more prevalent, developing new scheduling techniques to utilize the available resources effectively is essential.

Performance predictability
The performance prediction of a system implemented on top of a non-von Neumann architecture requires understanding which phases can be executed in parallel and which must be executed serially. This is a necessary step in performance prediction because of the heterogeneous nature of deployments.
Whether a phase that must be executed serially on a given architecture has components that might be parallelizable when deployed on different architectures depends on the nature of the algorithm within the phase. When such possibilities have been identified, we wish to predict the performance of the sequential baseline and the various options for parallelization and deployment.
Consider a computation that can be decomposed into a set of serial phases. On a Gantt chart, each phase corresponds to a horizontal bar whose length represents the anticipated phase duration. If the phases must be executed one after the other, the Gantt chart looks like a staircase descending from left to right with steps of unequal lengths. We can visualize the performance benefits of different non-von Neumann architectures as bars with reduced lengths. The benefits of decomposition into parallel phases are visualized by how overlaps reduce the span from the leftmost endpoint to the rightmost endpoint.
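The Gantt-chart reasoning above can be captured in a few lines of Python; the sketch below computes the span of a hypothetical four-phase workload, first executed fully serially and then with one phase overlapped on a second device (the phase names and durations are invented for illustration).

```python
def makespan(durations, predecessors):
    """Compute finish times for phases listed in dependency order: a phase
    starts once all of its predecessors have finished; an empty predecessor
    list means it can start immediately, i.e., in parallel with other phases."""
    finish = {}
    for phase, duration in durations.items():
        start = max((finish[p] for p in predecessors.get(phase, [])), default=0.0)
        finish[phase] = start + duration
    return max(finish.values())

durations = {"load": 4.0, "preprocess": 3.0, "train": 10.0, "report": 1.0}
# Fully serial staircase: each phase waits for the previous one.
serial = makespan(durations, {"preprocess": ["load"],
                              "train": ["preprocess"],
                              "report": ["train"]})
# Preprocessing overlapped with loading, e.g., offloaded to a second device.
overlapped = makespan(durations, {"train": ["load", "preprocess"],
                                  "report": ["train"]})
print(serial, overlapped)   # -> 18.0 15.0
```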
The execution times could be calculated via mathematical models or obtained by running benchmarks on platforms with different architectures. This is especially desirable when the characteristics of the program or the platform make it difficult to build an analytic model of how the phase will behave. Predictions might be aided by using a library of benchmark programs compiled differently for different target platforms.

Sustainability
The high power usage in computing caused by the von Neumann bottleneck is a significant sustainability issue, especially when accessing memory through low-bandwidth buses. Therefore, it is crucial to enhance energy efficiency in computing to ensure sustainability. Using specialized memory in non-von Neumann architectures eliminates the need for data transfer between memory and processing units. Their design to perform tasks in parallel reduces the time needed to complete tasks, thereby reducing energy consumption.
However, implementing non-von Neumann architectures in the computing continuum faces challenges, particularly at the edge layer, where devices have restricted computational and electrical power and limited battery capacity compared to devices in the cloud. Even with non-von Neumann architectures, there are still technical challenges due to the heterogeneity of nodes across the continuum layers. For example, neuromorphic computers are very power efficient; however, due to their size and production cost, they cannot be distributed in large quantities at the network's edge. Therefore, other architectures with lower power requirements, such as ARM or TPU, could be a more suitable compromise.
Despite these challenges, non-von Neumann architectures offer promising solutions for sustainable and energy-efficient nodes, especially at the edge layer. However, to make them feasible for devices in the continuum, various aspects must be considered, such as integrating these architectures into complex continuum systems, optimizing their energy consumption, and addressing the technical issues associated with node heterogeneity. By overcoming these challenges, non-von Neumann architectures could significantly contribute to the sustainability of future computing.

Conclusion
This article provides a detailed summary of the available and emerging non-von Neumann architectures and their specific characteristics regarding memory and computational management. We formulate the most commonly used application classes, exploring their particular characteristics and the current efforts to execute them on non-von Neumann architectures. In addition, we discuss the current barriers to adopting non-von Neumann architectures in the computing continuum and explore new research tracks in terms of architecture heterogeneity, communication, interoperability, orchestration, performance prediction, and sustainability.

Dragi Kimovski is a habilitated researcher on a tenure track at the University of Klagenfurt.
Nishant Saurabh is an assistant professor at Utrecht University.
Matthijs Jansen is a PhD student at the Vrije Universiteit Amsterdam.
Atakan Aral is an assistant professor at Umeå University and a principal investigator at the University of Vienna.
Auday Al-Dulaimy is an assistant professor at Mälardalen University and Dalarna University.

André B. Bondi is President of Software Performance and Scalability Consulting LLC and an adjunct professor of software engineering at Stevens Institute of Technology.
Antonino Galletta is an assistant professor at the University of Messina.
Alessandro V. Papadopoulos is a professor at Mälardalen University.
Alexandru Iosup is a professor at Vrije Universiteit Amsterdam.
Radu Prodan is a professor at the University of Klagenfurt.

Table 1: Qualitative differences between classical von Neumann and non-classical non-von Neumann architectures.

Table 2: Summary of the identified application classes and their suitability for efficiently exploiting the available classical non-von Neumann architectures in the computing continuum.

Table 3: Summary of the identified application classes and their suitability for efficiently exploiting the available non-classical non-von Neumann architectures in the computing continuum.