Feasibility Study and Porting of the Damped Least Square Algorithm on FPGA

Modern embedded computing platforms used within Cyber-Physical Systems (CPS) are nowadays leveraging more and more often on heterogeneous computing substrates, such as newest Field Programmable Gate Array (FPGA) devices. Compared to general purpose platforms, which have a fixed datapath, FPGAs provide designers the possibility of customizing part of the computing infrastructure, to better shape the execution on the application needs/features, and offer high efficiency in terms of timing and power performance, while naturally featuring parallelism. In the context of FPGA-based CPSs, this article has a two fold mission. On the one hand, it presents an analysis of the Damped Least Square (DLS) algorithm for a perspective hardware implementation. On the other hand, it describes the implementation of a robotic arm controller based on the DLS to numerically solve Inverse Kinematics problems over a heterogeneous FPGA. Assessments involve a Trossen Robotics WidowX robotic arm controlled by a Digilent ZedBoard provided with a Xilinx Zynq FPGA that computes the Inverse Kinematic.


I. INTRODUCTION
Cyber-Physical Systems (CPSs) are complex platforms, composed of physical and computing parts deeply inter-wined, characterized by a strong interaction with environment and users [1]. This continuous interaction requires high flexibility and a certain degree of adaptivity, due to the numerous triggers CPSs are subjected to. The cyber part of a CPS often leverages on embedded computing platforms, which nowadays have become heterogeneous, integrating different types of computing units, e.g. different cores and coprocessors. Such kind of platforms is suitable to serve the needs of CPSs, being capable of acquiring large data-streams from a plethora of different sensors, and of processing them, when required, in parallel. Field-Programmable Gate Array (FPGA)-based platforms can play a crucial role in the described context: they have turned into heterogeneous architectures offering, at the same time, the flexibility and efficiency given by the The associate editor coordinating the review of this manuscript and approving it for publication was Vincenzo Conti . integrated core(s), connected to custom hardware accelerators that can be reconfigured at run-time.
In general, the research activity in this context is flourishing [2]- [8]. In the H2020 CERBERO 1 European Project [8] we worked on providing a continuous design environment for CPSs that, at computational level, relies on many tools for FPGA-based CPSs, which have been accessed during the project time-frame on a Planetary Exploration use-case [9]. An adaptive embedded controller for a robotic manipulator has been implemented over an FPGA. Such a controller was supposed to work in stringent survival conditions (radiation and harsh environment), being able to meet the reliability constraints of a robotic exploration mission, as well as functional constraints such as a good accuracy over the requested tasks. Performance and accuracy are key requirements in robotic applications, not only in the space exploration field. For instance, also surgery [10] and industrial [11] application fields require an accurate trajectory control during their 1 http://www.cerbero-h2020.eu/ VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ tasks execution, even in the proximity of singularity points. At the same time, if online trajectory calculation is enabled, the time requested to evaluate it may be critical in achieving responsiveness.
Most of the commonly used methods for solving the faced problem, namely Forward Kinematic (FK) or Inverse Kinematic (IK), split the trajectory in smaller paths, each calculated singularly, and the higher is the number of smaller paths used to split the original trajectory, the more accurate will be the overall robotic arm movement. As a drawback, if the number of smaller paths grows the amount of calculations increases in turn. So that, to meet such colliding requirements different solutions are adopted, ranging from given algorithm modifications to the usage of different algorithms for different places within the workspace. Besides algorithm-oriented approaches, the explosion of computational complexity could be addressed by relying on modern computing platforms, which are suitable to support parallel computation and offer heterogeneity. Parallelization gives the possibility of executing different operations of the algorithms concurrently, while heterogeneity provides differentiation in terms of computing cores, shaping them according to the functionality they are dedicated to. Nevertheless, parallel and heterogeneous platforms have a cost in terms of design effort, since they are usually hard to program and control in an efficient way. Moreover, most of robotics algorithms are not easy to be parallelized, e.g. due to the dependencies between successive smaller paths calculations, and exploiting heterogeneity by shaping computing cores or by optimizing the execution on the different available ones, according to the involved operations, is not trivial.
This article focuses on the analysis of the Damped Least Squares (DLS) algorithm for IK solution, looking to a prospective hardware implementation on FPGA of a DLS-based hardware controller for a robotic arm. More in details, the contributions of this article are: • the open-source sequential and parallel MATLAB implementations of the DLS [12]. These implementations have been released as open data of the CERBERO project together with a trajectory generator [13] for the workspace described in Section II.
• the analysis of the DLS algorithm, with the specific intent of understanding which parameters could influence a perspective hardware implementation. This analysis is reported in Section V where two different implementations, a traditional sequential one and a parallel one based on segmentation, are discussed and analyzed. To the best of our knowledge no attempt to parallelize the DLS has been done prior to this work.
• a preliminary FPGA-based implementation of the robotic arm controller capable of executing different DLS profiles (baseline and high-performance). As discussed in Section VI, an automated High-Level Synthesis (HLS)-based strategy has been used to reduce the designers effort needed to master the target heterogeneous platform. To the best of our knowledge, despite our implementation is just preliminary and the assessment is covering just the reconfigurable part of the FPGA-based arm controller, the present work is the first one porting DLS for IK problems solving on such heterogeneous substrate. This article is organized as follows. In Section II, the context of the target robotic manipulator is presented. State-ofthe-art IK solutions with a focus on the DLS algorithm are discussed in Section III. In Section IV, state-of-the-art works on hardware implementations of the algorithms solving the IK problem are proposed. In Section V, a detailed analysis of pros and cons in the usage of the DLS algorithm in the context of the CERBERO project is presented. Section VI is dedicated to the design flow and assessment of the proposed hardware implementation of the DLS algorithm. Section VII concludes the paper.

II. REFERENCE PROBLEM
This section provides an overview of the physical context of the proposed work. The target manipulator, the workspace it operates into, and the tasks considered for the execution within such workspace are described in the subsections below.

A. TARGET MANIPULATOR
Any robotic manipulator is composed of different parts, namely: 1) a base; 2) rigid links; 3) joints (each of them connecting two adjacent links); and 4) an end-effector. Figure 1 shows as an angle, θ i , can be associated to each rotational joint. Describing the movement of the manipulator in the space implies to define the trajectory of the end-effector from its initial given spatial coordinates, or from the origin A(x 0 , y 0 , z 0 ), to the final ones, or to the destination Z (x 1 , y 1 , z 1 ). It is possible to describe such a trajectory by: • computing the spatial coordinates of the end-effector from all the joint angles, resolving a FK problem.
• computing the joint angles from a desired end-effector spatial coordinate, resolving an IK problem.
To implement an arm controller, the second type of problem has to be solved to derive the angles of the joints starting from a given desired position (or set of positions). The characteristics/features of the manipulator are generally vendor specific. The target we have considered is a WidowX Robotic Arm by Trossen Robotics [14], characterized by four main Degrees of Freedom (DoFs) for wrist and clamp movements. This manipulator is equipped with six Dynamixel actuators, controllable in terms of angular position, speed and acceleration, and each of them provides a full control of the relative joint via a digital communication using a Universal Asynchronous Receiver-Transmitter (UART) protocol. The manipulator comes with an integrated Arduino-based ArbotiX-M controller, which can be used to control it directly, even though one of the goals of this article is to prove that it is possible to substitute the built-in controller with an FPGA based one.

B. WORKSPACE, TASKS AND REQUIREMENTS
The reachable coordinates depend on the chosen robotic arm and on its fabrication parameters that define the operating workspace, which has to include the set of points reachable by the end-effector, meant as both the initial and final spatial coordinates, as well as the positions along the trajectory. The workspace for a six-axis robotic arm is a six-dimensional space that consists of all possible combinations of values for the six degrees of freedom of the arm, e.g. the complete set of positions and orientations of the robot.
In this workspace, there might be also singularity points that are positions where the IK solver fails to find a convergent solution. Main ones are wrist, elbow and shoulder singularities. Passing close to a singularity results in high joint velocities, which are normally to be avoided since they might damage the arm itself. A properly built workspace could limit, or even prevent, the risks related to singularities.
In this work, we have adopted a singularity-free workspace defined by the Thales Alenia Space company for the chosen type of manipulator and made available as an open-data [15] of the CERBERO project [8]. The workspace is limited to a safe area of operation specified in cylindrical coordinates (see Equations 1,2 and 3). In this region, given the equations and assumptions specified in [15], the end-effector can describe any type of trajectory without encountering singularities. Moreover, moving from A(x 0 , y 0 , z 0 ) to Z (x 1 , y 1 , z 1 ) it is needed to guarantee that sgn(x 0 ) = sgn(x 1 ) or sgn(y 0 ) = sgn(y 1 ), so that not direct displacement is produced from one quadrant of the XY plane to the opposite one.
According to the described workspace, we have derived, and made available as an open-data [13] of the CERBERO project, a trajectory generator. The trajectory generator is a MATLAB script that generates 100 possible trajectories for the WidowX Robotic Arm. For each of them the initial and final position of the end-effector are defined as spatial coordinates within the workspace. By default, the initial point is fixed to A(x 0 , y 0 , z 0 ) = {35, 0, 15} that is the idle resting position of the arm. Along with the script itself, a trajectories.csv file is also provided. It contains the set of trajectories that are used for the algorithm explorations reported in this article in Section V. They can be classified in three groups, short, medium, and long, on the basis of their length. Please note that, given the reference workspace, the minimum length of a trajectory is around 2 cm and the maximum is around 65 cm. The identified groups are composed as follow: The definition of the workspace has to be accompanied by other requirements to specify the reference problem. As it will become evident in Section V, the tasks that the robotic arm has to execute may indeed affect programmers' decisions when choosing/tuning the IK solver. In this article, we consider that the chosen algorithm should be capable of executing two different tasks, presenting different requirements and characteristics: • Task 1: Retrieval of rock samples from planetary surface of Mars -Approximate distance of trajectories: 50 cm -Required positioning accuracy: 10 mm • Task 2: Manipulation of biocontainment -Approximate distance of trajectories: 10 cm -Required positioning accuracy: 5 mm These tasks are the same that have been used to demonstrate CERBERO technologies within the Planetary Exploration use case and have been provided by Thales Alenia Space [15].

III. SOLUTIONS TO THE INVERSE KINEMATIC PROBLEM
Solving an IK problem means computing the joint angles of a body starting from a desired end-effector spatial coordinate. This is a non-linear problem that requires to solve complex trigonometric functions. Indeed, an IK problem can have multiple solutions, only one or even none at all. In particular, the number of solutions depends on the desired end-effector coordinates and the number of DoFs. For instance, a point could result to be non-reachable and thus the problem would be not solvable. On the contrary, two different sets of angles may represent a solution when considering the elbow-up or elbow-down poses of an arm, but possibly only one of them could be physically possible for a certain considered arm. Thus, before solving an IK problem, it is necessary to take in consideration any joint angles limitation and singularities-related problem, and which are the out-of-reach end-effector spatial coordinates. In this regard, it is crucial to properly define the workspace, identifying the out-ofreach trajectories to be avoided by constructions. Similarly, VOLUME 8, 2020 singulatity matters could be avoided by excluding specific points from the workspace, as done by Thales Alenia Space in [15].
The IK problem has been deeply addressed in literature, mainly in the robotic and in the computer graphic fields. The computer graphic field is the one that offers more flexibility in studying the solvers for IK problems, and is the one considered by Aristidou et al. [16] in their classification of the possible solutions for IK. In particular, they consider four main categories of solvers, that present different behaviors: 1) Analytic solutions: considering the length of the arm, the initial position and the physical constraints, the analytic methods are used for finding all the possible solutions to the problem. They guarantee fast computation and the best solution. They generally do not suffer from the singularity problem and are highly reliable. However, they are suitable only for mechanisms with low DoFs. The problem is not easily scalable, the bigger is the chain of links and the more are the DoFs, the more difficult is the solution of the IK problem adopting an analytic solver. 2) Numerical solutions: based on a cost function to be minimized, this family includes the methods that require various iterations to converge over a solution. In general, solutions belonging to this family are better capable of mastering more DoFs and multiple end-effectors (e.g. fingers of a hand or arms of a body). Many algorithms, based on numerical solutions, exist: Heuristic methods and Cyclic Coordinate Descendant ones [17], Newton-based methods exploiting the second-order Taylor expansion, and Jacobianbased ones adopting inversed, pseudo-inversed and transposed Jacobian matrices [18]. 3) Data-Driven methods: learning-based techniques are the foundation for these methods that require having available large datasets and pre-learned postures to match the position of the end-effector. These characteristics make them suitable for complex problems such as human-like structures. 4) Hybrid methods: they are based on the simplification of the IK problem, by decomposing it in a combination of analytical and numerical components and exploiting the related algorithms for the solution. As data-driven methods, the hybrid methods are suitable for complex problems, such human-like structures or multiple end-effectors ones. Due to the peculiarities of the numerical solutions, that are highly suitable for the problem we address in this article, we focus our attention on this family. Indeed, given the chosen target (see Section II-A), where multiple end-effectors or body-like mechanisms are not there, complex data-driven or hybrid methods can be excluded. Having available six DoFs (four for the hand effector positioning, one for wrist and one for clamp), numerical methods seems to be the most straightforward choice. In our studies, we opted for the DLS IK solver, which falls into the Jacobian-based family, due to computational complexity considerations as explained below in Sections III-A and III-B.

A. JACOBIAN-BASED METHODS
Jacobian methods are based on the Jacobian matrix [19], defined as: Such a matrix represents a transformation between two time derivative-related spaces, the Cartesian space and the velocity space. The latter can be related to the velocity joints space.
Solving an IK problem requires finding the joint angles of the body from the desired spatial coordinates. Therefore we need to compute the joint space vector, − → , as: where − → e is the error or displacement vector and J † is computed from the Jacobian matrix, J , accordingly with the chosen Jacobian-based solver. The − → and − → e vectors are expressed as: and − → e = x y z δx δy δz T (6) where [ x y z] T defines a linear displacement, while [δx δy δz] T a rotational one. Figure 2 illustrates the general flow of Jacobian-based IK algorithms, no matter of the chosen specific solver. The trajectory itself is specified by initial and final end-effector coordinates. The former are computed solving a FK problem, which requires to know the angles assumed by the manipulator in the initial position; while the latter correspond to the desired position to be reached. The way of computing J † depends on the chosen Jacobian-based IK solver. In any case, an iterative process to compute the new joint variables is required.
Prior to execute the algorithm, the displacement vector must be defined according to Equation 6. The following simplification hypothesis [δx δy δz] T = [0 0 0] T has been taken, meaning that we will consider linear displacements only. Therefore, trajectories are computed just on the basis of the values assigned to [ x y z] T , and the displacement vector is expressed as: where x init , y init , and z init represent the initial point coordinates, x fin , y fin , and z fin the coordinates of the final position that the end-effector must reach, and i is the iteration parameter.
In our setup, the joint space is composed only by angles, since we are using rotational-only joints. The process to obtain integral angles from the joint space, is iterative: The number of repeated steps can be set by an iteration parameter of the algorithm. At each iteration, the last calculated − → new is used as − → old to compute the next − → new . This procedure is equivalent to the calculation of the defined integral of thetas along the trajectory.
The first mathematical solution for Equation 4 is computing J † as J −1 [18]. However, J might be neither square nor invertible. Therefore, several alternative solutions have been studied to solve this problem. The Jacobian transpose [18] computes J † using the transpose of J as J † = αJ T , while the Pseudo-inverse methods [18] leverage on J † = J T (JJ T ) −1 . The computation of J † , given by the transpose-based solutions, presents low complexity, but transpose-based solutions fail in handling singularities [20], and in many cases null space methods have been adopted to avoid singular configurations [21], [22]. The Single Value Decomposition (SVD) method [23] has been proposed to deal with the singularities. It generalizes the eigen-decomposition of a square normal matrix, decomposing the matrix J as J = UDV T , where U is an m × m unitary orthogonal matrix, D is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and V T is an n × n unitary orthogonal matrix. This solutions is based on the calculation of the eigenvalues and eigenvectors, at high computational costs. The Damped Least Square (DLS) algorithm [24], computes This algorithm handles the singularity problem by using a damping factor λ, and does not require the computation of eigenvalues and eigenvectors. Therefore, in our studies we opted for this latter to have the possibility of porting the designed hardware controller in different workspaces, not necessarily singularity-free ones, and to maintain the computational load of the robotic arm controller to be implemented as low as possible.

B. THE DAMPED LEAST SQUARE ALGORITHM: MATHEMATICAL FORMULATION
The DLS algorithm is also known as Levenberg-Marquardt algorithm [24], and computes J † as: so that, using Equation (8), the Equation (4) can be expressed as: The damping factor, λ, is a non-zero constant that needs to be large enough to make the solution behave well near singularities, but not too large to make the convergence rate excessively slow. This factor can be set as a static parameter, which has to be defined a-priori before starting the computation, or as a dynamic one to be calculated at run-time, between two subsequent iterations. In our hardware implementation, as discussed in Section V-E, a static λ is used.
To obtain the Jacobian J matrix, prior to the application of the DLS, the mathematical model of the chosen manipulator has to be derived. Commonly, the Denavit-Hartenberg (DH) parameters [25] are used. These are a set of four parameters composed of two angles (α and θ) and two displacements (a and d), defined for each link-joint group and obtained from the manipulator mechanical dimensions and joint orientations, as shown in Figure 3. Values related to the chosen set-up/manipulator are reported in Table 1.  For each set of parameters a transformation matrix is defined as: where t 11 = cos θ n t 12 = − sin θ n cos α n t 13 = sin θ n sin α n t 14 = a n cos θ n t 21 = sin θ n t 22 = cos θ n cos α n t 23 = − cos θ n sin α n t 24 = a n sin θ n t 32 = sin α n t 33 = cos α n t 34 This transformation matrix represents the position and the orientation of the n th -joint with respect to the previous one. For an N -joints manipulator there will be N transformation matrices. The general transformation matrix is defined as the chain product of each transformation matrix: This matrix represents the end-effector spatial coordinates and orientation with respect to the manipulator base coordinates. The derived DH parameters, used in Equation (10), allow to determine the general transformation matrix orientation coefficients (n x , n y , n z , o x , o y , o z , a x , a y , and a z ) and the position coefficients (p x , p y , and p z ).
p x , p y and p z are the end-effector spatial coordinates with respect to joint angles, defined throughout the manipulator FK over the final desired position.
10 cos(θ 2 + arctan 3) Once n x , n y , n z , o x , o y , o z , a x , a y , and a z , and p x , p y , and p z are known, according to the theory in [19], Equation (11) can be used to derive J as follow: where 10 sin(θ 2 + arctan 3)

THE DAMPED LEAST SQUARE ALGORITHM: SUMMARY OF THE FEATURES AND CHARACTERISTICS
As already said, the DLS presents a singularities tolerance, which is a crucial desirable characteristic when the workspace is not singularity-free. The DLS is capable of passing through workspace singularities: the algorithm allows tuning the solution around singularities by means of the damping factor, resulting in a smoothed movement. Moreover, it does not require to compute eigenvalues or eigenvectors, leading to a reduced computational complexity more suitable for hardware implementation. Furthermore, the DLS algorithm can achieve a good trade-off between computation time and accuracy even if complex application scenarios, where full body reconstruction are considered [26]. Generally speaking, computation time and accuracy strongly depend on two main parameters of the DLS: the number of iterations, which are the points along the trajectory from source to destination, and the damping factor λ, which define the behavior near the singularities. In particular, fixing the number of iterations per trajectory it is possible to predict computation time and accuracy of the DLS execution. This time predictability is a distinctive feature of the DLS that can drive the choice towards this IK solver under certain constrained application scenarios, such as in critical applications where the algorithm convergence has to be guaranteed within a hard deadline in time.

IV. RELATED WORKS
The literature available on the IK solvers is generally mainly focused on algorithms optimization when calculated offline. Not many state-of-the-art works address the implementations of IK solvers leveraging on hardware acceleration platforms.
With the advent of CPSs and Internet of Things (IoT), devices are every day more power limited, while being requested to be highly computational efficient. Dedicated hardware allows for the efficient execution of complex algorithms even in constrained contexts, where finding trade-offs among system aspects of interest, e.g. power consumption versus quality of service, may be fundamental [27]. Profiling the algorithms to find the sub-parts that can be accelerated is surely necessary, as well as the identification of parallelism, since parallel off-the-shelf or dedicated devices could improve both timing and power performance. In particular, while off-the-shelf devices force designers to shape the parallel execution over the available concurrent resources, dedicated platforms can be customized to support the native parallelism of the algorithm. Dedicated hardware also enables pipelining of resources, which can be seen as a further hardware-level degree of parallelism, additional to the algorithm-level one.
The two subsections below provide, first of all, an overview of the state-of-the-art regarding the available hardware implementations of IK solvers and, then, a closer look into what has been done specifically on the DLS is provided.

A. HARDWARE IMPLEMENTATIONS OF INVERSE KINEMATIC SOLVERS
The hardware acceleration of IK solvers has been only partially explored by the scientific community. Moreover, while pipelining is often applied, algorithm-level parallelization is not. Analytical IK solvers have been accelerated with FPGAs, mainly addressing industrial robotics applications [2], [28]. Other works proposed the adoption of algorithms not commonly used for solving IK problems, but intrinsically parallel. A behavior-based solver has been adopted by Köpper and Berns [6]. Such solver has an intrinsically parallel nature; therefore, it turns out to be suitable for an FPGA implementation: calculation of the different DoFs can be parallelized, resulting in a substantial speed-up with respect to general purpose systems, even if the accelerator has not been optimized yet. Hildebrand et al. [29] instead adopted a conformal geometrical algebra approach to solve the IK problem addressing computer graphics application with FPGA support. Both pipelining and algorithm-level parallelization methods are exploited on the reconfigurable logic of the FPGA, resulting in extremely shorter execution time with respect to a full software processing. Other studies adopted common IK solvers, where algorithm level parallelism is more difficult to be extracted. Yu et al. [30] considered the generalized inverse Jacobian method adopted for FPGA acceleration by means the parallel calculation of matrix multiplications. Suriano et al. [31] instead adopted the iterative Nelder-Mead optimization method to solve the IK problem, investing effort in parallelizing the application in different ways (intra-and inter-iteration). The resulting parallel solutions are then ported on an FPGA fabric tightly coupled with a multi-core system: the application is split between general purpose and dedicated computing units, which are provided in several instances enabling multi-parallelism on the hardware acceleration side. Table 2 summarizes the described IK solvers hardware implementations together with the approach proposed in this work. The analyzed state-of-the-art works adopt algorithms that are not suitable for the manipulator that is used in this work [2], [28], or algorithms that are meant for application fields far from the one considered in this article [29], or even algorithms that might not converge [31]. As already said, we adopted the DLS that has a predictable convergence time for a given trajectory length, and it is capable of tolerating singularities in the workspace. As will be detailed in Section VI, the proposed robotic arm controller will be implemented in hardware on a Xilinx Zynq FPGA platform. In this study we propose also a possible parallelization of the chosen IK solver, following a segmentation-based approach.

B. DAMPED LEAST SQUARE FOR INVERSE KINEMATICS
To the best of our knowledge, considering IK problems there are neither works in literature that address the hardware implementation of the DLS, nor works that analyze the DLS characteristics for a prospective hardware implementation. Works available in literature rarely use a pure DLS approach, meaning a purely numerical solution according to Aristodou et al. [16] classification. They are mainly preliminary studies combining also other behaviors, such as a system for robotic thumbs where both kinematics (DLS for movements) and dynamics (force model for grasping) [32], or full body reconstruction [26].
Other works focus on DLS drawbacks: being an iterative method, where each step (iteration) depends on the previous  one, the error is cumulative and it may result in a considerable distance between the actual and the desired end-effector position, especially when several DoFs are involved. To reduce the error, the number of iterations (see Section V-B) can be increased at the expenses of the computation time, which become longer. Another option to reduce the error is to couple DLS with other methods, such as the Newton-Raphson one that limits the cumulative error of DLS when it becomes too large, improving accuracy without increasing the DLS iterations [33].
Most of DLS implementations, in combination with other methods, play on the algorithm knobs: number of iterations and damping factor. Such a combination mainly speeds-up computation and achieves accuracy improvements. Wang et al. [34], [35] focused on the number of iterations, adjusting it by means of supervised learning techniques, considering different DoFs and different target platforms for implementation, among which an embedded multi-core system.
Other works focused more on the damping factor, which also impacts on accuracy and computation time of the DLS algorithm. In particular, addressing snake-like robots employed in micro-surgery applications, where hard accuracy and time requirements are present, Omisore et al. [10] adopted a deep-learning approach to optimally adjust the damping factor for every target point of the DLS, obtaining speed-up and accuracy improvements. Damping factor optimization is also desired for enhancing behavior of the DLS algorithm when passing close to singularities. Some works also aimed at such goal, by optimizing it by means of genetic algorithms [36] or closed-loops implementations, where the singularity problem and the Jacobian matrix are taken into consideration for damping factor adjustment [37]. Table 3 summarizes the works adopting DLS for IK solving and their main features and characteristics. Most of these works focus on the algorithm optimization under certain metrics, as computing time and accuracy, by coupling DLS with other algorithms or by playing with its parameters. None of those works focus on implementation details of the DLS, such as parallelization or optimization for execution on specific devices. Considering the addressed application fields, implementation aspects should be considered in future due to the stringent constraints and autonomous behavior possibilities, which are already envisioned for some of them, like for space and micro-surgery purposes. Based on this analysis, to the best of our knowledge, the proposed work is the first attempt to study parallelization and hardware acceleration of DLS algorithm for IK solving.

V. DAMPED LEAST SQUARES ALGORITHM: ANALYSIS
The goal of this analysis is to verify features, pros and cons of the DLS algorithm, when parameters vary. In particular we analyze two different MATLAB implementations that we made available as an open-data [12] of the CERBERO project. We will refer to these implementation as: • classic -intended as fully sequential, where prior to start a new iteration of the algorithm the current computation of Equation 7 should be completed. In this case, programmers can play with two parameters: the damping factor and the number of iterations.
• parallel -This implementation is composed of two steps.
In the first one the DLS is used to obtain coarse points over the whole trajectory; the outcome is a segmentation of the whole trajectory. In the second step, the DLS is run again over each segment, using as starting and final points the sequence of adjacent points calculated in the first step. In this case, programmers can play with three parameters: the damping factor, the number of iterations in the first DLS step and the number of iterations in the second DLS step. Figure 4 illustrates an example of the operation of the two implementations. The parallel implementation has been considered to understand the potential benefits of implementing the DLS over a parallel computing infrastructure, as the one used in [31]. Clearly, due to physical hardware limitations, we cannot assume an infinite parallel factor. As an example, dividing a trajectory into 50 different segments would require the availability of as many parallel computing units, which is highly unlikely in practice. For this reason, in the experiments below, we have considered up to 4 parallel computing units (that could be prospectively different cores or dedicated co-processors) to estimate the execution time.
Our explorations focus mainly on the error reaching the desired position and the elapsed execution time needed to compute the trajectories in [13]. All the raw data are available in [12]. Time measurements are obtained by using the ''tic/toc'' MATLAB functions, thus they are machinedependent. 2 In the classic case the provided estimation time is straightforward, while in the parallel one the estimation is given by summing up the time needed to compute step 1 (t s1 ), plus the time needed to compute step 2 (t s2 ) weighted with respect to the available parallel computing units. Equation 13 provides an example of the formula used to estimate the elapsed execution time needed to compute N different parallel segments over k parallel computing units.
Regarding the error, considering Z (x 1 , y 1 , z 1 ) as the final desired position of the end-effector, it is possible to define the error among the desired target and the actual reached position as computed by the chosen implementation. Equation 14 is representative of the error with respect to Z when using the classic implementation of the DLS, while Equation 15 when using the parallel implementation of the DLS. err cl = (x 1cl −x 1 ) 2 +(y 1cl −y 1 ) 2 +(z 1cl −z 1 ) 2 (14) 2 In all the experiments reported below the following machine has been adopted: PC Intel Core i7 8550U / 1.8GHz / 8 MB Cache / 16 GB DDR4 memory err par = (x 1par −x 1 ) 2 +(y 1par −y 1 ) 2 +(z 1par −z 1 ) 2 (15) In the subsections below the 15 different explorations listed in Table 4 are discussed in details.

A. SCENARIO 1: FIXED ITERATIONS AND FIXED DAMPING FACTOR
This analysis refers to Scenario 1 in Table 4. All the results are obtained with a fixed damping factor λ = 0.5. For the parallel implementation, we have chosen 100 iterations in the first step of the algorithm, which leads to have 100 segments. Then, for each segment, 10 iterations of the DLS are executed. Therefore, the DLS is executed 100 + 100 * 10 = 1100 times. To allow a fair computational comparison, classic is run with 1100 iterations too. Figure 5 reports on the error among the desired target and the actual reached position for each trajectory in [13]. This figure includes two curves: Classic is representative of Equation 14 and Parallel is representative of Equation 15. The trend of the curves is the same and the error increases with the trajectory ID that, as already stated in Section II-B, are ordered by length. From Figure 6 to Figure 8 the analysis of the different groups of trajectories in [13] is shown. Trajectories are grouped according to their length as described in Section II. In case of small trajectories both DLS implementations perform well. It is never the case that the error is above the worst given tolerance limit, which is set to 5 mm according to the requirements presented in    Section II-B. There is a good margin to even reduce the number of iterations in both the classic and the parallel case. For medium trajectories the situation changes. The classic implementation still guarantee that the requirement is met, the error is always below the given tolerance limit and there is still a good margin to reduce the number of iterations, maximum error is 1.65 mm. In the parallel case, instead, those trajectories with a length longer than 28 cm start to present an error with respect to the target that is larger than the required accuracy. For long the situation is the same: the classic implementation allows meeting the requirement with a maximum error of 3.03 mm, while this is never the case when the parallel one is adopted.
Both classic and parallel implementations would be fine for Task 2 (see Section II), which is characterized by trajectories having an approximate length of 10 cm belonging to the short group identified above. Therefore, in both cases the accuracy stay always below the 5 mm required by Task 2. If we consider the computation time, we can appreciate the advantage of the parallel solution with respect to the classic one. Figure 9 presents three curves reporting the elapsed time to compute short trajectories, two of them refer to the parallel implementation of the DLS. The Ideal Parallel considers having available an infinite number of parallel computing units; therefore, in Equation 13, N k = 1 and time par = t s1 + t s2 . The Parallel Over 4 Slots sets k = 4; therefore, in Equation 13, N k = 25 and time par = t s1 +25 * t s2 . In general, the advantage of segmentation is clear. The average processing time for the classic implementation is 11.23 ms with a standard deviation of 2 ms, while for the realistic parallel execution over 4 slots the average processing time is 4.11 ms with a standard deviation of 1.61.

SHORT TRAJECTORIES
Parallelization via segmentation has positive effect on short trajectories, the error is kept within the tolerance. Timing advantage is there, being on average nearly 1/3 of the classic one. In practice, this advantage would strongly depend on the communication overhead, and target dependent results are expected.
In general, there are margins to reduce the number of iterations, since the difference between the computed errors and the expected accuracy is large. This is particularly true for the classic DLS implementation, which is discussed more in details in Section V-D. Nevertheless, reducing the number of iterations may impact on the smoothness of the movement itself. Therefore, the identification of the minimum number of iterations to be considered has to be defined also according to the physical impact on the movement.
The other two groups, medium and long, are relevant for Task 1, which contains trajectories that can be up to 50 cm long. In this case the accuracy is set to 10 mm. This means that the parallel implementation would fit when executing Task 1 over medium trajectories, being the actual error in this group always less than 7 mm. While for the long ones this is not the case: trajectories around 50 cm present an error around 10 mm (i.e Trajectory 87 has a length of 50.63 cm and an error of 8.61 mm, and Trajectory 88 has a length of 50.70 cm and an error of 10.4 mm). Trajectories from 88 to 100 have even larger error values, and are theoretically reachable by the arm, but not relevant for Task 2. In terms of elapsed execution  time, the trend of the curves presented in Figure 10 is the same of Figure 9 and the theoretical saving is still approximately 1/3 in favor of the parallel implementation of the DLS with respect to the classic one.

MEDIUM TRAJECTORIES
Parallelization via segmentation has positive effects on medium trajectories too: the error is kept always within the tolerance and execution time is lower. Therefore, the same considerations made for the short case stands here too.
LONG TRAJECTORIES Parallelization via segmentation does not have a generally positive effect on long trajectories. Tolerance boundaries are, in some cases, either exceeded or no margins are there. Viable solutions to overcome this issue may be the following: • experiment different values of the damping factor, for which you should refer to Section V-B.
• change the number of segments, for which you should refer to Section V-C.

B. SCENARIO 2 -FOCUS ON PARALLEL: FIXED ITERATIONS AND VARIABLE DAMPING FACTOR
This analysis refers to Scenario 2 in Table 4. It focuses on the parallel implementation of the DLS and all the results are obtained for a fixed number of iterations (trajectories of 1100 points), while varying the damping factor λ that can assume the values 0.1, 0.2, 0.5 and 0.8. The damping factor λ influences not only the behavior around singularities, but also accuracy [38]: smaller λ should provide, in general, more accurate solutions. Therefore, choosing a small λ could solve the long trajectories issues discussed in the previous section. Figure 11 reports on this analysis for the long trajectories of the given workspace only, since they are those affected by accuracy issues when a parallel implementation of the DLS is chosen. The trend of the error for all the chosen damping factor value is the same and, as clearly demonstrated also in Table 5, the variation of λ, when we are so close to the workspace boundaries, is not sufficient to guarantee that the requirement on the accuracy is safely met.

LONG TRAJECTORIES
Parallelization via segmentation, even considering a variable λ, does not have a generally positive effect on long trajectories. Tolerance boundaries are still, in some cases, either exceeded or no margins are there.  The third considered scenario is meant to assess what would happen by changing the number of iterations in the first step of the parallel DLS implementation. Two different runs are compared, where the number of points is 100 (as in SC #1), and 150. The number of iterations per segment in the second step is kept equal to 10 and the damping factor is 0.5. The focus, as shown in Figure 12, is kept on the long trajectories of the given workspace only, since the issue of their accuracy when adopting a parallel DLS implementation is still there. The trend of the error curves is the same in both 100 (10) and 150 (10) cases, but increasing the overall number of segments has a positive effect: a reduced number of trajectories (98 to 100) is at the limit or exceeds the given accuracy requirement for Task 1. Nevertheless, having a closer look to them, as reported in Table 6, they all have a length above 60 cm, while Task 1 presents lengths around 50 cm. Therefore, despite being included in the workspace, they are not relevant for the considered task. The boundaries of Task 1 are provided by trajectories 85 to 88, which error is well below (maximum error is 7.34 mm, when up to 10 mm are admitted) the required accuracy when 150 iterations in step 1 are chosen. Figure 13 reports on the elapsed time,   (10), while Classic 1650 has as many iterations as Parallel 150 (10). Still parallel implementations are preferable, in terms of elapsed time, than their corresponding classic version. Moreover, even considering 150 iterations in the first step, the elapsed time is smaller than in the classic implementation with fewer number of points.
LONG TRAJECTORIES Parallelization via segmentation is positively affected by an increased N, number of points in the fist iteration of the DLS. Accuracy is met with a good margin. In the reference problem considered in this article, there are no limitations on timing or on the number of parallel computing units that can be used, so that we can conclude that playing with N leads to keep the accuracy under control also on Task 1.

D. SCENARIO 4 -FOCUS ON CLASSIC ONLY: VARIABLE ITERATIONS AND FIXED DAMPING FACTOR
Having said that the number of iterations matters, this scenario is focused on exploring the relationship between the number of iterations and the accuracy when adopting the classic DLS implementation to solve IK problems. By looking at Figures 14 to 16 one may notice that, no matter  on the chosen number of iterations, the error trend is growing with the trajectory ID, that is with the trajectory length. Only for N = 500 and N = 1000 all the trajectories meet the worst case accuracy requirement (Task 2, 5 mm accuracy). By looking at Table 7, we can draw more considerations related to what happens at the boundaries of Task 1 and Task 2 in the considered workspace. Regarding Task 2, all the chosen N are fine and there is still a good margin to further reduce N since in the worst case an error of 2.64 mm is there, while the tolerance is set to 5 mm, so more or less 50%. This is not the case for Task 1. Given the chosen N values, we need at least N = 250 to guarantee that the tolerance of 10 mm is met: in the worst case an error of 4.78 mm is there, while the tolerance is set to 10 mm, so again more or less 50%. In both cases, you can assume a satisfactory margin when the density of points per trajectory (columns D in Table 7), defined as the length divided by the number of iterations, is around 0.2 cm.

ALL TRAJECTORIES
If we consider no constraint in terms of timing, a classic fully sequential implementation of the DLS IK algorithm can be chosen. Also in this case, it is important to properly choose the number of iterations to be performed. By looking at our reference problem, it can be empirically derived that, to have a considerable margin (50% as defined above) with respect to the reference requirement, a density of points per trajectory of 0.2 cm should be fine.

E. WRAP UP OF THE DAMPED LEAST SQUARE ANALYSIS
To summarize all the carried out analysis: • a one-fit-to-all solution is not there. The DLS solver has to be characterized according to the executed task, to its requirements (e.g. accuracy) and to its features (e.g. approximate length of trajectories within a task).
• despite it is true that λ is fundamental to address the singularity problem, in a singularity-free workspace this parameter is not fundamental to meet the accuracy requirement, as the number of iterations turned out to be. Given the fact that in our reference problem there are no timing limitations to be considered, which may lead to opt for a parallel solution, we decided to implement in hardware the classic algorithm. Indeed, the classic solver has proved to imply overall less iterations to meet the accuracy requirements with larger margins that, in turn, means less communication overhead. Please note that, a parallel implementation can be easily derived from the classic one, since it has only to be replicated while properly managing data flowing.
In the assessment below, implementations are characterized by: i) a static λ, which is fixed to 0.5, and ii) by a number of iterations per execution that is 1. Nevertheless, in all the proposed hardware implementations, λ can be changed at runtime, making them suitable for other contexts and workspaces which may involve singularities. Moreover, multiple runs of the hardware implementations can be done VOLUME 8, 2020 according, e.g., to the empirical optimal density of points per trajectory derived in Section V-D, which is 0.2 cm. In summary, the proposed implementations can be potentially used with varying λ and number of iterations.

VI. HW IMPLEMENTATION
As already discussed in the introduction of this work, FPGA implementations could be extremely advantageous in the reference context since they are by construction flexible computing substrates, suitable to the reactive and dynamic behaviors characterizing CPSs, while being capable to efficiently support heavy computational workloads.
Therefore, in this section, we explore the possibility of implementing the DLS for the solution of IK problems with dedicated hardware. In Section VI-A we present the considered target architecture and the adopted design flow, in Section VI-B the DLS application is profiled and optimized, and finally in Section VI-C the obtained preliminary experimental results are discussed. Figure 17 illustrates the scenario setup addressed in this work. The Trossen Robotics WidowX Robotic aArm is controlled by an FPGA, in particular the one illustrated in figure is a Xilinx Zedboard, a Zynq FPGA that integrates the XC7Z020CLG484 chip. The reference architecture addressed in this work is a processor-coprocessor configuration where the processor runs most of the control flow tasks, while the coprocessor computes the onerous tasks, which in this case are related to the DLS calculation. The processor is the dual core ARM Cortex-A9 present on the addressed ZedBoard, while the coprocessor is fully custom and is generated with the Multi-Dataflow Composer (MDC) tool 3 [39], [40]. Processor and coprocessor communicate through an AMBA AXI4 system bus: the AXI4-Lite protocol is adopted for coprocessor configuration and control purposes, while the AXI4-full one is used for data transfers. This kind of coprocessing unit, being accessible through a commercial and widely adopted bus system, is suitable to different types of implementation. In the presented case, as discussed in Section V-E, there is no need of parallelization, but potentially the same type of coprocessor could be used in a parallel architecture too [5].

A. ARCHITECTURE AND DESIGN FLOW
To deploy such a system, we can exploit a dataflow-based design flow, in which applications to be accelerated in the coprocessor are specified as dataflow models [41]. A dataflow is basically a functional programming language describing applications through directed graphs whose nodes are the operations of the applications, defined as the actors. Dataflows already revealed to be effective in addressing software and/or hardware design and mapping in heterogeneous systems [42]. Leveraging on dataflow specifications, it was possible to exploit a set of commercial and academic tools for a complete automated design flow. In particular, 3 Available Open-Source at: https://github.com/mdc-suite/mdc the MDC tool supports the coprocessor design and deployment. It provides a dataflow-based automated flow targeting reconfigurable coprocessors, which can implement i) different functionalities with some common operations, or ii) different profiles (e.g. in terms of timing performance, power consumption, etc.) of the same functionality. Actors can be specified using standard imperative languages (e.g. C), which can be used to derive the Hardware Description Language (HDL) specifications, necessary for hardware implementation of the coprocessor, by leveraging on HLS engines [43]. Once the functionalities are modelled through dataflows and the corresponding HDL actors are provided, MDC is capable of automatically deriving the corresponding reconfigurable coprocessor in which common actors are shared by the different functionalities/execution profiles, and reconfiguration is enabled by multiplexing resources in time. Please note that, the adoption of dataflows to model the DLS application and of the MDC tool to generate the system opens to the possibility of exploiting, in future, the advanced features offered by this tool, such as coprocessor topology optimization [44] and power management [45].
In the considered context, we used MDC reconfigurability support to derive a design able to execute different profiles for the classic implementation of the DLS algorithm. In particular, we explored a baseline (BL) profile with slow DLS calculation but limited power consumption, and a high-performance (HP) profile with fast DLS calculation but high power consumption. The capability of switching among different execution profiles is particularly useful in self-powered systems. The HP profile could potentially be enabled only when critical operations, with strict time constraints, have to be performed; while, in default conditions, the BL profile is enabled to preserve the batteries charge.
To port the above-mentioned profiles in hardware over an MDC-compliant coprocessor the following steps have been performed. Starting from the MATLAB classic (see Section V) implementation of the DLS algorithm, the corresponding C code has been derived using the MATLAB Coder feature [46]. Once obtained an imperative C version, the DLS execution has been partitioned into five operations, which constitute the dataflow actors: • J_Matrix calculates the Jacobian matrix J of the arm (see Equation 12); • J2_Matrix calculates J T J + λ 2 I ; • Min performs a first step for the calculation of J T J + λ 2 I −1 ; • J2Cof_Matrix finalizes the calculation of J T J + λ 2 I −1 ; • Theta performs final calculations of the new angles starting from the output of the other actors and according to the target position of the end effector (see Equation 9).
A graphical overview of the resulting dataflow, showing the BL profile, is depicted in Figure 18. Here, it is possible to see the operations execution order and mutual dependencies. Please note that, in the resulting hardware implementation, all the operations will be executed asynchronously since they are performed by dedicated logic and First In First Out (FIFO) data buffers are placed between actors, allowing for a natural implementation of pipelining.
The HDL descriptions of the actors have been automatically derived through Vivado HLS, which is the HLS engine provided by Xilinx [47]. As common HLS engines, Vivado HLS allows the developer to tune the obtained HDL, in particular playing on the trade-off among resources and latency, through pragmas. In this way, it is possible to derive several versions of the actors/operations with different resources versus latency trade-offs, as explained here-under.

B. APPLICATION PROFILING AND OPTIMIZATION
Based on the profiling and performance of the BL profile, the HP profile has been derived, leveraging on the pragmas made available by Vivado HLS. The pragmas have been applied to the most computationally intensive actors within the BL dataflow. Table 8 depicts the latency required by the BL actors for executing one step of the DLS algorithm. The latency has been measured through a post-synthesis simulation of the actors performed with Vivado Simulator. J_Matrix is the most time consuming actor, being responsible of about 45% of the overall latency. Also J2_Matrix and Theta are quite computationally intensive: their execution requires more than 20% of the overall execution time. J2Cof_Matrix and Theta are instead requiring not more than 10% of total latency each.
According to the profiled data, three actors have been identified for latency optimization: J_Matrix, J2_Matrix and Theta. J_Matrix_HP, J2_Matrix_HP and Theta_HP have been derived instrumenting the corresponding C specifications with pragmas, mainly aimed at pipelining and unrolling inner loops of the computation. So that, two different execution profiles of the DLS hardware implementation have been developed: • BL profile, involving non optimized J_Matrix, J2_Matrix and Theta; • HP profile, involving the optimized J_Matrix_HP, J2_Matrix_HP and Theta_HP actors. Please note that BL and HP profiles share the remaining actors, Min and J2Cof_Matrix.

C. PRELIMINARY ASSESSMENT
In this section we assess the reconfigurable DLS implementation (reconf), derived using the design flow described in Section VI-A, and capable of performing both BL and HP profiles. Please note that this assessment focuses on a preliminary study of the reconfigurable DLS hardware datapath (MDC CGR accelerator in Figure 17). Therefore, we mainly focused on the study of the resource occupancy on board, which gives a first hint on the feasibility of the implementation, and on the latency and power data, which give hints on the possibility of achieving effective trade-off switching between the BL and HL configurations. The proposed implementation has been evaluated considering: i) resource occupancy data, coming from logic synthesis of the actors and the reconfigurable DLS HDL specification through the Vivado Synthesis Solution; ii) latency data, measured with post-synthesis simulations of the design running at 100 MHz and performed with the Vivado Simulator; iii) power consumption data, given by the Vivado Power Report considering  the real switching activity of the system gathered during the same post-synthesis simulations running at 100 MHz. Table 9 depicts resources occupancy details of the reconf design and of its standalone actors. The most time consuming actor, that is J_Matrix according to Table 8, is also the most resource hungry one. Such relationship is not always verified since J2Cof_Matrix, which is one of the less time consuming actors according to Table 8, is the second one in terms of resource occupancy. What is sure is that HP actors consume always more resources than the corresponding BL version. Such increase in resources demand is maximum for J_Matrix, where DSP touches +389%, and minimum for BRAM in J2_Matrix and Theta. In terms of reconfigurability, we can notice that the reconf design, which can execute both BL and HP profiles, takes less resources than the sum of single actors (see sum row in Table 9) synthesized stand-alone. This is due to optimizations made by the synthesizer. Indeed, resources partially employed in single actor cases are fully exploited, in shared manner, in the reconf one. Table 10 shows execution latency and power consumption of the reconfigurable DLS implementation while operating under the different execution profiles. The desired trade-off is now clear: providing HP versions of J_Matrix, J2_Matrix and Theta actors made it possible to almost halving the execution latency in the HP profile. In this case a single DLS iteration terminates in a bit more than 34 µs, against the almost 63 µs of the BL one. As a drawback, the increased resources usage of the HP profile is reflected on the consumed power: HP consumes 9% more overall power, which reaches 19% if the only dynamic contribution 4 is considered.

VII. CONCLUSION
In the context of CPSs design, it turned out to be particularly important for the embedded computing platforms to support complex execution workloads without sacrificing flexibility, to be able to dynamically react to system-and environmental-driven triggers. Heterogeneous platforms, such as modern FPGAs, could easily serve the purpose and, lately, have been used in many new application fields, one of them is certainly the robotic one.
In this article we presented the first attempt of implementing the DLS algorithm, for IK problems solution, over an FPGA substrate. The implementation has been realized leveraging on commercial (MATLAB, MATLAB Coder, Vivado HLS, Vivado) and open-source academic (MDC) tools, to mitigate the designer effort and make the implementation more straightforward also for software programmers. The developed implementation is suitable for playing at runtime with the main algorithm parameters, such as number of iterations or damping factor. Preliminary results showed that the proposed implementation is capable of processing one DLS iteration in less than 0.1 ms and that different execution profiles, trading-off execution latency and power consumption, can be enabled.
The other important contribution of this article is the analysis of the DLS for a perspective parallel implementation. From our study on the algorithm, it turned out that segmentation could be used to reduce the execution time, at the price of adopting more iterations of the DLS with respect to a classical sequential solution. Such a price could worth to be paid in strictly timing constrained scenarios, and a parallel hardware implementation like the one presented in [31] for the Nelder Mead IK solver could be adopted. The coprocessing infrastructure presented in Figure 17 is already compliant [5] with the parallel FPGA-based architecture presented in [4] and used in [31] for robotic purposes. Therefore, in future, a parallel hardware implementation of the DLS could be certainly attempted.
CARLO SAU (Member, IEEE) received the degree in electronic engineering and the Ph.D. degree from the University of Cagliari, in 2012 and 2016, respectively. He is currently an Assistant Professor with the University of Cagliari. Since 2012, he has been involved in automated methodologies for dataflow-based reconfigurable platforms generation. His main research interests include reconfigurable system design and the development of code generation tools for advanced reconfigurable hardware architectures.
TIZIANA FANNI received the degree in electronic engineering and the Ph.D. degree from the University of Cagliari, in 2014 and 2019, respectively. From 2014 to September 2019, she stayed with the Department of Electrical and Electronic Engineering, University of Cagliari, where she was working on power saving methodologies for dataflow-based reconfigurable platforms. She is currently a Postdoctoral Researcher with the University of Sassari. Her main research interests include reconfigurable systems design and the development of code generation tools for low-power reconfigurable hardware architectures.
CLAUDIO RUBATTU (Member, IEEE) received the degree in electronic engineering from the University of Cagliari, Italy, in 2015. He is currently pursuing the Ph.D. degree with the INSA Rennes, France. In May 2015, he started a one year and six months research grant related to embedded system design in bio-medical applications at the University of Cagliari. He is also a Research Assistant with the University of Sassari, Italy. In particular, he also focused on the development of predictable models of architecture for multicore systems with hardware acceleration in the context of embedded systems and CPSs. His recent publications are related to dataflow-based design flows aimed at achieving predictability for systems that consider coarse-grained reconfigurable accelerators. His main research interests include reconfigurable systems design and the development of code generation tools for reconfigurable hardware/software architectures.
LUCA FANNI received the degree in electronic engineering from Roma Tre University, in 2018. He worked as a Research Fellow with the University of Sassari, up to January 2020.
LUIGI RAFFO (Member, IEEE) received the Laurea degree in electronic engineering and the Ph.D. degree in electronics and computer science from the University of Genoa, Italy, in 1989 and 1994, respectively. In 1994, he joined the Department of Electrical and Electronic Engineering, University of Cagliari, Italy, as an Assistant Professor, and an Associate Professor, in 1998, where he has also been a Full Professor of Electronics, since 2006. He teaches courses on system/digital and analog electronic design and processor architectures for the courses of studies in electronic and biomedical engineering. He was a Coordinator of the Project EU IST-FET-IST-2001-39266-BEST and the MADNESS EU Project. He was also a Unit Coordinator of the Project EU IST-FET-SHAPES-Scalable Software Hardware Architecture Platform for Embedded Systems. He is also a Local Coordinator of industrial projects in the field (among others: ST-Microelectronics-Extension of ST200 architecture for ARM binary compatibility and ST-Microelectronics-Network on chip). He is also responsible for the cooperation programs in the field of embedded systems with several other European universities. He was a Local Coordinator of the ASAM (ARTEMIS-JU) and ALBA projects (national founded project) and RPCT (regional founded project).
FRANCESCA PALUMBO (Member, IEEE) received the Laurea degree (summa cum laude) in electronic engineering from the University of Cagliari, in 2005, the master's degree in advanced embedded system design from the Advanced Learning and Research Institute, University of Lugano, in 2006, and the Ph.D. degree in electronic and computer engineering from the University of Cagliari. She is currently an Associate Professor with the University of Sassari. Her research interests include reconfigurable systems and to code generation tools and design automation strategies for advanced reconfigurable hardware architectures. She is also a Permanent Steering Committee Member of the ACM Conference on Computing Frontiers. For her studies in the fields of dataflow-based programming and hardware customization, she received two best paper awards at the Conference on Design and Architectures for Signal and Image Processing, in 2011 and 2015. She serves in several different technical committee of international conferences. She was the Scientific Coordinator of the CERBERO H2020 European Project on Smart Cyber Physical System Design and the Scientific Director of the Summer School entitled Designing Cyber-Physical Systems-From concepts to implementation. She is also an Associate Editor of the Journal of Signal Processing Systems (Springer) and the IEEE EMBEDDED SYSTEMS LETTERS.