Concretization of Abstract Traffic Scene Specifications Using Metaheuristic Search

Existing safety assurance approaches for autonomous vehicles (AVs) perform system-level safety evaluation by placing the AV-under-test in challenging traffic scenarios captured by abstract scenario specifications and investigated in realistic traffic simulators. As a first step towards scenario-based testing of AVs, the initial scene of a traffic scenario must be concretized. In this context, the scene concretization challenge takes as input a high-level specification of abstract traffic scenes and aims to map them to concrete scenes where exact numeric initial values are defined for each attribute of a vehicle (e.g. position or velocity). In this paper, we propose a traffic scene concretization approach that places vehicles on realistic road maps such that they satisfy an extensible set of abstract constraints defined by an expressive scene specification language which also supports static detection of inconsistencies. Then, abstract constraints are mapped to corresponding numeric constraints, which are solved by metaheuristic search with customizable objective functions and constraint aggregation strategies. We conduct a series of experiments over three realistic road maps to compare eight configurations of our approach with three variations of the state-of-the-art Scenic tool, and to evaluate its scalability.


INTRODUCTION
The increasing popularity of autonomous vehicles (AVs) has resulted in a rising interest in their safety assurance. Existing safety assurance approaches [1], [2] test AVs by placing them in challenging traffic scenarios to evaluate their system-level safety (e.g. whether an AV can follow a trajectory without getting into a collision). Graph models (e.g. scene graphs) are frequently used to define such test scenarios along qualitative abstractions (relations) of concrete values and positions of scenario actors. Such formal representations particularly allow for the analysis of various properties at the level of test scenario suites through high-level metrics such as situation coverage [3], [4].
On the one hand, safety experts and standards typically express scenarios at a high level of abstraction using abstract relations between various actors to evaluate situation coverage of a test suite. On the other hand, modern traffic simulators (like CARLA [5] or DRIVE Sim [6]) necessitate concrete scenarios with exact numeric values provided for the various actors in order to evaluate the safety compliance of each test scenario. To derive such concrete test scenarios, first, an initial concrete scene needs to be derived from the abstract scenario representation. The initial scene is then augmented with concrete behavior before being run in simulation. As a key challenge, automated concretization of an initial scene takes an abstract scene specification with numerous high-level constraints as input, and automatically derives concrete scenes by providing concrete parameter values for each actor. Since the relevance of certain test scenarios may depend on the physical location (e.g. in case of geofencing for AVs), scene concretization parameterized within a designated geographical location is particularly challenging.
Graph model generation has been used extensively in research to derive models that satisfy high-level (abstract) constraints. Existing approaches may rely on logic solvers [7], metaheuristic search [8] or a dedicated graph solver [9]. However, such approaches derive abstract graph models as output without any numeric information, which is insufficient for scene concretization. To derive numeric solutions, modern model generators [2], [10] propose a hybrid search technique that integrates a back-end numeric reasoning tool. However, their use in traffic scene concretization is limited to generating instances of a specific type of traffic scene (selected a priori) over a simple pre-defined map.
Specialized traffic scene concretization approaches such as the state-of-the-art SCENIC tool [11] have been developed with a custom scene specification language to capture arbitrary abstract constraints over a custom road map (both given as input). However, the limited expressiveness of these languages (e.g. related to the allowed constraint structures) prevents adequate measurement of situation coverage as necessitated in safety standards for road vehicles [12], [13]. Furthermore, as shown in our paper, the underlying exploration strategies are not scalable enough to provide effective assurance for AVs aligned with these safety standards.
In this paper, we propose a scene concretization approach that automatically derives concrete scenes in accordance with an abstract scene specification as input. The specific contributions of the paper are as follows:
• (C1) Scene specification language: We propose an expressive (abstract) functional scene specification language with 4-valued partial model semantics that generalizes the SCENIC language [11] and enables static detection of inconsistencies at specification time.
• (C2) Mapping for abstract constraints: We define an extensible mapping from abstract (relational) constraints to corresponding numeric constraints to derive a numeric scene concretization problem.
• (C3) Integration of metaheuristic search: We formalize the scene concretization problem as a customizable optimization problem which we solve using metaheuristic search algorithms.
• (C4) Extensive evaluation: We evaluate eight configurations of our proposed approach over three realistic road maps to assess success rate, runtime and scalability compared to the SCENIC tool.
The rest of the paper is structured as follows: Section 2 summarizes core concepts related to traffic scenes and scenarios used in safety assurance of AVs. Section 3 introduces the novel, expressive abstract scene specification language. Section 4 presents the mapping from abstract to numeric constraints. Section 5 details the adaptation of metaheuristic search for traffic scene concretization. Section 6 provides evaluation results of our proposed approach for three case studies. Section 7 overviews related approaches available in the literature. Finally, Section 8 concludes the paper.

Traffic scenes and scenarios
A traffic scene is defined by Ulbrich et al. [14] as a snapshot of the environment, including the scenery and dynamic elements, as well as the relations between those entities.
• The scenery comprises the lane network, stationary components such as traffic lights and curbs, vertical elevation of roads and environmental conditions.
• Dynamic elements (or actors) include the various vehicles and pedestrians involved in a scene such as the ego vehicle. A scene may contain information about the state (e.g. position and speed) and attributes (e.g. vehicle color, whether a car door is open) of actors.
• Relations are defined between scenery elements and actors. For example, two vehicles may be far from each other, or a vehicle may be placed on a specific lane. See Section 3 for further details.
A sequence of consecutive traffic scenes together with related temporal developments corresponds to a scenario. A scenario is defined by an initial scene, followed by a sequence of actions and events performed by the actors according to individual goals and values. Actions and events may refer to traffic maneuvers (e.g. a lane change maneuver), while goals and values may be transient (e.g. reaching a certain area on a map) or permanent (e.g. driving in a safe manner).
Existing safety standards (e.g. ISO 26262-1 [12] and SOTIF [13]) place system-level safety requirements and restrictions on autonomous vehicles (AVs) under test. Such requirements are often formalized as high-level constraints between actors. Adherence to such safety requirements is often evaluated by using sophisticated traffic simulators like CARLA [5] or DRIVE Sim [6] which can only handle a lower-level representation of the investigated scenarios.

Levels of abstraction in traffic scenarios
Menzel et al. [15] define three abstraction levels to adequately describe traffic scenarios for simulating AVs [16].
(1) Functional Scenarios include abstract (qualitative) constraints pertaining to traffic concepts. For example, such abstract constraints may be used to describe geospatial concepts (e.g. two vehicles are close to or far from each other), causal concepts (e.g. vehicle A stopped moving because it encountered a red light) and temporal concepts (e.g. event A occurred before or after event B).
(2) Logical Scenarios refine the abstract constraints of functional scenarios into constraints over parameter ranges or intervals, optionally accompanied by probability distributions. For example, geospatial functional constraints may be refined to areas on a map, and temporal functional constraints may be refined to time intervals.
(3) Concrete Scenarios substitute concrete numeric values from the parameter ranges/intervals defined in a logical scenario. For example, concrete scenarios contain exact values for the position coordinates of actors, as well as exact times and durations for event executions.
Given a specific concrete scenario executable in a traffic simulator, any abstract relation in functional scenarios can be derived by (i) identifying relevant logical constraints (e.g. geospatial, temporal), and (ii) assigning the truth value of abstract relations accordingly by predicate abstraction.
Example 2: An initial traffic scene containing two actors $o_R$ and $o_B$ is described at various levels of abstraction in Figure 2 ($c_l$ and $c_u$ are constants). Functional and logical scenes define constraints (i.e. the problem), while the concrete scene defines an instance (i.e. a solution) that satisfies these constraints. The problem is exclusively comprised of geospatial constraints, since no dynamic behavior is involved in the initial scene. The functional scene is defined using the abstract language proposed in Section 3. The logical and concrete scenes are handled in Section 4, while Section 5 describes our solution to such problems.

Scenario-based testing by simulation
Scenario-based testing by simulation [11], [17], [18], [19] is commonly used to evaluate the adherence of AVs under test to traffic safety requirements. In line with the definition of a traffic scenario proposed in Section 2.1 [14], scenario-based test cases are composed of an abstract Initial Scene Specification (ISS), abstract Behavioral/Temporal Constraints (BTCons) over actors along with Evaluation Criteria (ECrit), often defined as oracles [20] based on safety requirements. For a test case to be executable in simulation, its abstract constraints must be concretized. As a first step, (1) the abstract ISS is refined into a concrete scene. Then, (2) concrete behaviors, such as exact trajectories to follow, are assigned to each actor in accordance with BTCons. Finally, (3) the concrete scenario is simulated, and its success (pass/fail) is evaluated according to ECrit. For instance, a test may be considered successful if the ego vehicle can navigate its assigned trajectory without colliding with any other actor.
In this paper, we exclusively focus on Step (1), i.e., the automated concretization of abstract ISSs into realistic initial scenes. This is an important aspect of AV testing as the initial scene may have a direct impact on the outcome of test execution even when all actors behave identically. For instance, consider two test scenarios with different initial scenes (ISSs) but identical actor behaviors. In both cases, the ego actor $o_{ego}$ and a non-ego actor $o_B$ are driving at a high speed inside their lane ($o_{ego}$ is following $o_B$) while another actor $o_R$ abruptly cuts in front of $o_B$, forcing $o_B$ to brake abruptly. If, according to the ISS, $o_{ego}$ is close behind $o_B$, these actors will collide. However, if the ISS specifies that $o_B$ and $o_{ego}$ are initially far from each other, the collision will be avoided since $o_{ego}$ will have time to slow down. Videos depicting the two test scenarios are included in an online publication page 1 dedicated to this paper.

Scene concretization in scenario-based testing
Our paper proposes a scene concretization approach where a functional-level (initial) scene specification is given as input and the concrete positioning of vehicles is constructed as output. For that purpose, abstract constraints specifying the functional scene are first mapped into an equivalent numeric problem (i.e. the logical scene) along the mapping of Section 4.2. Then, metaheuristic search (MHS) is used to derive a numeric solution (i.e. a concrete scene) that satisfies all related constraints.
Search-based test generation techniques [17], [19], [21], [22], [23] have been actively used to provide potentially dangerous concrete test scenarios as input for traffic simulators. As a general assumption of these approaches, a single search process is conducted to find concrete scene parameters and actor behavior that lead to potential danger. Our paper investigates scene concretization as a standalone subproblem of the complex challenge of scenario-based testing, which complements existing work in three key aspects.
• Our abstract scene representation makes it possible to evaluate the coverage of arbitrary automatically generated test scenarios with formal precision by qualitative abstraction for similar behavior of actors.
• When a potentially dangerous scenario is found by existing test generators, our approach can provide what-if analysis by deriving a diverse set of initial scenes to investigate similar behavior of actors. Such analysis can help better understand which contextual parameters of the scene itself can contribute to potential danger.
• Our approach makes it possible to investigate traffic scenarios in a realistic context by concretizing scenes in concrete map locations. Demonstrating safe behavior of an AV at a specific location can help geofencing, e.g. an AV is allowed to take a particular route on the map but not other routes.

FUNCTIONAL SCENE SPECIFICATION
Functional scene specifications (FSSs) are often captured by an abstract constraint language [11] that leverages qualitative abstractions [24], [25] of concrete scene attributes. In this paper, we adapt (4-valued) partial graph models as a FSS language using the syntax and formal semantics defined in [26]. As a key benefit over state-of-the-art traffic scene concretization approaches, e.g. SCENIC [11], partial models enable the detection of inconsistencies at the FSS level.

Scene specification language
Vocabulary: Objects in a partial model correspond to actors of a scene. The relations between actors are captured by a finite set of relation symbols $\Sigma = \{R_{pos} \cup R_{dist} \cup R_{vis} \cup R_{coll} \cup R_{road}\}$ grouped into 5 geospatial relation categories:
• $R_{pos} = \{\text{left}, \text{right}, \text{ahead}, \text{behind}\}$ are positional relations denoting the relative position of the target actor with respect to the heading of the source actor.
• $R_{dist} = \{\text{close}, \text{medDist}, \text{far}\}$ is a set of distance relations which qualitatively characterize the Euclidean distance between two actors (using x and y coordinates).
• $R_{vis} = \{\text{canSee}\}$ is the visibility relation to capture if the target actor is in the field of view of the source actor.
• $R_{coll} = \{\text{noColl}\}$ is the collision avoidance relation which denotes that two actors are positioned such that they are not overlapping (colliding).
• $R_{road} = \{\text{onRoad}\}$ represents the unary road placement relation which denotes that an actor is placed on a road segment of the map which can be used by vehicles.
The abstract relations listed above are adapted from the SCENIC specification language [11], and similar abstract relations have been proposed in [24], [25], [27]. To simplify presentation, abstract relations are restricted to binary relations as they are the most common in traffic scene specifications. Thus, the unary onRoad relation is represented as a (binary) self-loop relation. However, the proposed formalism can be generalized to n-ary relations and constraints, such as a ternary constraint specifying that the line of sight between actors $o_A$ and $o_B$ is obstructed by an actor $o_C$.
Our approach assumes that these relations can be derived from concrete scenes by qualitative abstractions. Note that our approach is independent of the included concrete relations; therefore, we can extend the set of relations by various sorts of abstractions from the concrete scenes. Moreover, we can also adjust relation categories accordingly.
Syntax and semantics: Given a vocabulary of geometric relations $\Sigma$, a FSS is a partial model $P = \langle O_P, I_P \rangle$, where $O_P$ is the finite set of objects (each object corresponds to an actor), and $I_P$ gives a 4-valued logic interpretation for each symbol $r \in \Sigma$ as $I_P(r) : O_P \times O_P \to \{\text{false}, \text{true}, \text{unknown}, \text{error}\}$, where unknown marks an unspecified value and error marks an inconsistent one.
A FSS consists of relation assertions, which explicitly assign a 4-valued truth value to a binary relation over a pair of objects [28], [29]. Syntactically, a relation assertion can be prefixed by the ? (unknown) and ! (false) symbols, while no prefix represents true.
When multiple assertions to the same relation instance exist, the interpretation value is obtained by the 4-valued information merge operator $\oplus$ [26], where contradictory information results in error while unspecified relations result in unknown. For instance, if a FSS contains both right($o_A$, $o_B$) and !right($o_A$, $o_B$), then the relation will be interpreted as $\text{true} \oplus \text{false} = \text{error}$. Such error values detect inconsistencies in the FSS, i.e. they detect sets of constraints that cannot be satisfied by any concrete scene. For instance, right($o_A$, $o_B$) and !right($o_A$, $o_B$) cannot hold at the same time for any pair of actors $\langle o_A, o_B \rangle$. Error values also arise from more complex inconsistencies when enforcing domain-specific validity rules.
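The merge behaviour described above can be illustrated with a minimal Python sketch. The names `TV`, `merge` and `interpret`, as well as the tuple encoding of assertions, are illustrative assumptions and are not taken from [26] or from any tool.

```python
from enum import Enum


class TV(Enum):
    """The four truth values of a partial model."""
    UNKNOWN = "unknown"
    TRUE = "true"
    FALSE = "false"
    ERROR = "error"


def merge(a: TV, b: TV) -> TV:
    """Information merge: unknown adds nothing, contradictions give error."""
    if a == b:
        return a
    if a is TV.UNKNOWN:
        return b
    if b is TV.UNKNOWN:
        return a
    return TV.ERROR  # true merged with false, or anything merged with error


def interpret(assertions):
    """Fold (relation, source, target, value) assertions into one interpretation."""
    interp = {}
    for rel, src, tgt, val in assertions:
        key = (rel, src, tgt)
        # Relations that were never asserted start out as unknown.
        interp[key] = merge(interp.get(key, TV.UNKNOWN), val)
    return interp


# right(oA, oB) and !right(oA, oB) contradict each other; close(oA, oB) does not.
fss = [
    ("right", "oA", "oB", TV.TRUE),
    ("right", "oA", "oB", TV.FALSE),
    ("close", "oA", "oB", TV.TRUE),
]
interp = interpret(fss)
```

In this sketch, the contradictory pair of assertions is merged into `TV.ERROR`, while the single `close` assertion stays `TV.TRUE`.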
Validity rules: For a partial model $P = \langle O_P, I_P \rangle$ to represent a valid FSS, $P$ must be refined according to five validity rules ($V_{road}$, $V_{loop}$, $V_{sym}$, $V_{pos}$, $V_{dist}$). $V_{road}$ states that all onRoad relations between two distinct actors are known to be false, since only self-looping onRoad relations are valid. $V_{loop}$ states that any self-loop relation (other than onRoad) is known to be false. $V_{sym}$ states that if a distance or collision avoidance relation $r$ holds, then the same relation in the opposite direction is known to be true. $V_{pos}$ states that if a positional relation $r_1$ holds between a given directed pair of actors, then all other positional relations $r_2$ between those actors are known to be false. $V_{dist}$ is analogous to $V_{pos}$, but is applied to distance relations.
In this paper, we provide a sound but incomplete set of validity rules. If the enforcement of these rules produces an error, then the scene specification is certainly inconsistent. However, if no error is produced, the scene specification is not guaranteed to be consistent.
The above validity rules are provided in accordance with the included abstract relations and they can easily be extended with new relations in the future. Additionally, custom validity rules may be defined to prevent semantic inconsistencies (i.e. physically infeasible specifications) or to enforce additional requirements (e.g. traffic laws). For instance, a custom validity rule $V_{cust}$ can capture that if a vehicle $b$ is behind a vehicle $a$, then $a$ cannot see $b$. All unspecified relations are originally interpreted as unknown. Additionally, positional and distance relations are included for every pair of actors (every oriented pair, in the case of positional relations). As such, according to the validity rules, all unspecified positional and distance relations are refined to true or to false.
For instance, the inclusion of close($o_G$, $o_B$) implies:
• $I_P(\text{close})(o_B, o_G) = \text{true}$ (from $V_{sym}$),
• $I_P(\text{medDist})(o_G, o_B) = \text{false}$, $I_P(\text{far})(o_G, o_B) = \text{false}$ (from $V_{dist}$), and
• $I_P(\text{medDist})(o_B, o_G) = \text{false}$, $I_P(\text{far})(o_B, o_G) = \text{false}$ (from both $V_{sym}$ and $V_{dist}$).
Validity rules enable the detection of inconsistencies in the FSS. For instance, had the FSS also included far($o_G$, $o_B$), the application of $V_{dist}$ would result in an inconsistency, since the asserted true value is merged with the derived false value: $I_P(\text{far})(o_G, o_B) = \text{true} \oplus \text{false} = \text{error}$.
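As a rough illustration of how validity rules refine a partial model, the following Python sketch applies $V_{sym}$ and $V_{dist}$ until a fixpoint is reached. It is a sketch under assumptions: the dictionary encoding, the string truth values and the function names are hypothetical, and the remaining rules ($V_{road}$, $V_{loop}$, $V_{pos}$) are omitted.

```python
DIST = ("close", "medDist", "far")
SYMMETRIC = DIST + ("noColl",)  # distance and collision avoidance relations
U, T, F, E = "unknown", "true", "false", "error"


def merge(a, b):
    """4-valued information merge (contradictions become error)."""
    if a == b:
        return a
    if a == U:
        return b
    if b == U:
        return a
    return E


def refine(interp):
    """Apply V_sym and V_dist repeatedly until no value changes."""
    interp = dict(interp)
    changed = True
    while changed:
        changed = False
        derived = []
        for (rel, a, b), val in interp.items():
            if val != T:
                continue  # the rules only fire for relations known to hold
            if rel in SYMMETRIC:
                derived.append(((rel, b, a), T))  # V_sym
            if rel in DIST:
                # V_dist: every other distance relation on the pair is false
                derived.extend(((other, a, b), F) for other in DIST if other != rel)
        for key, val in derived:
            new = merge(interp.get(key, U), val)
            if new != interp.get(key, U):
                interp[key] = new
                changed = True
    return interp


consistent = refine({("close", "oG", "oB"): T})
inconsistent = refine({("close", "oG", "oB"): T, ("far", "oG", "oB"): T})
has_error = any(v == E for v in inconsistent.values())
```

With only close(oG, oB) asserted, the sketch derives the same true and false values as listed above; adding far(oG, oB) produces an error value, so `has_error` is true and a solver call could be skipped.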

Static analysis of scene specifications
Inconsistency detection: SCENIC compiles a functional scene specification into logical constraints, most of which define probability distributions over areas of the road map, and attempts to solve them by using rejection sampling [11]. A main benefit of the 4-valued semantics of partial models introduced in this paper for FSSs is the ability to detect semantic inconsistencies statically (at specification time). When contradictory assignments are given to a particular relation, they are merged automatically into the error value, which can be detected easily. Note that such a static detection of inconsistent specifications is a unique feature of our technique compared to related FSS approaches (e.g. SCENIC).
A sound qualitative abstraction from logical constraints to abstract relations (to be discussed in Section 4) ensures that whenever there is a (concrete) solution to the logical constraints, then the respective abstract relations will also evaluate to true. Consequently, our 4-valued static analysis technique also guarantees that if an inconsistency is detected in the FSS (by the error value), then no concrete solutions may exist for the logical constraints. In such a case, the underlying solver does not need to be called at all, which can result in significant time savings. Unsurprisingly, our static analysis technique does not have completeness guarantees, i.e. there may be a set of inconsistent logical constraints which cannot be detected on the abstract level.
Restrictions of the SCENIC FSS language: While our FSS language builds on the scene specification language of SCENIC, it is important to note that the original SCENIC language has limitations with respect to (i) error detection capabilities, as discussed above, as well as (ii) soundness.
Due to limitations of the underlying scene concretization approach, the SCENIC language rejects certain valid (consistent) constraint structures, which limits the soundness of the approach. For example, actors may be the target of at most one positional relation, pointing from an already instantiated actor. As such, positional relations may only form a tree structure (and not an arbitrary graph), which, in certain cases, is not enough to formally distinguish two semantically different scenes. Figure 4 depicts a concrete scene that satisfies the SCENIC-expressible subset. When compared to Figure 1, which satisfies the full FSS proposed in Figure 3a, the benefits of a more expressive language become apparent. None of the relations excluded from the SCENIC-expressible subset are satisfied by the scene depicted in Figure 4. As such, these two semantically different scenes cannot be formally distinguished using the default SCENIC FSS language, as they would both be represented by the FSS shown in Figure 3b.

C1
We propose a high-level traffic scene representation language with 4-valued partial model semantics. This enables static detection of inconsistencies at specification time, which is not offered by related FSS approaches.

LOGICAL SCENES AS NUMERIC PROBLEMS
The scene concretization problem can be represented as a numeric constraint satisfaction problem over actors on a logical-level scene. We introduce a formalization for logical-level scenes, and propose a novel mapping from a FSS to a corresponding numeric constraint satisfaction problem.
As a key benefit, our mapping is extensible to take any abstract functional relation as input and yields customizable numeric constraints as output. Additionally, our approach can be contextualized in any underlying road map. Existing mappings of this kind often either provide restricted numeric constraints as output, or they are limited to approach-specific input constraints defined over a simplistic road map.

Numeric concretization problem
Formalization: A logical scene defines a numeric rectangle layouting problem that yields a concrete scene as solution. Formally, such a numeric problem $N$ corresponds to a tuple $N = \langle A_N, C_N, m_N, D_N \rangle$, where:
• $A_N$ is a finite set of actors where each $a_i \in A_N$ is an oriented rectangle defined as a 5-tuple (see below),
• $C_N$ is a set of binary numeric (geometric) constraints $c_i(a_0, a_1)$ over actors (oriented rectangles) $a_i \in A_N$,
• $m_N$ is a road map that restricts the range of position variables for actors in $A_N$, and
• $D_N$ contains valid bounding box sizes (width, length) of actors as a finite set of floating-point pairs $\langle w_i, l_i \rangle$.
Note that we only handle binary numeric constraints in this paper to stay consistent with the functional-level binary relations proposed in Section 3.1. Nevertheless, our formalization can be generalized to n-ary constraints.
We approximate actors $a_i \in A_N$ as oriented rectangles over a map $m_N$ given as input. Formally, an actor $a_i$ is represented by a tuple $a_i = \langle x, y, h, w, l \rangle$, where:
• $x$ and $y$ are (floating-point) variables that represent the center point of $a_i$ (where $x \in [0, m_N.x]$ and $y \in [0, m_N.y]$). Here, $m_N.x$ and $m_N.y$ respectively represent the width and length of the map $m_N$,
• $h$ is a (floating-point) variable for the heading angle of $a_i$ (in radians, i.e. $h \in [-\pi, \pi]$),
• $w$, $l$ are (floating-point) variables that represent the width and length of $a_i$ (where $\langle w, l \rangle \in D_N$).
Deriving a numeric problem: A numeric problem $N = \langle A_N, C_N, m_N, D_N \rangle$ is derived from a partial model $P = \langle O_P, I_P \rangle$ of a FSS through a mapping $\mathit{fl} : P \mapsto N$ between functional relations and numeric constraints such as the one proposed in Section 4.2. We derive $\mathit{fl}$ by
(1) providing the map $m_N$ and possible actor bounding box sizes $D_N$ as external inputs (parameters),
(2) mapping each abstract object $o_i \in O_P$ to a corresponding numeric actor $a_i \in A_N$,
(3) populating $C_N$ such that
• for every positive relation $r \in \Sigma$ over objects $o_i \in O_P$ (i.e. where $I_P(r)(o_0, o_1) = \text{true}$), a corresponding numeric constraint $c_i(a_0, a_1)$ (as defined in Section 4.2) is included in $C_N$,
• for every negative relation $r \in \Sigma$ over objects $o_i \in O_P$ (i.e. where $I_P(r)(o_0, o_1) = \text{false}$), the negation of the corresponding numeric constraint is included in $C_N$.
A solution $s_N$ of $N$ is a value assignment of the variables associated to all actors $a_i \in A_N$ (within the respective ranges) such that all constraints $c_i \in C_N$ are satisfied. The numeric values assigned to variables represent a concrete scene which is a concretization of the FSS corresponding to $P$.
The concrete solution $s_N$ of a numeric problem $N$ can be abstracted into the partial model $P_s$ (of a functional scene) using the same set of logical constraints. The numeric constraint corresponding to each instance of relation $r \in \Sigma$ is evaluated on $s_N$. Each constraint evaluation yields a Boolean truth value (true/false). The Boolean values derived from such a mapping $\mathit{lf} : s_N \mapsto P_s$ define a concrete partial model that does not contain unknown values.
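The derivation of a constraint set from a partial model can be sketched in Python as follows. This is only an illustration under assumptions: the distance thresholds `C_L` and `C_U`, the concrete definitions of the three distance predicates, and all function names are hypothetical stand-ins for the constraint definitions of Section 4.2, and only distance relations are covered.

```python
import math
from dataclasses import dataclass


@dataclass
class Actor:
    """Oriented rectangle <x, y, h, w, l>."""
    x: float
    y: float
    h: float
    w: float
    l: float


def dist(a: Actor, b: Actor) -> float:
    return math.hypot(a.x - b.x, a.y - b.y)


# Hypothetical thresholds (in metres) separating close / medDist / far.
C_L, C_U = 10.0, 20.0

# Assumed numeric counterparts of the distance relations.
NUMERIC = {
    "close": lambda a, b: dist(a, b) <= C_L,
    "medDist": lambda a, b: C_L < dist(a, b) <= C_U,
    "far": lambda a, b: dist(a, b) > C_U,
}


def derive_constraints(interp):
    """True relations add their constraint, false ones add its negation."""
    constraints = []
    for (rel, src, tgt), val in interp.items():
        pred = NUMERIC[rel]
        if val == "true":
            constraints.append((src, tgt, pred))
        elif val == "false":
            constraints.append((src, tgt, lambda a, b, p=pred: not p(a, b)))
        # unknown relations impose no numeric constraint
    return constraints


def is_solution(constraints, scene):
    """A scene (object name -> Actor) is a solution iff every constraint holds."""
    return all(c(scene[s], scene[t]) for s, t, c in constraints)


interp = {
    ("medDist", "oB", "oR"): "true",
    ("close", "oB", "oR"): "false",
    ("far", "oB", "oR"): "unknown",
}
constraints = derive_constraints(interp)
scene_ok = {"oR": Actor(0.0, 0.0, 0.0, 2.0, 5.0), "oB": Actor(15.0, 0.0, 0.0, 2.0, 5.0)}
scene_bad = {"oR": Actor(0.0, 0.0, 0.0, 2.0, 5.0), "oB": Actor(5.0, 0.0, 0.0, 2.0, 5.0)}
```

Here `scene_ok` places the two actors 15 m apart and satisfies both derived constraints, whereas `scene_bad` places them 5 m apart and violates the medDist constraint.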
Soundness of a numeric solution: In the appendix, we prove the soundness of our approach, as captured in THEOREM 1.
THEOREM 1: For a numeric problem $N = \mathit{fl}(P)$ derived from a FSS, the concrete partial model $P_s$ abstracted from a solution $s_N$ of $N$ as $P_s = \mathit{lf}(s_N)$ satisfies all relations in $P$ (formally, $P_s$ refines $P$, i.e. $P \subseteq P_s$ [26]).
Example 6: The scene concretization problem in Figure 2 is defined for 2 actors $o_R$ and $o_B$, and 3 functional relations: the positive relations ahead($o_R$, $o_B$) and medDist($o_B$, $o_R$) (visualised as arrows), and the negative relation !right($o_B$, $o_R$). For better presentation, identical colors relate actors and constraints on different levels, and numeric actor representations on logical and concrete levels exclude width and length variables. A concrete scene (solution) is defined as an assignment of actor variables that satisfies the logical constraints.

Mapping functional relations to logical constraints
In this section, we describe the formal mapping from functional relations in a partial model (i.e. relation symbols in the abstract vocabulary $\Sigma$) to corresponding logical (numeric) constraints. A visualisation of these constraints over actors is shown in Figure 5. Our mapping builds on but also generalizes previous work by Menzel et al. [25].

Collision avoidance relations:
We reuse the definition of collision avoidance proposed in the SCENIC framework [11]. Informally, the noColl($o_A$, $o_B$) relation states that the areas within the bounding boxes of $a_A$ and of $a_B$ do not intersect. Bounding box analysis is outside the scope of this paper; therefore, we define collision avoidance through a call to the SCENIC library function intersects($a_A$, $a_B$).

Road placement relations:
We also rely on the SCENIC library functions to define road placement relations. Informally, onRoad($o_A$, $o_A$) states that the four corners of $a_A$ (i.e. $a_A$.corners) must be located on (contained by) a road that is part of the scene map $m_N$ provided as input. The scene map is segmented into multiple connected roads which are represented as complex polygons (e.g. curved roads, or roads with varying width). Given a road segment $r_{map}$ that is part of the scene map, the contains($r_{map}$, $a_A$.c) function call checks whether the point $a_A$.c, which is a corner point of $a_A$, is placed within the bounds of the polygon represented by $r_{map}$. Once again, polygon detection is outside the scope of this paper; therefore, we refer to SCENIC library functions.
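To make the corner-containment idea concrete, the following self-contained Python sketch computes the corners of an oriented rectangle and tests them against road polygons with a standard ray-casting check. It does not use or reproduce the SCENIC library: the functions `corners`, `contains` and `on_road` are hypothetical, the length of an actor is assumed to run along its heading, and each corner is assumed to be allowed on any road segment of the map.

```python
import math


def corners(x, y, h, w, l):
    """Corner points of an oriented rectangle centred at (x, y), heading h (radians)."""
    # Assumption: length l runs along the heading, width w is perpendicular to it.
    ch, sh = math.cos(h), math.sin(h)
    points = []
    for dl, dw in ((1, 1), (1, -1), (-1, -1), (-1, 1)):
        lx, ly = dl * l / 2, dw * w / 2
        points.append((x + lx * ch - ly * sh, y + lx * sh + ly * ch))
    return points


def contains(polygon, point):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal line through the point
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def on_road(actor, roads):
    """True iff every corner of the actor lies inside some road polygon."""
    return all(any(contains(road, c) for road in roads) for c in corners(*actor))


# A hypothetical straight road segment, 100 m long and 8 m wide.
roads = [[(0.0, 0.0), (100.0, 0.0), (100.0, 8.0), (0.0, 8.0)]]
centered = (50.0, 4.0, 0.0, 2.0, 5.0)    # x, y, h, w, l
off_edge = (50.0, 7.5, 0.0, 2.0, 5.0)    # two corners stick out past y = 8
rotated = (50.0, 4.0, math.pi / 2, 2.0, 5.0)
```

In this sketch, `centered` and `rotated` are placed fully on the road, while `off_edge` is rejected because two of its corners fall outside the polygon.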

Benefits of the mapping
While our mapping is based on existing research [25], it conceptually extends this baseline in three different aspects.
First, thanks to the use of partial models as FSSs, our mapping defined in Figure 5 is extensible to arbitrary qualitative abstractions from a concrete scene to an abstract scene on the functional level. As such, we can seamlessly incorporate additional relations proposed by safety experts that can be observed and measured over a concrete model. Second, our approach maps abstract functional relations to numerical constraints over 2-dimensional space with no geometric assumptions. Thus, our mapping is independent of the underlying road map, and it can be contextualized to any real physical location on a map. Existing $\mathit{fl}$ mappings [1], [2] are typically hard-coded to the specific geometry of a particular (often simplistic) road map.
Finally, our mapping is customizable to the needs of a given scene. For instance, parameters such as the thresholds for closeness, or the angle for the field of view of an actor can be adjusted according to its type (e.g. pick-up truck or pedestrian). Such flexibility allows the generation of more realistic scenes compared to existing approaches that map functional relations to numeric constraints.

C2
We provide an extensible and customizable mapping from abstract functional relations to numeric constraints, which can incorporate arbitrary qualitative abstractions, and can be contextualized in any physical location of a map.

USING MHS TO DERIVE CONCRETE SCENES
To derive a concrete scene from a FSS, the numeric scene concretization problem defined in Section 4 needs to be solved. Since exact algorithms, such as quadratic solvers, have failed to provide scalable results for similar problems [2], in this paper, we adopt metaheuristic search (MHS) algorithms, which are commonly used in the domain of AV testing [1], [17], [21], [22], [23], [30].
Formalization: A numeric scene concretization problem $N$ is mapped to a metaheuristic minimization (MIN) problem that can be solved by a MHS algorithm. A MIN problem is formalized as $M_{min} = \langle V_M, OF_M \rangle$, where:
• $V_M$ represents a set of variables $\{v_{1,1}, \dots, v_{m,5}\}$, where $m$ is the number of actors (represented as 5-tuples) in $N$. The domain of each variable $v_i \in V_M$ is defined by a numeric range $[l_i, b_i]$.
• $OF_M$ represents a set of objective functions $\{OF_1, \dots, OF_n\}$ defined over variables in $V_M$ that return non-negative numeric values.
At each iteration of the MHS algorithm, a set of candidate solutions is derived, which are assignments for each $v_i \in V_M$ to a value within the corresponding range. A candidate solution is a valid solution $s_{min}$ to $M_{min}$ iff all objective functions are minimal (i.e. zero). To resolve potential issues with the precision of floating-point variables, we only require that the value of each objective function is below a given positive threshold.
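The acceptance condition for a candidate solution can be sketched as follows. The function names, the default threshold of `1e-6` and the toy lane-keeping objective are illustrative assumptions; the source does not prescribe a particular threshold value.

```python
def in_ranges(candidate, ranges):
    """Every variable must lie within its [lower, upper] range."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(candidate, ranges))


def is_valid_solution(candidate, ranges, objectives, threshold=1e-6):
    """Valid iff the candidate is in range and every objective is (almost) zero."""
    if not in_ranges(candidate, ranges):
        return False
    return all(objective(candidate) < threshold for objective in objectives)


def lane_violation(candidate):
    """Toy non-negative objective: how far y lies outside the band [2, 6]."""
    _, y = candidate
    return max(0.0, 2.0 - y) + max(0.0, y - 6.0)


ranges = [(0.0, 100.0), (0.0, 8.0)]   # hypothetical ranges for x and y
inside = is_valid_solution((10.0, 4.0), ranges, [lane_violation])
outside = is_valid_solution((10.0, 7.0), ranges, [lane_violation])
out_of_range = is_valid_solution((10.0, 9.0), ranges, [lane_violation])
```

Comparing against a small positive threshold rather than exact zero tolerates rounding noise in the objective values.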
As is the case for existing research [31], [32], [33], we propose an unconstrained MIN problem formalization: all logical-level constraints are directly incorporated into objective functions. As such, no additional handling of constraints is required by the underlying MHS algorithm.
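The validity check described above can be illustrated with a minimal Python sketch. This is not the implementation used in the paper: the names `is_valid_solution`, `EPSILON` and `gap_objective`, as well as the threshold value, are hypothetical and chosen only for illustration. A candidate is a flat list of variable values, and it is accepted when it respects the variable ranges and every objective function falls below the threshold.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed threshold value; the text only requires it to be positive.
EPSILON = 1e-6

Objective = Callable[[Sequence[float]], float]


def in_bounds(candidate: Sequence[float], bounds: List[Tuple[float, float]]) -> bool:
    """Check that every variable lies within its numeric range [l_i, b_i]."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(candidate, bounds))


def is_valid_solution(candidate, bounds, objectives: List[Objective], eps: float = EPSILON) -> bool:
    """A candidate is valid iff it is in range and all objectives are (almost) zero."""
    if not in_bounds(candidate, bounds):
        return False
    return all(of(candidate) < eps for of in objectives)


def gap_objective(x: Sequence[float]) -> float:
    """Toy objective: zero once x[1] is at least 5 units larger than x[0]."""
    return max(0.0, 5.0 - (x[1] - x[0]))


bounds = [(0.0, 100.0), (0.0, 100.0)]
```

Because constraints are folded into the objective functions, the search algorithm itself only needs the range bounds and the objective values.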
Deriving a MIN problem: Given a numeric problem $N = \langle A_N, C_N, m_N, D_N \rangle$ representing a logical scene as input, we define a corresponding metaheuristic minimization problem $M_{min} = \langle V_M, OF_M \rangle$ as follows: [1] $V_M$ collects the variables $v \mid a_i.v$ that define all actors
Each aggregation strategy partitions the set of numeric constraints $C_N$ into distinct subsets $P_i \subseteq C_N$ that contain all constraints associated with a specific objective function. For instance, category aggregation would create a subset for each functional relation category, and actor aggregation would create a subset for each actor.
The choice of objective functions influences which search algorithms are appropriate. For instance, single-objective optimization algorithms (e.g. a genetic algorithm) are ideal for global aggregation, while multi-objective (e.g. NSGA-II [34]), many-objective (e.g. NSGA-3 [35]) or custom optimization algorithms are ideal for other aggregation strategies.
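To make the partitioning concrete, the following Python sketch groups a list of constraints under the global, category, actor and no-aggregation strategies. The `Constraint` class, the category labels and the `partition` function are hypothetical. In particular, assigning a binary constraint to its first actor under actor aggregation is an assumption made here to keep the subsets disjoint; the text above does not prescribe it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple


@dataclass(frozen=True)
class Constraint:
    category: str             # e.g. "road", "coll", "pos", "dist", "vis"
    actors: Tuple[str, ...]   # actors the constraint refers to


def partition(constraints: List[Constraint], strategy: str) -> Dict[Hashable, List[Constraint]]:
    """Split constraints into disjoint subsets, one per objective function."""
    groups = defaultdict(list)
    for index, c in enumerate(constraints):
        if strategy == "global":
            key = "all"            # a single objective function
        elif strategy == "category":
            key = c.category       # one objective per relation category
        elif strategy == "actor":
            key = c.actors[0]      # assumption: owned by the first actor
        elif strategy == "none":
            key = index            # one objective per constraint
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        groups[key].append(c)
    return dict(groups)


constraints = [
    Constraint("road", ("a",)),
    Constraint("road", ("b",)),
    Constraint("coll", ("a", "b")),
    Constraint("pos", ("b", "a")),
]
```

The number of resulting subsets is what determines whether a single-objective, multi-objective or many-objective algorithm is a natural fit.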
Formally, given a subset $P_i$ of numeric constraints, the corresponding objective function is defined in terms of the following components:
• $W_{P_i}()$ is an arbitrary (non-negative) weight function that does not modify the minima.
• $DF_i(c)$ is the output of a distance function measured over a single constraint $c$, as detailed in Figure 7.
Our objective function is a weighted aggregation of distance functions. Different weights can help fine-tune different characteristics of solutions (e.g. realism, diversity).
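As a rough illustration of a weighted aggregation of distance functions, the Python sketch below assumes that the per-constraint distances are summed and then passed through a weight function. Both the summation and the example distance function `dist_min_distance` are assumptions made for this sketch; the distance functions actually used are the ones detailed in Figure 7.

```python
import math
from typing import Callable, Iterable, Tuple

Point = Tuple[float, float]


def dist_min_distance(a: Point, b: Point, min_dist: float) -> float:
    """Hypothetical distance function: zero once a and b are at least min_dist apart."""
    return max(0.0, min_dist - math.hypot(a[0] - b[0], a[1] - b[1]))


def objective(distances: Iterable[float],
              weight: Callable[[float], float] = lambda total: total) -> float:
    """Aggregate per-constraint distances, then apply a weight function.

    The weight function must be non-negative and map zero to zero so that
    the minima of the objective are unchanged.
    """
    return weight(sum(distances))
```

For example, `objective([1.0, 2.0], weight=lambda t: 2 * t)` doubles the penalty of a subset without changing which candidates reach zero.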
Soundness: The soundness of our approach is captured by Theorem 2 (with a formal proof in the appendix):
THEOREM 2: For a MIN problem $M_{min}$ derived from a numeric problem $N$, a solution $s_{min}$ to $M_{min}$ is also a solution to $N$ (i.e. all numeric constraints are satisfied).

C3
We solve scene concretization as a metaheuristic minimization problem where objective functions are derived from various aggregation strategies.

EVALUATION
We conducted various measurements to address the following research questions:
RQ1: Which MHS configuration provides the best scene concretization results in terms of success rate and runtime?
RQ2: How does our approach compare to state-of-the-art scene concretization approaches with respect to success rate (RQ2.1) and runtime (RQ2.2) wrt. different maps, and success rate wrt. an increasing number of actors (RQ2.3)?
RQ3: How does our approach scale/fail wrt. an increasing number of constraints?
RQ4: How does our approach scale/fail when concretizing scenes with a large number of actors?

Case studies
To answer these questions, we execute scene concretization campaigns over three road maps.

CARLA:
The CARLA simulator framework [5] includes multiple road networks alongside realistic depictions of the surrounding environment. We perform experiments over the Town02 road map (215 × 217 units²) included in CARLA, which represents a simple town consisting of "T junctions".

ZALAZONE:
The ZalaZONE Automotive Proving Ground [36] is a physical test track located in Zalaegerszeg, Hungary, designed to conduct experiments related to the safety assurance of AVs. The test track consists of various subsections adapted to the testing of different facets of AVs. We perform experiments over a digital twin of the Smart city portion of the test track (270 × 474 units²), which features a focused urban layout with multiple complex intersections, roundabouts, parking lots and curved roads.

TRAMWAY:
We also perform experiments over a real-life road network (258 × 181 units²) in Budapest, Hungary. A digital twin for this road network is derived together with an industrial partner using the public OpenStreetMap database. This road network features many multi-lane, complex intersections located over a dedicated tramway lane.

Compared approaches
We compare our proposed approach with three variations of the baseline Scenic approach. We do not include manual or semi-automatic scene concretization in the comparison because, despite being conceptually simple for humans, manually concretizing scenes into simulation-ready representations (1) is very time-consuming, as it relies on trial and error, and (2) does not ensure formal correctness. Automated scene concretization approaches, such as the ones presented below, address both difficulties by (1) automatically synthesizing simulator-friendly scenes (2) that are guaranteed (by the underlying algorithm) to satisfy the mathematical definitions of the abstract constraints.
Scenic: As a baseline reference, we concretize FSSs represented through the SCENIC [11] specification language. For this purpose, we use the integrated approach based on rejection sampling proposed by the framework. However, Scenic has expressiveness limitations (i.e. certain scene specifications cannot be expressed, see Section 3.2). Therefore, we identify three variations of the Scenic approach which use different constructs to represent abstract relations. In our experiments, we remove the minimum number of relations to make the resulting scene expressible.
• SceDef (Default): Positional (and accompanying distance) relations are represented by the built-in constructs (referred to in the framework as VENEERs) at actor definition time, which restricts the sample space to a line segment on the map relative to the position of other actors. An example of the numeric constraint corresponding to a VENEER is given in Figure 2 for the
Additionally, the only expressiveness restriction is that cycles of positional relations cannot be represented.
Once a FSS is defined, the functional scene is concretized by rejection sampling. Collision avoidance and road placement relations are included as additional acceptance conditions. However, to ensure that the approach can solve the entire scene concretization problem (i.e. not only the expressible subset of the problem), we check whether a derived concretization satisfies the removed, inexpressible relations. This is implemented as an additional acceptance condition for the concretized scene that is evaluated after the termination of the default sampling approach.

Table 1 (caption): ... [37], Polynomial Mutation (PM) [37], the Das-Dennis (DD) [38] approach for determining the number of reference directions $n_{refDirs}$, Binary Tournament (BT), and Non-Dominated Sorting (NDS). Furthermore, pr refers to probability and η refers to the distribution index, while in our implementation, $n_{var} = 2 \times n_{actors}$.

MHS: We implement the MHS-based approach proposed in this paper using various objective function aggregation strategies. Aside from the global G, category C and actor A strategies proposed in Section 5, we also implement:
• weighted category aggregation WC (a variant of C)
• weighted dependency aggregation WD (two objective functions are defined according to the dependency structure of constraint types (see Section 6.6): collision avoidance and road placement combine for one objective function, while the remaining constraint types form the other)
• no aggregation ø (each constraint in the FSS defines a separate objective function).
In case of WC and WD, higher weight is given to constraint categories that have fewer dependencies (i.e. collision avoidance and road placement constraints). Specifically, the objective functions for WC are defined separately for two groups of categories: one definition where $i$ is road and coll respectively, and one for $OF_{pos}$, $OF_{dist}$ and $OF_{vis}$, where $i$ is pos, dist and vis respectively. The objective functions for WD are:

and
where $j$ represents the relation category of $c$, and $c \in R_i$ is shorthand for $c \in C_N \mid c = fl(r(o_A, o_B)) \wedge r \in R_i$.
We evaluate our approach using three underlying MHS algorithms through the PYMOO Python library [39]:
• a single-objective genetic algorithm GA
• a multi-objective NSGA-II algorithm N2 [34]
• a many-objective NSGA-3 algorithm N3 [35]
Table 1 provides an overview of the relevant hyperparameters and genetic operators used as part of our experimentation. Genetic operators are selected according to the default PYMOO settings. Population size and number of offspring are selected according to preliminary measurements, which are included on the publication page. While small population sizes are rather unusual for MHS algorithms, we believe their good performance is attributed to the particularities of our experimental setup (e.g. MHS approaches are often designed to provide multiple partial solutions, whereas in our experimentation a single, complete solution is retrieved).
The implemented MHS configurations handle all functional constraints of our scene specification language, and handle logical constraints of Section 4. As such, input scenes do not need to be adjusted, as is the case for Scenic approaches.A comparison of relation representations, and the handling of relation structures is provided in Figure 8. Furthermore, to stay consistent with the baseline Scenic approaches, (1) our experimentation runs are terminated once a single solution to the problem defined by the FSS is found, and (2) particular handling for same-scene concretization tasks is not implemented.

General measurement setup
To evaluate various concretization approaches, we randomly generated FSSs to be used as input:
[1] Given a number of required actors, we create a preliminary FSS $P_{pre}$ used to derive the input FSSs for our experiments. $P_{pre}$ contains road placement constraints for each actor and collision avoidance constraints for each pair of actors, to yield realistic concrete scenes. Additionally, a maximum distance $r$ between actors is established to avoid randomly generating scenes where no relations exist between actors.
[2] We use the default sampling-based scene concretization approach proposed in SCENIC [11] to derive a numeric solution $s_{pre}$ for a numeric problem $N_{pre} = fl(P_{pre})$. Note that different runs yield different numeric solutions for the same $P_{pre}$ given as input.
[3] We derive a FSS $P_{in}$ by applying qualitative abstractions over $s_{pre}$ (i.e. $P_{in} = lf(s_{pre})$). Qualitative abstractions are applied for all relation categories. As such, any positional, distance and visibility relations that hold in $s_{pre}$ are included in $P_{in}$. Additionally, we know that $P_{in}$ is not contradictory, as it has at least one feasible solution ($s_{pre}$).
[4] We use $P_{in}$ as input for our measurements.
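The abstraction step [3] can be sketched in Python for the distance category alone. The relation names `close`, `medium` and `far`, the threshold values and the function `abstract_scene` are hypothetical placeholders, not the definitions used by our specification language. The sketch only illustrates why the derived specification is feasible by construction: every relation it contains was observed in the concrete scene it was derived from.

```python
import math
from itertools import permutations
from typing import Dict, Set, Tuple

# Hypothetical distance thresholds (in map units), for illustration only.
CLOSE_MAX = 10.0
MEDIUM_MAX = 20.0

Point = Tuple[float, float]


def distance_relation(p: Point, q: Point) -> str:
    """Abstract the numeric distance between two positions into a label."""
    d = math.dist(p, q)
    if d <= CLOSE_MAX:
        return "close"
    if d <= MEDIUM_MAX:
        return "medium"
    return "far"


def abstract_scene(positions: Dict[str, Point]) -> Set[Tuple[str, str, str]]:
    """Map a concrete scene to (relation, source, target) triples over ordered actor pairs."""
    return {
        (distance_relation(positions[a], positions[b]), a, b)
        for a, b in permutations(sorted(positions), 2)
    }


positions = {"ego": (0.0, 0.0), "car1": (3.0, 4.0), "car2": (0.0, 30.0)}
relations = abstract_scene(positions)
```

Positional and visibility relations would be abstracted analogously, each with its own numeric definition.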
Throughout our measurements, for simplicity, we consider actors with pre-defined length and width.Additionally, we refer to the road network to determine the expected heading at a given position in the map.We assume that each position has a single expected heading (we avoid non-determinism at intersections, where vehicles may take multiple paths that cross each other).As such, our scene concretization runs derive the position coordinates of each actor, and the heading is determined accordingly.
We performed the measurements on an enterprise server². Measurements are run in a Python environment, and the garbage collector is called explicitly between runs. Furthermore, generated FSSs and concrete scenes, alongside corresponding measurements and figures, are included on the publication page.

RQ1: Comparison of MHS configurations
Measurement setup: This experiment aims to determine which combination of optimization algorithm and objective function aggregation strategy provides the best results when implementing our proposed approach. We perform measurements over the TRAMWAY map (which is shown in RQ2 to be of intermediate difficulty for the evaluated approaches) using scenes with 2, 3 and 4 actors (size) to compare 8 MHS configurations. We exclude larger scene sizes to ensure higher success rates and to enable cross-configuration comparisons. Each evaluated configuration is composed of an objective function aggregation strategy and a MHS algorithm selected accordingly. Single-objective optimization GA is used with G, which yields one objective function. Multi-objective optimization N2 is used for fewer than 4 objective functions: A, WD. Many-objective optimization N3 is used for more than 3 objective functions: C, WC and ø. Additionally, we evaluate the N2-WC configuration (as preliminary measurements have provided promising results) and the N3-A configuration (as the number of objective functions increases with the number of actors).

2. 12 × 2.2 GHz CPU, 64 GiB RAM, CentOS 7, Java 1.8, 12 GiB Heap
For each scene size, we randomly generate 10 FSSs as inputs (see Section 6.3), and run each approach 10 times, for a total of 100 runs.A 10-minute time-out is set for each run.
Analysis of results: Success rate and runtime measurements comparing the MHS configurations are provided in Figure 9. Each configuration is depicted with a uniquely colored bar (success rate) and box (runtime). We depict the cumulative success rate of our experiments: scene-level success rate is addressed in RQ2.3.
We determine which configuration is best suited for our proposed approach by evaluating the statistical significance of our results. For success rate measurements, we perform the Fisher exact test [40] to determine the p-value and measure the odds ratio [41] for effect size, as suggested in existing guidelines [42] for comparing algorithms with dichotomous outcomes (i.e. success or failure). Analysis shows that (despite underperforming for 3-actor scenes) the N2-A configuration provides a better success rate with statistical significance (p < 0.05) compared to all other configurations except for N2-WC and N3-ø. However, effect size is not large for our measurements (between 1.845 and 4.378).
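For readers unfamiliar with these statistics, the following self-contained Python sketch computes a two-sided Fisher exact test and an odds ratio for a 2×2 table of successes and failures. It is an illustration written with the standard library only, not the analysis script used for our measurements. The smoothing constant `rho` is an assumed continuity correction that avoids division by zero when a configuration never fails (or never succeeds).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided p-value for the table [[a, b], [c, d]].

    Rows are the two configurations; columns are successes and failures.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x successes in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


def odds_ratio(a: int, b: int, c: int, d: int, rho: float = 0.5) -> float:
    """Odds of success in row 1 divided by odds of success in row 2."""
    return ((a + rho) * (d + rho)) / ((b + rho) * (c + rho))
```

With 100 runs per configuration, a call such as `fisher_exact_two_sided(80, 20, 50, 50)` compares a configuration with 80 successes against one with 50.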
For runtime measurements, we perform the Mann-Whitney U-test [43] to determine the p-value and measure the Vargha and Delaney Â12 statistic [44] for effect size. Additionally, following existing guidelines [42], we only consider the times of successful runs. Analysis shows that the N2-A configuration provides better runtime with statistical significance (p < 0.05) than the N2-WC and N3-ø configurations for all three scene sizes. Effect size is medium to large for our measurements (between 0.662 and 0.853).
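The Â12 effect size has a simple interpretation: it estimates the probability that a runtime drawn from one sample is larger than a runtime drawn from the other, counting ties as one half. The Python sketch below is a direct pairwise implementation of that definition; the `(succeeded, runtime)` run format and the helper `successful_runtimes` are hypothetical and only mirror the rule that failed runs are excluded.

```python
from typing import List, Sequence, Tuple


def successful_runtimes(runs: Sequence[Tuple[bool, float]]) -> List[float]:
    """Keep only the runtimes of runs that produced a solution."""
    return [runtime for succeeded, runtime in runs if succeeded]


def a12(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Vargha-Delaney A12: P(x > y) + 0.5 * P(x == y) over all pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))
```

A value of 0.5 indicates no difference between the two samples, while values close to 0 or 1 indicate that one sample is almost always smaller or larger than the other.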
Considering these results, we identify N2-A as the best configuration for our experimental setup and use it for comparison with Scenic approaches in RQ2. Furthermore, we select N2-A, N2-WC and N3-C as the top three promising configurations to be evaluated in scalability measurements (RQ3 and RQ4). Despite its slightly lower success rate, we select N3-C over N3-ø due to its significantly faster runtimes.

RQ1: Out of the 8 evaluated MHS configurations, N2-A either provides significantly better success rate or better runtime compared to all other configurations.

RQ2: Comparing MHS with Scenic approaches
Measurement setup: This experiment aims to determine which approach completes scene concretization in reasonable time (2.1, 2.2). Given a particular scene specification as input, we also determine (2.3) which approach is most likely to succeed in concretizing it. We perform measurements over the 3 road maps using scenes with 2, 3 and 4 actors to compare the 4 concretization approaches. As in RQ1, we exclude larger scene sizes to enable cross-approach comparisons with higher success rates. Additionally, as discussed in RQ1, MHS refers to the N2-A configuration of our proposed approach. For each size and map, we randomly generate 10 FSSs as inputs (see Section 6.3), and run each approach 10 times (each with a time-out of 10 minutes), for a total of 100 runs.

Analysis of results (RQ2.1):
We compare the overall success rate wrt. different maps in the top row of Figure 10. Each figure contains cumulative success rate results for all four approaches, depicted in the corresponding color. For Scenic approaches, a run is considered to be successful if the approach provides a solution to the input partial problem that also satisfies the removed relations (i.e. the provided solution solves the complete problem).
Among the Scenic approaches, SceHyb consistently provides relatively high success rates. Nevertheless, for all maps, and for all scene sizes, the success rate is dominated by MHS. Particularly for 4-actor scenes, Scenic approaches are generally unable to provide any solutions, while the success rate of MHS varies between 56% and 75%.
We evaluate statistical significance of our results according to existing guidelines [42] as detailed in Section 6.4. For each configuration (map and scene size), we evaluate the statistical difference between MHS and the Scenic approach with the highest success rate. The statistical test results are shown in Table 2. p-values are lower than 0.05 for all configurations, and the lowest effect size is 7.0. Hence, the success rates are significantly higher (with large effect size) for MHS compared to Scenic approaches.

RQ2.1: For success rates, MHS dominates all Scenic approaches with statistical significance. Scenic approaches reach their scalability limit at 4 actors with close to 0 success rate, while the success rate of MHS is still 56-75%.

Analysis of results (RQ2.2):
We compare the runtimes wrt. different maps in the middle row of Figure 10. Each figure contains measurement results for scenes with up to 4 actors. Results for all four of the approaches are depicted as color-coded box plots.
Is Scenic faster than MHS? For scenes with 2 or 3 actors, the MHS approach is generally slower than the Scenic approaches. We evaluate the statistical significance of our results according to existing guidelines [42] as detailed in Section 6.4.
For each configuration (map and scene size), we evaluate the statistical difference between the MHS approach and each Scenic approach (pairwise). For 2-actor scenes, all pairwise comparisons show a statistically significant difference (p < 0.05) with large effect size (Â12 > 0.85) in favor of the Scenic approaches. For 3-actor scenes, there is only a statistically significant difference in favor of SceDef for the CARLA and ZALAZONE maps and of SceReg for the ZALAZONE map (with large effect size, Â12 > 0.7).
How much faster is Scenic? To evaluate how much faster Scenic approaches are, we multiply the Scenic runtimes by a constant factor c, then we evaluate the statistical difference between the new runtimes and the MHS runtimes (pairwise). If no statistically significant difference is detected, we conclude that the given Scenic approach is at most c times faster than the MHS approach.
For 2-actor scenes, our results show that SceDef is 141 times faster, SceReg is 19 times faster, and SceHyb is 13 times faster than MHS. However, while Scenic approaches are often very fast (1-10 milliseconds), MHS also provides results in reasonable time.
For 3-actor scenes, Scenic approaches are at most 2.7 times faster than MHS approaches. An exception is the SceDef approach applied to the CARLA map, where SceDef is 122 times faster. However, in this case, there is a significant difference in success rate between the two approaches, favoring MHS.
For 4-actor scenes, very few data points exist for Scenic approaches as they failed to provide a solution in most runs. MHS data shows that the median runtimes of successful runs are 29.1s for TRAMWAY, 60.0s for CARLA and 232.8s for ZALAZONE. As such, we notice that the ZALAZONE map provides the most challenging scene concretization problem for MHS in terms of runtime, while also yielding success rates comparable to or lower than those of other maps (see RQ2.1). This is attributed to the large map size and complex structure (containing many unusual road segments), which affects the search space for MHS.

RQ2.2: For scenes with 2 or 3 actors, Scenic approaches are 1-2 orders of magnitude faster than MHS, which can still provide results in reasonable time (with better success rates). For 4-actor scenes, only MHS is successful, with a median runtime of at most 232.8s (reported for the ZALAZONE map).

Analysis of results (RQ2.3):
Success rates wrt. increasing number of actors are shown in the last row of Figure 10. Each figure aggregates data for all measurements (i.e. for all maps and scenes) performed for a given number of actors.
Considering that 10 measurements were performed per scene, each scene has a corresponding success rate (for each approach). Each bar in these figures represents the number of scenes where the associated success rate corresponds to the x-axis label. For instance, the bottom-left subfigure of Figure 10 shows that there are 19 (2-actor) scenes where SceDef provides a success rate of 0% or 10% (the leftmost bar). Similarly, there are 26 scenes where MHS provides a success rate of 100% (the rightmost bar).
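The scene-level view can be reproduced from raw run outcomes with a few lines of Python. The sketch below uses a made-up `outcomes` dictionary (scene identifier mapped to ten success flags) and counts how many scenes share each number of successful runs; the data and the function names are purely illustrative, and the grouping of counts into the bars of Figure 10 is not reproduced here.

```python
from collections import Counter
from typing import Dict, List


def successes_per_scene(outcomes: Dict[str, List[bool]]) -> Dict[str, int]:
    """Map each scene to its number of successful concretization runs."""
    return {scene: sum(1 for ok in runs if ok) for scene, runs in outcomes.items()}


def histogram(outcomes: Dict[str, List[bool]]) -> Counter:
    """Count how many scenes reach each number of successful runs."""
    return Counter(successes_per_scene(outcomes).values())


# Illustrative data: four scenes with 10 runs each.
outcomes = {
    "scene_a": [True] * 10,
    "scene_b": [True] * 10,
    "scene_c": [True] * 3 + [False] * 7,
    "scene_d": [False] * 10,
}
hist = histogram(outcomes)
```

A cumulative success rate can hide scenes that are never solved; this per-scene count makes them visible as entries at zero.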
For 2-actor and 3-actor scenes, (1) SceDef provides distributions skewed towards lower and higher success rates, (2) SceReg and SceHyb provide more uniform distributions, and (3) MHS provides a distribution skewed towards higher success rates. This shows that certain scenes cannot be concretized by Scenic approaches, while MHS provides at least 20% success rate for every 2-actor or 3-actor scene.
For 4-actor scenes, MHS provides a uniform distribution of success rates, while Scenic approaches most often cannot solve the concretization problems. In fact, MHS was able to provide at least 1 solution (i.e. at least 10% success rate) for every scene with 4 actors.

RQ2.3:
The MHS approach provides at least one solution (i.e. a 10%+ success rate) for every input scene (i.e. concretization problem) as the number of actors increases. For an arbitrary practical scene concretization problem, MHS is more likely to find a solution than Scenic approaches.

RQ3: Scalability analysis wrt. constraints
Measurement setup: This experiment aims to determine how the inclusion of additional constraints influences runtime and success rate for three promising MHS configurations. For this research question, our measurements are restricted to the most challenging 4-actor ZALAZONE configuration (see RQ2). As discussed in RQ1, we evaluate the N2-WC, N2-A and N3-C configurations of our approach. Furthermore, since Scenic approaches failed to handle scenes with 4 actors, we exclude them from our scalability measurements.
We use the same 10 scenes used for RQ2. However, we gradually build up the scenes by including all constraints of a certain type at once, and we perform measurements after adding each constraint type. Specifically, we start with scenes containing no constraints (ø), then we gradually add road placement (R), collision avoidance (C), positional (P), distance (D) and visibility (V) constraints until we reach the complete scene specification.
The order of adding constraint types is based on their dependencies: (1) collision avoidance (C) is irrelevant for realistic initial scene generation if vehicles are not placed on roads (R), (2) distance (D) and position (P) cannot be measured if vehicles are overlapping (C), and (3) visibility (V) is based on position (P). While, for each constraint type, the exact number of added constraints may vary between scenes, adding constraints by type ensures that, for a given scene, the number of constraints gradually increases.
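The build-up procedure can be summarized by the following Python sketch, which produces the sequence of cumulative specifications (ø, R, RC, RCP, RCPD, RCPDV) for one scene. The function name and the constraint strings in the usage example are hypothetical; only the ordering of constraint types is taken from the text above.

```python
from typing import Dict, List, Tuple

# Order in which constraint types are added, following their dependencies.
CONSTRAINT_ORDER = ["R", "C", "P", "D", "V"]


def incremental_specifications(constraints_by_type: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Return (label, constraints) steps, starting from the empty specification."""
    steps = [("ø", [])]
    accumulated: List[str] = []
    label = ""
    for constraint_type in CONSTRAINT_ORDER:
        # Add all constraints of the next type at once.
        accumulated = accumulated + constraints_by_type.get(constraint_type, [])
        label += constraint_type
        steps.append((label, list(accumulated)))
    return steps


scene = {
    "R": ["onRoad(a)", "onRoad(b)"],
    "C": ["noCollision(a, b)"],
    "P": ["ahead(a, b)"],
    "D": ["close(a, b)"],
    "V": ["canSee(a, b)"],
}
steps = incremental_specifications(scene)
```

Each step is then concretized independently, so that the effect of every constraint type on success rate and runtime can be observed in isolation.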
As in RQ1 and RQ2, we perform 10 iterations per scene with a 10-minute time-out.

Analysis of results:
Measurement results for RQ3 are shown in the top row of Figure 11. For N2-WC, the gradual inclusion of constraint types results in a decrease in success rate and an increase in median runtime. Similar trends are also observed for N3-C and for N2-A, however, only up to the inclusion of distance (D) constraints. In all cases, these trends become particularly noticeable after the inclusion of positional (P) constraints.
For N3-C, the further inclusion of visibility (V) constraints results in an increase in success rate and a stagnation in runtime. Although these results might seem counterintuitive, they are in line with the behavior of popular SAT solvers, where an increasing number of constraints does not necessarily correlate with increasing complexity of the underlying constraint satisfaction problem [45].
For N2-A, the further inclusion of visibility (V) constraints results in a stagnation in both success rate and median runtime, which is attributed to the use of A as the objective function aggregation strategy. C and WC both produce a new objective function for each newly added constraint type, which explains the observed variation in success rate and runtime throughout the experiment. However, A produces a constant number of objective functions regardless of the included constraints. Furthermore, visibility constraints have a similar formalization to ahead() positional (P) constraints. As such, according to our random FSS generation approach described in Section 6.3, scenes that contain visibility constraints likely also contain ahead() constraints. Considering that ahead() constraints are already handled with the inclusion of positional (P) constraints, the addition of visibility constraints should not significantly influence success rate and runtime, as shown in our results.

RQ3:
Gradual inclusion of constraint types generally results in a performance decrease for the MHS approaches, particularly with the addition of positional constraints. Furthermore, the choice of objective function aggregation strategy influences scalability results wrt. constraints.

RQ4: Scalability analysis wrt. actors
Measurement setup: This experiment aims to determine the maximum size of a scene specification that our approach can successfully concretize in reasonable time. For this research question, we measure the scalability of the MHS approach by introducing more than 4 actors. As in RQ3, we evaluate the N2-WC, N2-A and N3-C configurations on the most challenging ZALAZONE map.
We perform measurements over 5 randomly generated input scene specifications per scene size, increasing the size up to the point where the MHS approach is predominantly failing (i.e. under 10% success rate). We run concretization 5 times for each scene specification, for a total of 25 runs per scene size per MHS configuration. Unlike in previous experiments, we increase the time-out to 2 hours for each run, and we measure the success rate and runtime of the concretization runs.

Analysis of results:
Measurement results for RQ4 are shown in the bottom row of Figure 11. Despite a high success rate for 4-actor scenes, the N2-A configuration is only capable of concretizing scenes with up to 5 actors. For N2-WC, scalability is limited to 6-actor scenes. N3-C is the most scalable configuration and can concretize scenes with up to 7 actors (our results show a 16% success rate for 7-actor scenes). Additionally, despite the 2-hour time-out, the median runtime for the N3-C configuration for 7-actor scenes is 1375s (23 minutes).
Note that increasing the number of actors by one represents an exponential increase in the overall complexity of the scene concretization problem. Consider a scene containing $m$ actors defined over a set of $n$ directed binary relation symbols. The size of the entire search space is estimated (i.e. over-approximated) as $(2^n)^{m(m-1)}$, where (i) $2^n$ is the number of possible relation combinations over an ordered pair of actors, and (ii) $m(m-1)$ is the number of ordered actor pairs in the scene. As such, the search space complexity is $O(2^{nm^2})$. In particular, the largest search space handled by MHS (for 7 actors) is $2^{360}$ times (over 100 orders of magnitude) larger than the 3-actor space handled by Scenic approaches.
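The figures quoted above can be checked with a short Python calculation. The sketch assumes $n = 10$ relation symbols; this value is not stated explicitly in this section and is inferred here from the reported ratio of $2^{360}$ between the 7-actor and 3-actor search spaces.

```python
import math


def log2_search_space(n_relations: int, m_actors: int) -> int:
    """Base-2 logarithm of (2^n)^(m(m-1)), i.e. n * m * (m - 1)."""
    return n_relations * m_actors * (m_actors - 1)


N_RELATIONS = 10  # assumption inferred from the reported 2^360 ratio

log2_seven = log2_search_space(N_RELATIONS, 7)   # exponent for 7 actors
log2_three = log2_search_space(N_RELATIONS, 3)   # exponent for 3 actors
log2_ratio = log2_seven - log2_three              # exponent of the ratio

# Convert the ratio from powers of two to decimal orders of magnitude.
orders_of_magnitude = log2_ratio * math.log10(2)
```

Under this assumption, the 7-actor space has $2^{420}$ states, the 3-actor space has $2^{60}$ states, and their ratio of $2^{360}$ corresponds to roughly 108 decimal orders of magnitude.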

RQ4:
The scalability of MHS approaches is limited to scenes with 7 actors over the ZALAZONE map, which corresponds to a search space with $2^{420}$ states. As such, MHS can handle an exponentially (over 100 orders of magnitude) larger search space compared to Scenic approaches.

Towards testing vision-based ML components for semantic segmentation
Although our initial scene concretization approach is a first step towards complete scenario-based testing of AVs, concrete scenes are commonly used for testing various components of AVs, such as cameras and LiDAR sensors. As a proof of concept, we provide initial results for the testing of the semantic segmentation functionality of a computer-vision component by integrating our proposed approach with the realistic CARLA simulator [5], which includes a simple AV stack that may be tested. Additionally, in ongoing work, we use our scene concretization technique to evaluate three computer-vision components, namely SegFormer [46], ANN [47] and BiSeNet V2 [48]. However, the complete evaluation of these components is outside of the scope of this paper, which primarily focuses on the core scene concretization technique.

Fig. 12: Artifacts derived from using a generated scene to test the semantic segmentation capabilities of ANN. (e) Metrics measurement data.

Figure 12 shows an example where a 3-actor scene generated by our approach is used as a test case to evaluate a computer-vision component. For such scenes, we provide:
[1] an RGB image of the scene (Figure 12a) from the ego vehicle's viewpoint (dashcam footage),
[2] ground truth semantic segmentation (SEG) of the RGB image as provided by CARLA (Figure 12c),
[3] predicted SEGs of the RGB image using the three listed computer-vision components (a sample SEG using ANN is shown in Figure 12d),
[4] an overlay of the predicted SEGs over the RGB image (Figure 12b), and
[5] initial measurement data for precision, recall and intersection-over-union (IoU) [49] for different classes of objects in the scene (Figure 12e). This measurement data is used to evaluate the performance of the computer-vision component under test.
In addition to these artifacts, we also include (online on the publication page): [6] a Scenic description containing exact positions of each vehicle (for reproducibility), and [7] a video of the scene running with the default AV stack included in CARLA.
Further initial results (for 30 scenes located on the ZALAZONE map containing 2, 3 or 4 actors) are also available on the publication page. These scenes are derived such that all non-ego vehicles can be seen by the ego vehicle, which increases the relevance of our auto-generated scenes for the testing of vision-based ML components (as each object has an impact on the test).
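For reference, the per-class metrics mentioned above follow the standard pixel-wise definitions, which the Python sketch below computes from a ground truth label map and a predicted label map. The tiny label maps and the function name `per_class_metrics` are illustrative only; the actual measurement pipeline operates on the images produced by CARLA and by the evaluated components.

```python
from typing import Dict, List


def per_class_metrics(truth: List[List[str]], pred: List[List[str]], cls: str) -> Dict[str, float]:
    """Pixel-wise precision, recall and IoU for one object class."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            if p == cls and t == cls:
                tp += 1   # correctly predicted pixel
            elif p == cls:
                fp += 1   # predicted as cls, but belongs to another class
            elif t == cls:
                fn += 1   # belongs to cls, but predicted as another class
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Illustrative 2x3 label maps.
truth = [["car", "car", "road"], ["road", "road", "road"]]
pred = [["car", "road", "road"], ["car", "road", "road"]]
car_metrics = per_class_metrics(truth, pred, "car")
```

Reporting the metrics per class is useful here because small objects, such as distant vehicles, contribute few pixels and would otherwise be dominated by large classes such as the road surface.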

Threats to validity
Construct validity. Our approach represents a traffic scene as a set of pre-defined abstract relations. In this paper, we excluded certain relations associated with, e.g., vehicle size and orientation, but our proposed approach may be generalized to include those relations. Additionally, we use various approximations (e.g. actors are modeled as rectangles) in our definition of numeric constraints to simplify the scene concretization task. Nevertheless, we provide a complete conceptual basis for scene concretization with an extensible specification language and customizable numeric constraints.
We compare various MHS algorithms and objective function aggregation strategies to solve the derived MIN problem. We selected population size and number of offspring based on preliminary measurements. Default values are used for all other parameters.
Internal validity. To strengthen internal validity, we explicitly call the garbage collector between scene concretization runs. Additionally, we derive the input functional scenes in a way that ensures their feasibility on the tested map (to avoid contradicting relations). Our input scene derivation approach is only limited by an enforced maximum relative distance between actors, which does not restrict the diversity of the derived scenes.
External validity. We mitigate threats to external validity by performing measurements over 3 maps with different characteristics, derived from different sources (including a real test track and a real-world location). We perform comparative evaluation of up to 8 MHS configurations, composed of 3 MHS algorithms and of 6 objective function aggregation strategies. We also compare our proposed approach to three variations of the state-of-the-art baseline approach, until their scalability limits are reached.
For RQ1 and RQ2, we perform a thorough evaluation (100 concretization runs per approach/configuration, per map, per scene size) to adequately compare the approaches and to cover a diverse set of scenes for each setting. We accompany our measurements with statistical significance analysis in accordance with the best practices proposed by Arcuri, et al. [42]. We also perform a thorough evaluation for RQ3 (100 concretization runs per MHS configuration, per set of constraint types). Our scalability analysis for RQ4 is restricted to MHS approaches (as the baseline Scenic approaches failed to provide solutions in RQ2). Here, we limit our evaluation to 25 concretization runs, over 5 scenes per size, which is still sufficient to illustrate the scalability of our MHS approach.

RELATED WORK
First, we overview existing abstract scenario specification languages that describe scenes and scenarios at a functional level (Figure 13a). We evaluate them with respect to (a) expressiveness (to represent arbitrary scenarios and constraints), (b) extensibility with custom traffic concepts, (c) handling of temporal concepts, e.g. behaviors, (d) support for static error detection, and (e) prior use in scenario concretization.
We then evaluate existing (concrete) scenario generation approaches (Figure 13b) (as a generalisation of scenario concretization approaches) to check if they can handle (a) arbitrary abstract scenario specifications as input (i.e. assumptions about interactions between actors are not hard-coded into the approach), (b) adjustable numeric constraints, and (c) any underlying road map given as input.

Scenario specification languages
Existing scenario specification languages often build upon a conceptual basis for traffic scenarios proposed by Ulbrich, et al. [14], by Menzel, et al. [15] and by Steimle, et al. [50]. These approaches describe the various components and abstraction levels required for scenario specification. At a conceptual level, scenarios are often defined through multilayer representations [24], [27], [51], [52] which provide a hierarchy for traffic scenario components.
Ontologies and models: Ontologies [24], [27], [53], [54] can provide a formal basis for functional scenario specification. A similar level of formality is provided by model-based scenario specification approaches [63], [64]. Conceptually, these approaches represent an expressive and extensible specification language, but they are not yet often used in existing research as inputs for scenario concretization.
Temporal scenario concepts: Other specification approaches use temporal concepts as building blocks for scenarios. Such temporal concepts include (1) reasoning over vehicle paths [55], (2) sequences of vehicle behaviors [56] or (3) conditional state transitions between scenes [57]. The expressivity and extensibility of these approaches are limited, but initial concretization results are often provided.
Input languages for generation approaches: Existing scenario generation approaches (detailed in Section 7.2) often define functional scenarios with a custom specification language that may include temporal concepts. However, such input languages are often tailored to represent a specific type of scenario (e.g. with a single maneuver decided a priori). As such, they lack in expressiveness and extensibility. An exception is SCENIC [11], which is thoroughly discussed in this paper. Despite its expressiveness limitations, SCENIC handles arbitrary scenario specifications as input.
Fig. 13: Comparison of our approach with the existing state of the art. Panel (b): Comparison of scenario generation approaches.

Scenario generation approaches
Search-based approaches are most commonly used for scenario generation. Many-objective search can be used to test feature interactions in AVs [1], to perform efficient online testing [18] and to address the branch coverage of test suite generation approaches [65]. Additionally, Ben Abdessalem, et al. rely on multi-objective search [21], and a learnable evolutionary algorithm [17] to guide scenario generation towards critical scenarios. Critical scenarios have also been derived using a weighted search-based approach [22] and a genetic algorithm [23]. Similarly, DEEPJANUS [19] combines evolutionary and novelty search to derive test inputs at the behavioral frontier of AVs, while Babikian, et al. [2] use a hybrid, graph and numeric solver-based approach to concretize a limited-visibility pedestrian crossing scenario. Note that different search-based approaches may represent domain-specific constraints differently in the underlying algorithm [66] (e.g. objectives vs. hard constraints).
Existing search-based approaches are often used to guide scenario generation towards test cases with particular characteristics. Our approach can complement such existing work as discussed in Section 2.4. Nevertheless, existing search-based approaches as standalone components are limited in various aspects. Such approaches are often designed for the concretization of a specific scenario type, which is selected upfront and then hard-coded into the search problem. In particular, numeric constraints are hard-coded according to (1) the specific interactions between actors and (2) the fixed underlying map structure. While our measurements clearly highlighted that the actual map location has a substantial influence on the de facto complexity of a scenario concretization problem, all existing search-based approaches are limited to a pre-defined map location, i.e. the underlying map is not given as input. Search-based approaches exist for map generation [30], but no actors are involved in such cases.
Sampling-based approaches have also been used for scenario generation. Certain approaches [58], [60] sample over a parametric (discrete) representation of arbitrary functional input scenarios, but they often avoid numeric (continuous) constraints (i.e. the logical scenario). O'Kelly, et al. [61] use Monte Carlo sampling (over a continuous domain) to simulate scenarios with rare events, thus reasoning directly at the logical scenario level. However, all these sampling-based approaches are limited to a fixed map location.
The SCENIC framework [11] also provides sampling-based concretization, but it improves on other sampling-based approaches by handling any road map as an input parameter. Limitations of the input language and of the underlying functional-to-numeric constraint mapping are discussed in Section 3.2 and in Section 4.3, respectively.
Path-planning approaches [59], [62] address scenario generation directly at the level of numeric constraints. As such, they cannot handle abstract constraints as input. These approaches often use formalizations of safety requirements as guiding metrics for scenario generation. Furthermore, despite the lack of experimental results, such path-planning approaches are in principle adaptable to any road map.

CONCLUSIONS
In this paper, we proposed a traffic scene concretization approach that leverages metaheuristic search to place vehicles on an arbitrary road map (given as input) such that they satisfy a set of input constraints. Our approach handles traffic scenes on three different levels of abstraction in compliance with safety assurance best practices for autonomous vehicles. The input is a functional scene specification (represented in a novel scene specification language with 4-valued partial model semantics) that captures the scene concretization problem by abstract relations, and enables early detection of inconsistent specifications. Then, the functional scene specification is mapped to a complex numeric constraint satisfaction problem on the logical level. Finally, we use metaheuristic search with customizable objective functions and constraint aggregation strategies to solve the numeric problem in order to derive a concrete scene that can be investigated in the popular CARLA simulator [5].
We carried out a detailed experimental evaluation comparing eight configurations of our proposed approach over three realistic road maps to assess success rate, runtime and scalability.Our results show that despite higher runtimes, our approach provides significantly better success rate and scalability than the state-of-the-art SCENIC tool, while traversing a search space with 2 420 states.
(Initial) scene concretization is a subproblem of the complex challenge of scenario-based testing of AVs.As such, our future work aims to integrate the handling of behaviors and dynamic constraints, potentially representable through temporal logic languages.Moreover, we plan to address the systematic synthesis of abstract functional scene specifications with certain coverage guarantees over the autogenerated specification suites.

Soundness of a numeric solution.
We demonstrate the soundness of a numeric solution by showing that:
THEOREM 1: For a numeric problem $N = \mathit{fl}(P)$ derived from a FSS, the concrete partial model $P_s$ abstracted from a solution $s_N$ of $N$ as $P_s = \mathit{lf}(s_N)$ satisfies all relations in $P$ (formally, $P_s$ refines $P$, i.e. $P \subseteq P_s$ [26]), where $N = \langle A_N, C_N, m_N, D_N \rangle$, $P = \langle O_P, I_P \rangle$ and $s_N : A_N \to \mathbb{R}^5$ is formalized as in Section 4.1.
A key component of our soundness proof is the partial model refinement operation, denoted by $P \subseteq P_s$, as defined in [26]. Informally, this means that (1) all unknown relations in $P$ are mapped to a true or false relation in $P_s$, and (2) no other relations are modified.
Figure 14 provides a visual representation of our proof.
Fig. 14: A solution $s_N$ of a numeric problem $N = \mathit{fl}(P)$ corresponds to a refinement $P_s$ of $P$.
Proof. [4] (1) Considering that $s_N$ satisfies all positive and negative relations in $P$, their truth value is unchanged in $P_s$. Furthermore, (2) all unknown relations in $P$ may or may not hold in $P_s$ (i.e. the corresponding numeric constraint may or may not be satisfied in $s_N$). In any case, these relations are refined either to a true or to a false value in $P_s$. This shows that $P \subseteq P_s$ holds for any numeric problem $N$ and solution $s_N$.
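The two informal refinement conditions used in this proof can be checked mechanically on small examples. The following is a minimal sketch, assuming a partial model is stored as a dictionary from (relation, source, target) triples to a truth value; the names `refines`, `P` and `P_s` and the toy relations are hypothetical and are not taken from the paper's implementation.

```python
TRUE, FALSE, UNKNOWN = "true", "false", "unknown"


def refines(abstract, concrete):
    """Check the two informal refinement conditions between partial models.

    Both arguments map (relation, source, target) triples to a truth value.
    This dictionary encoding is an assumption made for illustration.
    """
    for key, value in abstract.items():
        refined = concrete.get(key, UNKNOWN)
        if value == UNKNOWN:
            # (1) every unknown relation must be decided (true or false)
            if refined not in (TRUE, FALSE):
                return False
        elif refined != value:
            # (2) relations that are already true or false must not change
            return False
    return True


# Hypothetical functional scene with one relation of each kind
P = {
    ("left", "oA", "oB"): TRUE,
    ("close", "oA", "oB"): FALSE,
    ("canSee", "oA", "oB"): UNKNOWN,
}

# Partial model abstracted from a numeric solution: the unknown is decided
P_s = dict(P)
P_s[("canSee", "oA", "oB")] = TRUE
```

Here `refines(P, P_s)` holds, whereas a model that leaves `canSee` unknown or flips `left` to false would not count as a refinement under this reading.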

Soundness of a MHS solution.
We demonstrate the soundness of a MHS solution by showing that:
THEOREM 2: For a MIN problem $M_{min}$ derived from a numeric problem $N$, a solution $s_{min}$ to $M_{min}$ is also a solution to $N$ (i.e. all numeric constraints are satisfied), where $M_{min} = \langle V_M, OF_M \rangle$, $N = \langle A_N, C_N, m_N, D_N \rangle$, and $s_{min}$ is a value assignment (within the specified range) for each $v_i \in V_M$.
Proof. A solution $s_{min}$ to $M_{min}$ is defined such that all objective functions $OF_i \in OF_M$ applied over $s_{min}$ evaluate to a value below a threshold $\epsilon > 0$. Note that $OF_i$ is constructed as a weight function applied over the sum of a set $D$ of non-negative distance functions. Considering that a weight function returns non-negative values and does not modify the minima, we can deduce that $OF_i < \epsilon$ holds iff all the distance functions in $D$ return a value below a threshold $\epsilon_1 > 0$. By definition, distance functions explicitly return 0 iff the corresponding numeric constraint $c_i \in C_N$ holds for the candidate solution it is measured on. Since all distance functions applied to $s_{min}$ return 0 (with precision $\epsilon_1$), all corresponding constraints (i.e. all constraints in $C_N$) are satisfied by the numeric solution $s_N$ represented by $s_{min}$.
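The argument can be illustrated on a toy problem. The sketch below assumes a one-dimensional scene with two hypothetical interval constraints; `df_range`, `objective`, `is_min_solution`, the weight function and the threshold value are illustrative choices, not the paper's implementation.

```python
def df_range(value, low, high):
    """Non-negative distance of `value` from the interval [low, high]."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def objective(distances, weight=lambda total: 2.0 * total):
    """Weight function applied over a sum of non-negative distances.

    The weight is assumed to be non-negative and to keep the minimum at 0.
    """
    if any(d < 0 for d in distances):
        raise ValueError("distance functions must be non-negative")
    return weight(sum(distances))


def is_min_solution(objective_values, eps=1e-9):
    """A candidate solves the MIN problem iff every objective is below eps."""
    return all(value < eps for value in objective_values)


def evaluate(candidate):
    """Hypothetical constraints: gap between actors in [5, 10], A on [0, 100]."""
    gap = candidate["B"] - candidate["A"]
    distances = [df_range(gap, 5.0, 10.0), df_range(candidate["A"], 0.0, 100.0)]
    return distances, [objective(distances)]


good_distances, good_objectives = evaluate({"A": 0.0, "B": 7.0})
bad_distances, bad_objectives = evaluate({"A": 0.0, "B": 12.0})
```

Because every distance is non-negative, the weighted sum can only fall below the threshold when each individual distance does, which is the step the proof relies on.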

Fig. 2: An initial traffic scene described at various levels of abstraction
$V_{cust}$: $\dfrac{a, b \in O_P \wedge I_P(\mathit{behind})(a, b)}{I_P(\mathit{canSee})(a, b) := I_P(\mathit{canSee})(a, b) \oplus \mathit{false}}$
Example 3: A functional-level scene specification is defined as a partial model $P = \langle O_P, I_P \rangle$ by the relation assertions in Figure 3a. The scene ($P$) contains three car objects ($o_G, o_B, o_R \in O_P$), which are placed on a road and do not collide with each other.
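The custom validation rule shown with Fig. 2 merges a false value into the canSee relation whenever the behind relation holds. The following is a minimal sketch of such a merge over 4-valued truth values. Only the case where merging true with false yields error is stated in the source; the remaining cases of `merge` (unknown is neutral, error is absorbing) are assumptions, as are the function names and the dictionary encoding.

```python
TRUE, FALSE, UNKNOWN, ERROR = "true", "false", "unknown", "error"


def merge(current, asserted):
    """Merge two 4-valued truth values (assumed information-merge semantics)."""
    if ERROR in (current, asserted):
        return ERROR
    if current == UNKNOWN:
        return asserted
    if asserted == UNKNOWN or asserted == current:
        return current
    return ERROR  # true merged with false is inconsistent


def apply_v_cust(interpretation, objects):
    """Apply the rule: if behind(a, b) holds, merge false into canSee(a, b).

    The premise is read as behind(a, b) being true (an assumption).
    """
    result = dict(interpretation)
    for a in objects:
        for b in objects:
            if interpretation.get(("behind", a, b), UNKNOWN) == TRUE:
                key = ("canSee", a, b)
                result[key] = merge(result.get(key, UNKNOWN), FALSE)
    return result


objects = ["oG", "oB"]
consistent = apply_v_cust({("behind", "oG", "oB"): TRUE}, objects)
conflicting = apply_v_cust(
    {("behind", "oG", "oB"): TRUE, ("canSee", "oG", "oB"): TRUE}, objects
)
```

In the second call the specification asserts both relations, so the merged value becomes error; this is the kind of inconsistency that can be detected statically, before any numeric solving.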
Fig. 3: A functional-level scene specification

Fig. 5: Mapping from functional relations to logical constraints
[...] a set of objective functions derived from distance functions associated with numeric constraints $C_N$.
Distance functions: Each numeric constraint $c \in C_N$ has a corresponding distance function $DF_i(c)$ which is derived according to the relation category of $c$ (e.g. different functions for visibility and positional constraints). A distance function $DF_i(c)$ returns a non-negative number that represents how far a candidate solution $CS$ is from satisfying $c$.

Fig. 7: Informal overview of distance functions derived from positive functional relations

Fig. 9: Success rate and runtime measurements for various MHS configurations on the TRAMWAY map

[1] By construction, $P$ maps to $N$ such that each positive or negative relation in $P$ maps to a corresponding numeric constraint in $C_N$.
[2] By definition, $s_N$ is a solution to $N$ iff it satisfies all numeric constraints in $N$. This means that it also satisfies the corresponding relations in $P$ (i.e. all positive and negative relations in $P$).
[3] Given the numeric solution $s_N$, a corresponding partial model $P_s$ is derived by (1) enumerating abstract relations $r_i(o_a, o_b)$ from every possible abstract relation symbol $r_i \in \Sigma$ over every pair of actors $o_a, o_b \in O_P$, (2) deriving a numeric constraint $c_i$ for each abstract relation $r_i(o_a, o_b)$, and (3) evaluating $c_i$ over $s_N$. If $s_N$ satisfies $c_i$, then $r_i(o_a, o_b)$ holds in $P_s$ (formally, $I_{P_s}(r_i)(o_a, o_b) = \mathit{true}$). Otherwise, $I_{P_s}(r_i)(o_a, o_b) = \mathit{false}$.

Validation rules (written as premise over conclusion):
(rule name not recoverable): $\dfrac{a, b \in O_P \wedge I_P(r)(a, b) = \mathit{true}}{I_P(r)(b, a) := I_P(r)(b, a) \oplus \mathit{true}}$; for $r \in R_{dist} \cup \{\mathit{noColl}\}$
$V_{pos}$: $\dfrac{a, b \in O_P \wedge I_P(r_1)(a, b) = \mathit{true}}{I_P(r_2)(a, b) := I_P(r_2)(a, b) \oplus \mathit{false}}$; for $r_1 \in R_{pos}$, $r_2 \in R_{pos} \setminus r_1$
$V_{dist}$: $\dfrac{a, b \in O_P \wedge I_P(r_1)(a, b) = \mathit{true}}{I_P(r_2)(a, b) := I_P(r_2)(a, b) \oplus \mathit{false}}$; for $r_1 \in R_{dist}$, $r_2 \in R_{dist} \setminus r_1$
$V_{road}$: $\dfrac{a, b \in O_P \wedge a \neq b}{I_P(\mathit{onRoad})(a, b) := I_P(\mathit{onRoad})(a, b) \oplus \mathit{false}}$
[...] $\oplus\ \mathit{false} := \mathit{true} \oplus \mathit{false} = \mathit{error}$

In our notation, $o_A$ and $o_B$ define actors at the functional level (i.e. partial model objects), which are respectively mapped to actors $a_A$ and $a_B$ at the logical level to find a concrete solution (see Section 4.1). Corresponding relations and numeric attributes are color-coded. Customizable constants, such as $\theta_l$ and $d_c$, are depicted in black.
Positional relations: When actor $o_A$ is connected to actor $o_B$ via a positional relation $c \in C_{pos}$, this signifies that the center point $\langle a_B.x, a_B.y \rangle$ of $a_B$ is located in a circular sector centered at $\langle a_A.x, a_A.y \rangle$, with infinite radius. The orientation and central angle of the sector are defined according to the specific positional relation. For example, the left($o_A$, $o_B$) relation denotes a circular sector that covers the region located at the left of $a_A$ (with respect to its heading $a_A.h$). In that respect, left($o_A$, $o_B$) means that $a_B$ is positioned to the left of $a_A$.
Distance relations: A distance relation between actors $o_A$ and $o_B$ signifies that the Euclidean distance between the center point of each actor falls within a certain numeric range. For example, medDist($o_A$, $o_B$) requires that the Euclidean distance between $\langle a_A.x, a_A.y \rangle$ and $\langle a_B.x, a_B.y \rangle$ is [...]

Relation category: distance function ($DF_{pos}, \ldots, DF_{road}$)
$r(o_A, o_B) \mid r \in R_{pos}$ ($DF_{pos}$): If $o_B$ is within the circular sector defined by $r$, $DF_{pos}$ returns 0. Otherwise, $DF_{pos}$ returns the angle relative to $o_A$ between the $o_A o_B$ segment and the closest edge of the circular sector defined by $r$.
$r(o_A, o_B) \mid r \in R_{dist}$ ($DF_{dist}$): If both actors are at an appropriate distance, $DF_{dist}$ returns 0. Otherwise, $DF_{dist}$ denotes the shortest distance that $o_B$ must traverse to be positioned within the distance bounds required by $r$.
$r(o_A, o_B) \mid r \in R_{vis}$ ($DF_{vis}$): If $o_A$ can see $o_B$, $DF_{vis}$ returns 0. Otherwise, $DF_{vis}$ denotes the shortest distance that $o_B$ must traverse for at least one of its corners to be in the field of view of $o_A$.
$r(o_A, o_B) \mid r \in R_{coll}$ ($DF_{coll}$): $DF_{coll}$ returns 0 if $o_A$ and $o_B$ are positioned such that they do not collide, 1 otherwise.
$r(o_A, o_A) \mid r \in R_{road}$ ($DF_{road}$): If all four corners of $o_A$ are placed on a road contained in RoadMap, $DF_{road}$ returns 0. Otherwise, $DF_{road}$ denotes the shortest distance that $o_A$ must traverse for $o_A$ to be placed on a road.
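The positional and distance entries of this overview translate directly into small geometric functions. The following is a minimal sketch under stated assumptions: actors are reduced to a center point and a heading in radians, a positional sector is given by a center direction relative to the heading plus a half-angle, and the concrete values used for the left sector and for the distance bounds are hypothetical placeholders for the customizable constants.

```python
import math


def df_dist(ax, ay, bx, by, low, high):
    """Shortest distance B must move so that |AB| falls within [low, high]."""
    d = math.hypot(bx - ax, by - ay)
    if d < low:
        return low - d
    if d > high:
        return d - high
    return 0.0


def df_pos(ax, ay, heading, bx, by, sector_center, half_angle):
    """Angle between segment AB and the closest edge of the sector; 0 inside.

    `sector_center` is measured relative to A's heading (assumption).
    """
    bearing = math.atan2(by - ay, bx - ax) - heading - sector_center
    # Normalise the offset from the sector axis into [-pi, pi)
    offset = (bearing + math.pi) % (2.0 * math.pi) - math.pi
    return max(0.0, abs(offset) - half_angle)


# Hypothetical "left" sector: centered 90 degrees left of the heading,
# spanning 45 degrees on each side.
LEFT_CENTER, LEFT_HALF = math.pi / 2.0, math.pi / 4.0

inside = df_pos(0.0, 0.0, 0.0, 0.0, 5.0, LEFT_CENTER, LEFT_HALF)
ahead = df_pos(0.0, 0.0, 0.0, 5.0, 0.0, LEFT_CENTER, LEFT_HALF)
too_close = df_dist(0.0, 0.0, 3.0, 4.0, 10.0, 20.0)
in_range = df_dist(0.0, 0.0, 3.0, 4.0, 2.0, 8.0)
```

With actor A at the origin facing along the positive x axis, an actor directly to its left yields a zero distance, while an actor straight ahead is 45 degrees away from the nearest sector edge. Both functions return 0 exactly when the constraint holds, which is the property the soundness argument for MHS solutions depends on.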

TABLE 1: Overview of the hyperparameters and genetic operators used for each evaluated MHS algorithm. The table refers to Simulated Binary Crossover (SBX)

TABLE 2: Statistical test results comparing success rates: MHS vs. the best Scenic approach wrt. map and scene size (p: p-value, e: effect size). Effect size is large for all data points.
Fig. 11: Scalability results (for MHS runs on the ZALAZONE map) wrt. constraints (for 4 actors) (RQ3) and wrt. number of actors (RQ4). We report measurement data for success rate and runtime.