Universality of Büchi Automata: Analysis With Graph Neural Networks

The universality check of Büchi automata is a foundational problem in automata-based formal verification, closely related to the complementation problem, and is known to be computationally difficult; more concretely, it is PSPACE-complete. This article introduces a novel approach for creating datasets of Büchi automata labelled with respect to their universality. We start with small automata, where the universality check can still be algorithmically performed within a reasonable timeframe, and then apply transformations that provably preserve (non-)universality while increasing their size. This approach enables the generation of large datasets of labelled Büchi automata without the need for an explicit and computationally intensive universality check. We subsequently employ these generated datasets to train Graph Neural Networks (GNNs) for the purpose of classifying automata with respect to their universality resp. non-universality. The classification results presented in this article indicate that such a network can learn patterns related to the behaviour of Büchi automata that facilitate the recognition of universality. Additionally, our results on randomly generated automata, which were not constructed using the aforementioned transformation techniques and were classified algorithmically, demonstrate the network's potential in classifying Büchi automata with respect to universality, extending its applicability beyond cases generated using a specific technique.


I. INTRODUCTION
In 1962, Büchi introduced automata on infinitely long words [1] and showed the equivalence between the languages accepted by these so-called Büchi automata and ω-regular languages. In order to verify properties of systems, Büchi automata are used to model these systems, as the properties of infinitely long inputs nicely represent the indefinitely long running time of concurrent systems.
One of the main problems and algorithmic bottlenecks in automaton-based formal verification of systems is the complementation of Büchi automata [2], which is shown to be PSPACE-complete, with every complementation algorithm requiring a state growth of at least O((0.76n)^n) w.r.t. the number of nodes.¹ A different approach to automaton-theoretic verification focusses on language containment.

The associate editor coordinating the review of this manuscript and approving it for publication was Tyson Brooks.

¹ With n being the number of nodes of the automaton to be complemented.
The goal of this article will be to show the progress made in our current research work, with the overarching goal of analysing whether Graph Neural Networks (GNNs) are able to derive properties (beyond strictly structural graph-related properties like e.g. node reachability) from Büchi automata which are encoded as directed labelled graphs with additional state labels. The classification property in question will be the aforementioned (non-)universality of the input automaton, a property directly linked to the language accepted by the automaton. This research extends our previous work [5], which showed that GNNs are able to derive structural patterns from automaton structures for the classification of structurally and computationally trivial properties.
Using the learning power of deep learning architectures to solve problems related to formal languages or automata theory has led to various research results, e.g. using Transformers to predict solutions of linear-time temporal logic (LTL) formulas [6], learning transition rules of cellular automata using a GNN architecture [7] or training neural networks to synthesize finite automata from a given language [8]. To the best of our knowledge, our work on using GNNs for automaton-level classification of Büchi automata is a novel addition to the field.
In the next section, we will present an overview of the required concepts and their respective definitions and how they will be used in this article. The third section will lay the formal foundations of our proposed transformation constructions, which allow us to generate labelled data with respect to universality. The fourth section will focus on the challenges of encoding Büchi automata as data elements for GNNs and on the creation of the different datasets used in the experimentation. In the fifth section, we will give details about the various experimental setups, both on the GNN level and on the dataset choice level, and present experimental results. We will finish by discussing the results and proposing future work in a concluding final section.

II. PRELIMINARIES
A. BÜCHI AUTOMATA
The literature defines multiple types of automata on infinite structures [9], but this article will focus on non-deterministic Büchi automata (NBWs):

Definition 1: A NBW A is defined as a tuple A = (Q, Σ, δ, q0, F), where
• Q represents a finite set of states,
• Σ represents a finite set of symbols (called an alphabet),
• δ : Q × Σ → 2^Q represents the transition function,²
• q0 ∈ Q represents the initial state and
• F ⊆ Q represents the set of accepting states.

We will define the behaviour of this automaton over infinitely long sequences of symbols from the alphabet, called ω-words. Such a word will produce one or more so-called runs over a given automaton. Before we continue, let us formally define ω-words, runs over NBWs, and the concept of infinitely many occurrences in an infinite sequence:

Definition 2: Given a NBW A = (Q, Σ, δ, q0, F), an ω-word w is defined as w = w(0)w(1)w(2)…, where w(i) ∈ Σ for all i ∈ N. A run r of A over w is an infinite sequence of states r = r(0)r(1)r(2)… with r(0) = q0 and r(i+1) ∈ δ(r(i), w(i)) for all i ∈ N.
Let the set of infinitely often visited states in a run r, i.e. in an infinite sequence of states, be defined as inf(r) = {q ∈ Q | ∃^∞ i ∈ N : r(i) = q}, where ∃^∞ i denotes the existence of infinitely many i.

² Where 2^Q denotes the powerset of the set of states Q.

Before we can illustrate how the behaviour of a NBW can be expressed as a set of ω-words, we need to define the concept of acceptance of a run:

Definition 3: Let A = (Q, Σ, δ, q0, F) be a NBW and let r be a run on A. Then r is called accepting if and only if inf(r) ∩ F ≠ ∅. An ω-word w is accepted by A if there exists an accepting run r of A on w.
Büchi automata can be represented graphically as state machines, with states illustrated by circles, accepting states by double-lined circles, the initial state by an unlabelled arrow pointing to it, and transitions by labelled arrows. We can now look at the example automaton A_ex1 from Fig. 1 and its behaviour on two example ω-words. We can see that over w1 there is only one possible run on A_ex1, namely r_w1 = q0 q0 q0 q0 …, the run staying in q0 forever. Thus, from inf(r_w1) = {q0} and F = {q1}, we can conclude that r_w1 is non-accepting, and since it is the only run over w1, w1 is consequently not accepted by A_ex1.
Over w2 there are infinitely many runs on A_ex1, depending on when the non-deterministic jump from q0 to q1 is taken. Even though a non-accepting run is possible (staying in q0 forever), this time we can also find accepting runs, e.g. r_w2 = q0 q0 q1 q1 q1 …, concluding that w2 is accepted by A_ex1.
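For ultimately periodic words (of the form u·v^ω), this kind of acceptance reasoning can be mechanized. The sketch below checks membership by building the product of the automaton with the positions of the lasso u·v and searching for a reachable accepting cycle; the dict-based encoding and the exact shape of A_ex1 (reconstructed here from the description of Fig. 1, i.e. the language (a|b)*b^ω) are our assumptions, not taken verbatim from the article.

```python
from collections import deque

def accepts_lasso(delta, q0, F, u, v):
    """Check whether the NBW given by (delta, q0, F) accepts the ultimately
    periodic omega-word u . v^omega.  We take the product of the automaton
    with the len(u)+len(v) positions of the lasso and look for a reachable
    cycle through a product node whose state is accepting and whose position
    lies in the loop part (an accepting lasso in the product)."""
    n = len(u) + len(v)
    word = u + v

    def succ_pos(i):
        # after the last position, the word loops back to the start of v
        return i + 1 if i + 1 < n else len(u)

    def successors(node):
        q, i = node
        return [(q2, succ_pos(i)) for q2 in delta.get(q, {}).get(word[i], ())]

    # forward reachability from the initial configuration
    reach = {(q0, 0)}
    todo = deque(reach)
    while todo:
        for nxt in successors(todo.popleft()):
            if nxt not in reach:
                reach.add(nxt)
                todo.append(nxt)

    # an accepting product node in the loop part must be reachable from itself
    for node in reach:
        q, i = node
        if q in F and i >= len(u):
            frontier = deque(successors(node))
            seen = set(frontier)
            while frontier:
                cur = frontier.popleft()
                if cur == node:
                    return True
                for nxt in successors(cur):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append(nxt)
    return False

# A_ex1, reconstructed from the description of Fig. 1 (an assumption):
delta_ex1 = {"q0": {"a": {"q0"}, "b": {"q0", "q1"}},
             "q1": {"b": {"q1"}}}
```

With this encoding, `accepts_lasso(delta_ex1, "q0", {"q1"}, "a", "b")` decides the word a·b^ω, and a^ω or (ab)^ω are rejected, matching the intended language (a|b)*b^ω.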
Definition 4: Given a NBW A = (Q, Σ, δ, q0, F), we define the language L_ω(A) accepted by the automaton A as L_ω(A) = {w ∈ Σ^ω | w is accepted by A}.
Following from [1], given any NBW A, the language L_ω(A) is ω-regular. Looking back at the example from Fig. 1, we can now reason about all the words that are accepted by this automaton: these words form the language L_ω(A_ex1) = {w ∈ Σ^ω | w contains finitely many a} = (a|b)*b^ω.

140994 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

This article will focus on the universality property of Büchi automata, an important property in automaton-based verification [4], defined as follows:

Definition 5: Given a NBW A = (Q, Σ, δ, q0, F), we say that A is called universal if and only if L_ω(A) = Σ^ω.

Contrarily, A is called empty if and only if L_ω(A) = ∅.
As mentioned in the introduction, the automata-theoretic approach to verification reduces the property satisfaction problem to language containment [3], where complementing a NBW becomes the computational bottleneck. There are many approaches to complementing NBWs [10], [11].
The complementation problem requires two automata as input, and then checks whether one is the complement of the other. In order to avoid having two automata as input, we consider in this article the closely related problem of universality, which is computationally equally hard, i.e. PSPACE-complete [12]. In the universality problem, one checks whether or not a Büchi automaton given as input is universal. The relation to complementation stems from the following observation: A is universal if and only if its complement accepts the empty ω-language.

B. GRAPH NEURAL NETWORKS
In the past decade, the advent of deep neural networks has led to performance breakthroughs for numerous applications of machine learning and pattern recognition, including image classification [13], speech recognition [14], translation [15], and bioinformatics [16], to name just a few. Depending on the type of data representation, different network architectures have been proposed, notably convolutional neural networks (CNN) [17] for images, recurrent neural networks (RNN) [18] and transformers [19] for sequences, and graph neural networks (GNN) [20] for graphs. The latter are a natural choice for analyzing and classifying Büchi automata, since the automata can be represented in a straightforward manner by means of labelled graphs (see Section IV-A). More specifically, in this article we focus on Relational Graph Convolutional Networks (RGCN) [21], which are ideally suited for graphs with a discrete set of edge labels (in our case: the symbols in Σ). The layers of an RGCN are described in more detail in the following section.

1) RELATIONAL GRAPH CONVOLUTIONAL LAYERS
The following notation will be used for the RGCN input data. Let a graph G = (V, E, R) be represented by a set of edge relations R, a set of nodes V (with n = |V| the number of nodes) and a set of edges E (with m = |E| the number of edges). Let (v_i, r, v_j) ∈ E, for v_i, v_j ∈ V and r ∈ R, be a directed edge pointing from v_i to v_j belonging to relation r. We define the neighbourhood of a node v ∈ V for an edge label r ∈ R as N_r(v) = {u ∈ V | (u, r, v) ∈ E}. Let X^(l) ∈ R^{n×d^(l)} be the node feature matrix, with d^(l) denoting the number of node features at layer l.
An RGCN layer updates each node's feature representation by aggregating information from its neighbours in the graph. Let h_v^(l) ∈ R^{d^(l)} be the feature vector of node v at layer l. The RGCN layer computes the new feature vector h_v^(l+1) for node v at layer l + 1 as follows:

h_v^(l+1) = σ( W_0^(l) h_v^(l) + Σ_{r ∈ R} Σ_{u ∈ N_r(v)} (1 / c_{r,v}) W_r^(l) h_u^(l) ),

where σ is an activation function, W_r^(l) ∈ R^{d^(l+1)×d^(l)} is the weight matrix for relation type r at layer l, W_0^(l) ∈ R^{d^(l+1)×d^(l)} is a self-loop weight matrix of a special relation type accounting for the node's own feature representation, and c_{r,v} is a normalization constant.
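As an illustration, here is a minimal pure-Python sketch of this update rule (no deep-learning framework); choosing ReLU as σ and c_{r,v} = |N_r(v)| are assumptions on our part, since the text does not fix them.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) with vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def vadd(a, b):
    """Component-wise vector addition."""
    return [ai + bi for ai, bi in zip(a, b)]

def relu(x):
    return [max(0.0, xi) for xi in x]

def rgcn_layer(H, edges, W_rel, W_self, sigma=relu):
    """One RGCN layer update.
    H      : dict node -> feature vector h_v^(l)
    edges  : list of directed edges (u, r, v) of relation r
    W_rel  : dict r -> weight matrix W_r^(l)
    W_self : self-loop weight matrix W_0^(l)
    Returns a dict node -> h_v^(l+1)."""
    H_new = {}
    for v in H:
        # self-loop message W_0 h_v
        acc = matvec(W_self, H[v])
        for r, W in W_rel.items():
            # neighbourhood N_r(v): sources of r-edges pointing at v
            neigh = [u for (u, rel, w) in edges if rel == r and w == v]
            if not neigh:
                continue
            c = len(neigh)  # normalization constant c_{r,v} (our choice)
            msg = [0.0] * len(acc)
            for u in neigh:
                msg = vadd(msg, matvec(W, H[u]))
            acc = vadd(acc, [m / c for m in msg])
        H_new[v] = sigma(acc)
    return H_new
```

On a two-node graph with identity weights and a single edge (0, "a", 1), node 1 receives its own feature plus the message from node 0.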

2) GRAPH LEVEL CLASSIFICATION
With G = (V, E, R) being our graph, let h_v^(L) be the final feature vector of node v after L layers of the RGCN. The global pooling operation computes a graph-level representation h_G ∈ R^{d^(L)} by applying an aggregation function g to the node representations:

h_G = g({h_v^(L) | v ∈ V}).

This feature vector may be further processed using a final linear transformation to obtain a vector of confidences for each class. Specifically, we use a linear layer to map the graph-level feature vector h_G to a vector z ∈ R^s:

z = W_f h_G + b,

where W_f ∈ R^{s×d^(L)} is the final weight matrix, b ∈ R^s is a bias vector and s is the number of classes in the dataset. The output vector z can be interpreted as the confidence of the graph belonging to each class, with a normalizing softmax function yielding the probability distribution vector ŷ ∈ R^s as follows:

ŷ_i = exp(z_i) / Σ_{j=1}^{s} exp(z_j).

During training, we learn the parameters (W_r^(l), W_0^(l), W_f and b, for all r ∈ R and l ∈ N with 0 ≤ l ≤ L) of the RGCN layers by minimizing the difference between the predicted graph-level output ŷ and the ground-truth label y.
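A matching sketch of the readout pipeline (pooling, linear layer, softmax); using the component-wise mean as the aggregation function g is our assumption, as the text leaves g abstract.

```python
import math

def mean_pool(H):
    """Graph-level readout h_G: component-wise mean of the node vectors."""
    vecs = list(H.values())
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def linear(W, b, x):
    """z = W x + b for a matrix W (list of rows) and bias vector b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(z):
    """Normalize confidences z into a probability distribution y_hat."""
    m = max(z)                              # shift for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]
```

Chaining `softmax(linear(W_f, b, mean_pool(H)))` yields the class distribution ŷ for one graph.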
To do this, we define the cross-entropy loss function L (where, for an i ∈ N, y_i denotes the i-th element of vector y) as follows:

L = − Σ_{i=1}^{s} y_i log(ŷ_i),

which captures the discrepancy between the predicted label and the true label. To reduce this difference between the predicted and true label of the training graph, we compute the gradient ∂L/∂W_r^(l) of the loss function for each parameter. The gradients for the other learnable parameters W_0^(l), W_f and b are computed similarly and are omitted here. Finally, an optimization algorithm, e.g. stochastic gradient descent, is applied to update the parameters using the following rule (again only for W_r^(l)):

W_r^(l) ← W_r^(l) − η · ∂L/∂W_r^(l),

where η represents the learning rate, a parameter defining the step size of the learning, i.e. the ratio of the parameter update with respect to its gradient.
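The loss and the update rule can be sketched directly. The gradient itself would come from backpropagation, so the snippet below only computes the loss value and applies one SGD step to a hypothetical, precomputed gradient.

```python
import math

def cross_entropy(y, y_hat):
    """L = -sum_i y_i * log(y_hat_i), for a one-hot (or soft) label y."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))

def sgd_step(W, grad, eta):
    """W <- W - eta * dL/dW, element-wise for a matrix stored as rows.
    `grad` is assumed to be precomputed (e.g. by backpropagation)."""
    return [[wij - eta * gij for wij, gij in zip(rw, rg)]
            for rw, rg in zip(W, grad)]
```

For a one-hot label y = (1, 0) and prediction ŷ = (0.5, 0.5), the loss equals log 2, and a step with η = 0.1 shrinks each weight by 0.1 times its gradient.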

III. NBW TRANSFORMATIONS
This section will introduce the concept of preserving transformations on Büchi automata, i.e. transformations that do not violate universality (resp. non-universality): if such a transformation is applied to a universal automaton A, then the universality of the automaton is preserved by the transformation. We define these transformations as follows:

Definition 6: Let BA be the set of all Büchi automata and P be the set of all transformation parameters. Then, in general, a preserving transformation T : BA × P → BA is a function that performs a transformation with the given parameters on the given Büchi automaton and outputs the transformed automaton (the output automaton is universal if and only if the input automaton was universal).
There also exist transformations that only preserve universality or only non-universality, which leads to the following definitions. A transformation T_u : BA × P → BA preserves universality if, for every universal A ∈ BA and every p ∈ P, the automaton T_u(A, p) is universal. Similarly, a transformation T_n : BA × P → BA preserves non-universality if, for every non-universal A ∈ BA and every p ∈ P, the automaton T_n(A, p) is non-universal. The following subsections introduce preserving transformations and prove that they preserve (non-)universality. Here is an overview of all the transformations:

• Transformations preserving universality (TPU):
  - Add state: A new non-accepting state is added to the given automaton.
  - Add transition: A new transition is added to the given automaton.
  - Make state accepting: A non-accepting state is made accepting.
• Transformations preserving non-universality (TPN)³:
  - Remove transition: An existing transition is removed from the given automaton.
  - Remove acceptance from state: An accepting state is made non-accepting.
• Preserving transformations (TP):
  - Split self-loop: A self-loop transition is rerouted through a newly introduced state and back.
  - Expand self-loop: New states are added to the automaton and a self-loop is replaced by connecting its state to the added states.
  - Duplicate strongly connected component (SCC): Copies an existing SCC and its behaviour and adds an interim accepting state.
  - Duplicate transition behaviour: Copies the destination state of an existing transition together with that state's outgoing transition behaviour.

A. TRANSFORMATIONS PRESERVING UNIVERSALITY
To start off, we will introduce transformations that are guaranteed to preserve universality. The goal here is simple: add elements to the automaton without changing the states and transitions that are already present. This guarantees that all words accepted by the original automaton are also accepted by the transformed one, i.e. universality is preserved.

Definition 7: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton.
• Let q+ ∉ Q be a newly introduced state. The TPU ''adding a state'' T_s^u is defined as T_s^u(A, {q+}) = (Q ∪ {q+}, Σ, δ, q0, F).
• Let q_src, q_dest ∈ Q be states of A and c ∈ Σ. The TPU ''adding a transition'' T_t^u is defined as T_t^u(A, [q_src, q_dest, c]) = (Q, Σ, δ′, q0, F), where δ′(q_src, c) = δ(q_src, c) ∪ {q_dest} and δ′ agrees with δ on all other arguments.
• Let q_a ∈ Q be a non-accepting state of A. The TPU ''make accepting'' T_a^u is defined as T_a^u(A, {q_a}) = (Q, Σ, δ, q0, F ∪ {q_a}).

All three transformations that strictly preserve universality are relatively straightforward, and the proofs that all of them indeed preserve universality are trivial,⁵ so the following lemma will be given without proof:

Lemma 1: Let A = (Q, Σ, δ, q0, F) be a universal Büchi automaton. Then T_s^u(A, ·), T_t^u(A, ·) and T_a^u(A, ·) are universal.

B. TRANSFORMATIONS PRESERVING NON-UNIVERSALITY

The transformations in this subsection guarantee that a non-universal input automaton cannot be made universal by applying them: as long as no new words can be added to the language accepted by the input automaton, these transformations are guaranteed to preserve non-universality.

Definition 8: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton.
• Let q_src, q_dest ∈ Q be states of A and c ∈ Σ. The TPN ''removing a transition'' T_t^n is defined as T_t^n(A, [q_src, q_dest, c]) = (Q, Σ, δ′, q0, F), where δ′(q_src, c) = δ(q_src, c) \ {q_dest} and δ′ agrees with δ on all other arguments.
• Let q_a ∈ F be an accepting state of A. The TPN ''remove accepting'' T_a^n is defined as T_a^n(A, {q_a}) = (Q, Σ, δ, q0, F \ {q_a}).

Similarly to the case of the universal automata, both these transformations can easily be shown to preserve non-universality, proving the following lemma:

Lemma 2: Let A = (Q, Σ, δ, q0, F) be a non-universal Büchi automaton. Then T_t^n(A, ·) and T_a^n(A, ·) are non-universal.

C. PRESERVING TRANSFORMATIONS

We will now introduce transformations that are even language preserving, i.e.
the language accepted by the transformed automaton is identical to the language accepted by the input automaton. From this it follows that these transformations preserve both universality and non-universality. The goal here is to change or add components of the automaton while ensuring that all runs that are added resp. changed do not change the behaviour of the automaton.
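Before turning to the language-preserving transformations, the simple transformations of Definitions 7 and 8 can be sketched as operations on a dict-based NBW encoding (Q, Σ, δ, q0, F); the encoding itself is our assumption, not fixed by the text.

```python
from copy import deepcopy

# NBW encoded (by assumption) as a tuple (Q, Sigma, delta, q0, F) with
# Q, Sigma, F sets and delta a dict: state -> symbol -> set of states.

def add_state(A, q_new):
    """TPU T_s^u: add a fresh non-accepting state (no transitions touch it)."""
    Q, S, d, q0, F = A
    assert q_new not in Q
    return (Q | {q_new}, S, deepcopy(d), q0, set(F))

def add_transition(A, q_src, q_dst, c):
    """TPU T_t^u: add the transition q_src --c--> q_dst."""
    Q, S, d, q0, F = A
    d = deepcopy(d)
    d.setdefault(q_src, {}).setdefault(c, set()).add(q_dst)
    return (Q, S, d, q0, set(F))

def make_accepting(A, q_a):
    """TPU T_a^u: F becomes F + {q_a}."""
    Q, S, d, q0, F = A
    return (Q, S, deepcopy(d), q0, F | {q_a})

def remove_transition(A, q_src, q_dst, c):
    """TPN T_t^n: remove q_src --c--> q_dst if present."""
    Q, S, d, q0, F = A
    d = deepcopy(d)
    d.get(q_src, {}).get(c, set()).discard(q_dst)
    return (Q, S, d, q0, set(F))

def remove_accepting(A, q_a):
    """TPN T_a^n: F becomes F - {q_a}."""
    Q, S, d, q0, F = A
    return (Q, S, deepcopy(d), q0, F - {q_a})
```

Each function returns a fresh automaton, leaving the input untouched, so transformations can be chained freely during dataset generation.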
To start, we will have a transformation that splits self-loops over states:

Definition 9: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton, let q_sp ∈ Q be a state of A with at least one self-loop and let q+ ∉ Q be a newly introduced state. Then the TP ''split loop'' T_l is defined as T_l(A, {q_sp}) = (Q ∪ {q+}, Σ, δ_l, q0, F_sp), where F_sp = F ∪ {q+} if q_sp ∈ F and F_sp = F otherwise, and where, for all q ∈ Q ∪ {q+} and a ∈ Σ, δ_l reroutes the self-loops of q_sp through q+: every self-loop q_sp ∈ δ(q_sp, a) is replaced by the transitions q+ ∈ δ_l(q_sp, a) and q_sp ∈ δ_l(q+, a), while δ_l agrees with δ on all other arguments. The effects of this transformation can be seen in Fig. 2.

⁵ Every single run on any word that is present in the original automaton is also present in the transformed automaton.
Let us now prove that this transformation is indeed preserving. The following theorem shows language equality of the original automaton and the transformed one (as a corollary also proving its preserving nature).
Theorem 1: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton. For a q_sp ∈ Q, let A_l = T_l(A, {q_sp}). Then L_ω(A) = L_ω(A_l).

Proof: To show ω-language equality, we show both L_ω(A) ⊆ L_ω(A_l) and L_ω(A) ⊇ L_ω(A_l).

Let w = w(0)w(1)w(2)… ∈ L_ω(A) and let r_w = r_w(0)r_w(1)r_w(2)… be an accepting run of A over w. We show that from r_w we can construct an accepting run r′_w of A_l over w. Let r′_w be the run such that r′_w(i) = r_w(i) for all i ∈ N with r_w(i) ≠ q_sp. For handling the occurrences of q_sp in r_w, we separate two cases: maximal finite subruns consisting only of q_sp, and a maximal suffix consisting only of q_sp. In both cases, the repetitions of q_sp are replaced in r′_w by alternations between q_sp and q+. The resulting run r′_w is, by construction, a run of A_l over w. Furthermore, inf(r′_w) ∩ F_sp ≠ ∅: if an accepting state other than q_sp is visited infinitely often by r_w, it is also visited infinitely often by r′_w, and if only q_sp is accepting and is visited infinitely often by r_w, the accepting state q+ is also visited infinitely often by r′_w.
For the other direction, let w ∈ L_ω(A_l) and let r_w = r_w(0)r_w(1)r_w(2)… be an accepting run of A_l over w. By construction, there exists a run r′_w of A over w obtained by replacing every occurrence of q+ in r_w with q_sp. From q+ ∈ F_sp ⇒ q_sp ∈ F it follows that r′_w is guaranteed to visit an accepting state infinitely often and thus w ∈ L_ω(A).
Let us now continue with a transformation that replaces a chosen state q_c with a self-loop by a number of states that have the same ingoing and outgoing transitions as q_c but that form a strongly connected component (SCC) together with the self-loop transitions, in order to grow the size of the automaton without changing its behaviour. Formally:

Definition 10: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton and let q_c ∈ Q be a state of A. Let x ∈ N be the number of states to be added and let Q+ with Q+ ∩ Q = {q_c} be the set consisting of the chosen q_c and of x newly added states. Let Σ_sl be the set of symbols that create a self-loop on q_c, i.e. Σ_sl = {a ∈ Σ | q_c ∈ δ(q_c, a)}. Then the TP ''expand self-loop'' T_x is defined as follows:

Now, let A_x = T_x(A, [q_c, x]) = (Q ∪ Q+, Σ, δ_x, q0, F), where δ_x keeps all transitions not involving the self-loops of q_c unchanged (transition rule (1)), gives every state in Q+ the same outgoing transitions as q_c for the symbols outside Σ_sl, and replaces the self-loops of q_c by a complete and deterministic transition function δ+ on Q+ for the symbols in Σ_sl (transition rule (2)). In Fig. 3⁶ (where the double resp. wavy edges represent the same sets of in- resp. outgoing edges), we can see a snippet of an arbitrary automaton illustrating how this transformation works. The transformation allows the runs that would stay in q_c (either finitely long or forever) to jump around between q_c and the newly introduced states, which all share the same outgoing transitions, so the visits to q_c do not change the possible runs. Let us now formally prove that this transformation guarantees language equivalence:

Theorem 2: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton.
For a q_c ∈ Q and a transition function δ+ between the states of Q+, let A_x = T_x(A, [q_c, x]). Then L_ω(A) = L_ω(A_x).

Proof: To show ω-language equality, we show both L_ω(A) ⊆ L_ω(A_x) and L_ω(A) ⊇ L_ω(A_x).

Let w = w(0)w(1)w(2)… ∈ L_ω(A) and let r_w = r_w(0)r_w(1)r_w(2)… be an accepting run of A over w. We show that from r_w we can construct an accepting run r′_w of A_x over w by a case separation on the predecessor of each position i with r_w(i) = q_c: if r_w(i − 1) ≠ q_c, the visit to q_c is kept as it is by transition rule (1); if r_w(i − 1) = q_c, then w(i − 1) ∈ Σ_sl, and since δ+ is complete and deterministic, transition rule (2) yields a unique successor in Q+ for r′_w. To show that r′_w is accepting, it suffices to see that if an accepting state other than q_c is visited infinitely often by r_w over A, it is also visited infinitely often by r′_w over A_x.
For the other direction, let w ∈ L_ω(A_x) and let r_w = r_w(0)r_w(1)r_w(2)… be an accepting run of A_x over w. By construction, there exists a run r′_w of A over w obtained by replacing every state of Q+ in r_w with q_c; r′_w is guaranteed to visit an accepting state infinitely often and thus w ∈ L_ω(A).
The next transformation generates an exact copy of a strongly connected component of the automaton together with a newly introduced accepting interim state that can be visited at most once, thus not changing the language of the automaton. An example follows after the formal definition:

Definition 11: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton, let C ⊆ Q be an SCC of A, let q_c ∈ C be a state in C and let a_c ∈ Σ be a symbol with an outgoing transition from q_c, i.e. δ(q_c, a_c) ≠ ∅. We introduce a copy C′⁷ of the given SCC and an accepting interim state q+ (with neither q+ nor any of the states from C′ in Q). Then the TP ''duplicate SCC'' T_d is defined as T_d(A, [C, q_c, a_c]) = (Q ∪ C′ ∪ {q+}, Σ, δ_d, q0, F ∪ {q+}), where δ_d keeps all original transitions of A, mirrors the transitions inside C on the copy C′, and routes the a_c-transition of q_c once through the interim state q+ into the copy.

⁷ With, for any set of states S: S′ = {q′ | q ∈ S}.
Before we have a look at the preserving properties of this construction, let us illustrate the transformation with the example in Fig. 4 (where dashes mark the added states and transitions).
This transformation is also a language-preserving transformation, which we are going to prove now:

Theorem 3: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton and let A_d = T_d(A, [C, q_c, a_c]). Then L_ω(A) = L_ω(A_d).

Proof: To show ω-language equality, we show both L_ω(A) ⊆ L_ω(A_d) and L_ω(A) ⊇ L_ω(A_d). Let w = w(0)w(1)w(2)… ∈ L_ω(A) and let r_w be an accepting run of A over w. Then, by construction, r_w is also an accepting run of A_d over w, so w ∈ L_ω(A_d). Conversely, let w ∈ L_ω(A_d) and let r_w be an accepting run of A_d over w. First off, if r_w does not visit q+, then by construction there exists a run r′_w of A over w identical to r_w, from which w ∈ L_ω(A) follows. Furthermore, by construction, for every accepting run visiting q+ there also exists an accepting run that does not visit q+ (this follows from transition rule 2 of δ_d), i.e. we can derive the previous conclusion that there is an accepting run of A over w, i.e. w ∈ L_ω(A).
The next transformation duplicates the behaviour of a transition by introducing a new state that can only be reached by passing through the duplicated transition, meaning that the language accepted by the transformed automaton remains unchanged. Formally:

Definition 12: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton, let q_src, q_dest ∈ Q be states of A and let a_c ∈ Σ be a symbol with a transition going from q_src to q_dest, i.e. q_dest ∈ δ(q_src, a_c). Let q+ ∉ Q be a newly introduced state. Then the TP ''duplicate transition behaviour'' T_t is defined as T_t(A, [q_src, q_dest, a_c]) = (Q ∪ {q+}, Σ, δ_t, q0, F), where, for all q and a:

δ_t(q, a) = δ(q, a) ∪ {q+}  if q = q_src and a = a_c,
δ_t(q, a) = {q | q ∈ ⋃_{q_s ∈ δ(q_src, a_c)} δ(q_s, a)}  if q = q+,
δ_t(q, a) = δ(q, a)  otherwise.

FIGURE 5. With Σ = {a, b}, A on top and A_t = T_t(A, [q_src, q_dest, a_c]) on the bottom.

Before proving the preserving properties of this construction, let us again illustrate the transformation with the example from Fig. 5. Being very similar to the duplicate SCC transformation (for every accepted word, it is possible to find at least one accepting run that exists both in A and A_t, namely one not passing through q+), we omit the proof of the following language equivalence theorem:

Theorem 4: Let A = (Q, Σ, δ, q0, F) be a Büchi automaton.
For two states q_src, q_dest ∈ Q and a_c ∈ Σ, let A_t = T_t(A, [q_src, q_dest, a_c]). Then L_ω(A) = L_ω(A_t).

IV. BÜCHI AUTOMATA DATASETS
After gaining familiarity with the different transformations used to modify NBWs while preserving their universality property, the automaton structures now need to be encoded as input for GNNs in order to populate a range of datasets. This section describes the encoding process for GNNs, the construction of the transformation datasets and the generation of randomly generated labelled automata.

A. NBW AS RGCN INPUT DATA
This section will detail the process of encoding Büchi automata as RGCN input data (as described in Section II-B).
Let A = (Q, Σ, δ, q0, F) be a Büchi automaton. The RGCN input graph G = (V, E, R) is then defined as follows:
• The set of nodes V is equal to the set of states of the automaton, i.e. V = Q.
• The set of relations R is the set of symbols Σ, i.e. R = Σ.
• The set of edges E is constructed as follows: (v_i, r, v_j) ∈ E, for v_i, v_j ∈ V and r ∈ R, if and only if v_j ∈ δ(v_i, r).

The initial node feature matrix X^(0) (i.e. for layer l = 0) consists, for every v ∈ V, of the feature vector h_v^(0) = [v_init, v_acc], where v_init indicates whether v is the initial state and v_acc whether v is an accepting state. These initial node features are all the features used to define the automata. The node features of the hidden layers will be node embeddings learned by the model.
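A sketch of this encoding; the dict-based NBW input, the exact feature layout [v_init, v_acc] and the sorting of V and R are our assumptions.

```python
def nbw_to_rgcn_input(A):
    """Encode an NBW (Q, Sigma, delta, q0, F) as RGCN input:
    nodes V = Q, relations R = Sigma, edges (v_i, r, v_j) iff v_j in
    delta(v_i, r), and per-node features [is_initial, is_accepting]."""
    Q, S, delta, q0, F = A
    V = sorted(Q)   # fixed node order (our choice)
    R = sorted(S)   # fixed relation order (our choice)
    E = [(u, r, v)
         for u in V
         for r in R
         for v in delta.get(u, {}).get(r, ())]
    # initial feature vectors h_v^(0) = [v_init, v_acc]
    X0 = {v: [1.0 if v == q0 else 0.0, 1.0 if v in F else 0.0] for v in V}
    return V, R, E, X0
```

The returned edge list and feature dict plug directly into a layer such as the `rgcn_layer` sketch from Section II-B.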

B. TRANSFORMATION DATASETS
This section will give details about the process of generating labelled datasets using preserving transformations and the functions used; the process can be consulted in Algorithm 1.
Given a number of automata d ∈ N, a minimum and maximum size n_min, n_max ∈ N, a set of labelled universal base automata A_u, a set of labelled non-universal base automata A_nu and a set of transformations T with an accompanying set of weights W_T ⊆ N, the dataset generation procedure generates d different automata with a random size bound between n_min and n_max: it randomly chooses a universal base automaton A_base ∈ A_u for the first d/2 automata and a non-universal base automaton A_base ∈ A_nu for the second half, then repeatedly applies a randomly chosen transformation T from the transformation set T (using the weights W_T) until the bound of desired nodes is surpassed. To make all the states in the automaton relevant with regard to its language acceptance, i.e. to remove structural noise for the GNN that does not change the behaviour of the automaton structure, the resulting automaton is pruned before being added to the dataset. The various functions from Algorithm 1 are defined as follows:
• The function 'random_integer(n_min, n_max)' returns, for n_min, n_max ∈ N, a random integer from [n_min, n_max] with a uniform distribution.
• The functions 'random_choice(A_u)' and 'random_choice(A_nu)' return a random automaton from the sets A_u resp. A_nu with a uniform distribution.
• The function 'random_weighted_choice(T, W_T)' chooses a transformation from the set of transformations T = {T_1, T_2, …, T_n} with probability p_i = w_i / Σ_{j=1}^{n} w_j, where W_T = {w_1, w_2, …, w_n} is the set of non-negative integer weights.
• The function 'apply_T (automaton)' applies the transformation T to the given automaton.Depending on T , the corresponding transformation parameters are also passed to the function, in accordance with the definitions from Section III.
• The function 'prune(automaton)' prunes the given automaton by removing all nodes that are not reachable from the initial state or that have no path to a self-reaching state.⁸ An example of this procedure can be seen in Fig. 6.
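The pruning step can be sketched as two reachability analyses: forward from the initial state, and backward towards states that lie on a cycle (the "self-reaching" states). The dict-based NBW encoding is our assumption.

```python
from collections import deque

def prune(A):
    """Remove states not reachable from the initial state or with no path
    to a self-reaching state (a state lying on a cycle), as in the pruning
    step of the dataset generation.  NBW encoded as (Q, Sigma, delta, q0, F)."""
    Q, S, delta, q0, F = A
    # symbol-agnostic successor map
    succ = {q: {v for targets in delta.get(q, {}).values() for v in targets}
            for q in Q}

    def reachable(starts):
        seen, todo = set(starts), deque(starts)
        while todo:
            for v in succ.get(todo.popleft(), ()):
                if v not in seen:
                    seen.add(v)
                    todo.append(v)
        return seen

    fwd = reachable({q0})
    # self-reaching states: q is reachable from one of its own successors
    on_cycle = {q for q in Q if q in reachable(succ[q])}
    # states with a path to some self-reaching state
    useful = {q for q in Q if reachable({q}) & on_cycle}
    keep = fwd & useful
    new_delta = {q: {a: {v for v in t if v in keep} for a, t in m.items()}
                 for q, m in delta.items() if q in keep}
    return (keep, S, new_delta, q0, F & keep)
```

For example, a dead-end state reachable from q0 but without any path back to a cycle is removed, together with all unreachable states.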

C. ERDÖS-RENYI DATASETS
Although the universality check for Büchi automata is computationally complex, it is still feasible for small automata. This allows us to generate correctly labelled datasets of randomly generated small automata, providing ground truth data that can be used as base automata for our transformation datasets, as testing and validation datasets for comparative analysis, and to improve network performance through inclusion in training. In this section, we will first highlight how these small automata are randomly generated and then show the computational limits of the universality check with respect to the automaton size.

1) RANDOM GENERATION OF NBW
The random automata generation is based on the Erdös-Rényi graph model [22], where a graph G(n, p) is defined as a graph with n nodes in which every possible edge is included with probability p. We extended this approach to NBWs by counting all possible edges of the graph structure once for each symbol in Σ. In addition, a second probability p_acc is defined that determines for each state whether it belongs to F, i.e. is accepting. Furthermore, we allow for more variety in the generation by letting the input be upper and lower bounds for node sizes, edge probabilities and acceptance probabilities (with n, p and p_acc being randomly chosen between these bounds) and reduce the automaton structure to only include states that are self-reachable and reachable from the initial state, by applying the same pruning procedure seen before.
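A sketch of this Erdös-Rényi-style NBW generation for fixed n, p and p_acc (the bound-sampling and pruning described above would wrap around it); the alphabet, the tuple encoding and taking state 0 as initial are our assumptions.

```python
import random

def random_nbw(n, p, p_acc, alphabet=("a", "b"), seed=None):
    """Erdös-Rényi-style NBW: n states; for each symbol, every possible
    directed edge is included independently with probability p; each state
    is accepting with probability p_acc.  State 0 is the initial state
    (our convention); pruning is applied separately."""
    rng = random.Random(seed)
    Q = set(range(n))
    delta = {q: {a: {q2 for q2 in Q if rng.random() < p} for a in alphabet}
             for q in Q}
    F = {q for q in Q if rng.random() < p_acc}
    return (Q, set(alphabet), delta, 0, F)
```

With p = p_acc = 1.0 the generator degenerates to the complete automaton with all states accepting, which makes the construction easy to check.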

2) ALGORITHMICALLY CHECKING UNIVERSALITY
As already discussed in Section II, the complexity bottleneck of the universality check is the complementation construction for NBWs, as it yields a state growth of O((0.76n)^n). For our implementation, we used a simplified version of the optimal construction from [23], which is slightly slower in the worst case. The emptiness check of the complement can subsequently be done in linear time to determine universality of the original automaton.
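The emptiness check itself reduces to reachability: the language of a NBW is non-empty iff some accepting state is reachable from the initial state and lies on a cycle. The sketch below uses plain BFS and is quadratic in the worst case, not the linear-time nested-DFS or SCC-based variant; the dict-based encoding is our assumption.

```python
from collections import deque

def is_empty(A):
    """Emptiness check for an NBW (Q, Sigma, delta, q0, F): the language is
    non-empty iff some accepting state is reachable from q0 and lies on a
    cycle (an accepting lasso exists)."""
    Q, S, delta, q0, F = A
    succ = {q: {v for targets in delta.get(q, {}).values() for v in targets}
            for q in Q}

    def reachable(starts):
        seen, todo = set(starts), deque(starts)
        while todo:
            for v in succ.get(todo.popleft(), ()):
                if v not in seen:
                    seen.add(v)
                    todo.append(v)
        return seen

    fwd = reachable({q0})
    for q in F & fwd:
        if q in reachable(succ[q]):   # q lies on a cycle
            return False              # accepting lasso found: non-empty
    return True
```

Applied to the complement automaton, `is_empty` returning True means the original automaton is universal.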
Table 1 shows the computation times and complement automaton sizes for 1000 constructions of complement NBWs with 6, 8, and 10 nodes. The slowest construction for each size of the starting NBW took a significant amount of time, with the slowest construction for an NBW with 10 nodes taking over two and a half hours to complete. The average computation times were 0.692 seconds, 1.32 seconds, and 14.99 seconds for NBWs with 6, 8, and 10 nodes, respectively, which is consistent with the complexity of the algorithm. It is also noteworthy that the random generation of the input NBWs influenced the worst-case computation times, with the slowest computation time for NBWs with 6 nodes being slower than the slowest time for NBWs with 8 nodes, due to the random nature of the generated NBWs.

V. EXPERIMENTAL RESULTS
We have conducted several experiments to investigate the potential of using GNN for predicting universality of Büchi automata.In the following, Section A describes the datasets of automata used, Section B presents the hyper-parameter optimization for the GNN, and Section C discusses the classification results obtained.
All the experiments regarding the construction and generation of datasets were conducted on a laptop with an Intel(R) Core i7 CPU and 16 GB RAM. All RGCN models described in this section were trained on up to 8 NVIDIA V100 GPUs with 32 GB RAM each.

A. DATASETS
This section presents the datasets used in the experiments and establishes consistent naming conventions for the remainder of this section.
Let us start by defining the transformed TRF datasets, which were constructed using the transformation procedure outlined in Section IV-B. These datasets are primarily differentiated by the number of data elements they contain and by the size of the automata, determined by the parameters n_min and n_max for the dataset generation (given in parentheses) and denoted by an S, M, L or A at the end of the dataset name abbreviation:
• small (7, 7): dataset name TRFS,
• medium (30, 40): dataset name TRFM,
• large (70, 80): dataset name TRFL,
• all (10, 80): dataset name TRFA.
Fig. 7 presents distribution charts depicting the sizes of automata and their class membership per automaton size for the datasets comprising 2500 automata (utilized as test sets throughout the experiments) across the three automaton sizes. It is interesting to note that, due to the pruning of the transformed automata and the nature of the transformations, the smaller automata produced by the transformation procedure are more frequently non-universal.
These datasets use only the preserving transformations from Section III-C to expand both universal and non-universal base automata. This minimizes the bias that transformations preserving only universality or only non-universality would introduce into the structure of the automata. The generation parameters and dataset statistics of the transformed datasets can be examined in Table 2.
The Erdös-Rényi ER datasets, containing randomly generated and labelled automata, are mainly differentiated by the amount of data they contain (as Section IV-C showed, these datasets can only contain small automata due to computational complexity constraints during their generation). Table 3 provides the dataset statistics.
With the aim of encompassing the variability introduced by our different data generation strategies, we combined the NBW from the training sets of each approach (adding up to approximately 75000 data elements). This merged dataset gives another data configuration for exploring the impact the dataset creation strategy may have on the performance of the trained model. Fig. 8 gives an overview of the distribution of the number of nodes (per universality class) over the dataset.

B. NEURAL NETWORK ARCHITECTURE AND PARAMETERS
In this section, we describe our choices regarding network architecture and hyper-parameters, which were optimized with respect to the classification accuracy on a validation dataset of 10k ER automata. The choices of the optimizer (Adam [24]), the activation function (ReLU) and the normalization constant c_i,r = |N_i^r| follow the experimental conventions detailed in the RGCN paper [21]. The remaining parameters are the number of node features, the number of hidden layers, the learning rate, the number of epochs and the neural network architecture. They were optimized experimentally as follows:
• Number of hidden layers L and their number of node features (i.e. d^(l), for 1 ≤ l ≤ L): L = 4 and all d^(l) = 128. To determine these values, for all combinations of L ∈ {1, 4, 7} and d^(l) ∈ {64, 128, 256, 512, 1024}, 3 models were trained on all transformed training datasets (sizes 1000, 5000 and 10000), and their averaged classification accuracy on the validation set (averaging minimizes the effect of the random initialization of the learnable weights) was computed to determine the best performance. For the models with 4 hidden layers, the classification accuracy over all the different trained models was higher than for both 1 and 7 hidden layers, averaging 78.89% (compared to 74.18% for one hidden layer and 75.09% for 7 hidden layers). Thus, after fixing the number of hidden layers at 4, let us analyse in detail the classification accuracies for the various numbers of hidden node features, presented in Table 4. These results show that, over the various training datasets, the highest averaged classification accuracies were achieved by models with 128 node features in each hidden layer.
• Learning rate: 0.001. With the layer parameters fixed, we conducted several comparative experiments regarding the learning rate, training 3 models for each of the transformed training sets with each of the learning rates 0.01, 0.005, 0.001 and 0.0005. The average classification accuracy over the 3 models on the validation set for each configuration is presented in Fig. 9. These results show that the learning rates 0.01 and 0.005 perform worse, while between 0.001 and 0.0005 the choice does not lead to significant changes in classification accuracy. We therefore fix the learning rate at 0.001, as it achieves a slightly higher average classification accuracy (79.85% as opposed to 79.63%) and incidentally is also the choice proposed in the paper introducing the Adam optimizer.
• Number of epochs: flexible (maximum 100). The rolling average classification accuracy over the validation set for the past 5 epochs is calculated after each epoch. After 100 epochs, the model configuration from the epoch with the highest rolling average is output.
During the training process, the models start overfitting to the training set. This is illustrated by Fig. 10, where in both cases, after the initial rise in classification accuracy, the accuracy keeps improving on the training data but slowly converges on the validation set, which is why the model at the point of highest average accuracy over the validation set is output.
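The epoch-selection rule described above can be sketched as follows; the `train_epoch` and `validate` callbacks are an assumed interface for illustration, not the authors' code:

```python
from collections import deque
import copy

def train_with_rolling_best(model, train_epoch, validate,
                            max_epochs=100, window=5):
    """After each epoch, compute the rolling mean of the last `window`
    validation accuracies and remember the model state from the epoch
    with the highest rolling mean, as described in the text."""
    recent = deque(maxlen=window)   # last `window` validation accuracies
    best_avg, best_state = -1.0, None
    for _ in range(max_epochs):
        train_epoch(model)
        recent.append(validate(model))
        avg = sum(recent) / len(recent)
        if avg > best_avg:
            best_avg = avg
            best_state = copy.deepcopy(model)  # checkpoint this epoch
    return best_state, best_avg
```

Here `model` can be any copyable state; with a real network one would checkpoint its weights instead of deep-copying the object.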
• Network architecture: Relational GCN. With the hyper-parameters for the network training fixed, Table 5 shows the classification results when training with two different layer types, GCNConv and RGCN layers. For GCNConv, the only change in the automaton encoding stems from the encoding of the transition labels, as this architecture requires an edge feature vector of length |Σ| one-hot encoding the transition symbol (whereas RGCN edges are labelled with a relation), where the i-th bit is set to one if and only if the transition reads the i-th symbol in Σ. These results show that when trained on small automata, the GCNConv layers using the one-hot encoded edge labels perform similarly to the relational approach, but fall off more visibly when trained on larger automata, where more structural reasoning must be inferred since the validation-set automata are small.
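The two edge encodings described above can be sketched as follows; the transition representation and function name are illustrative assumptions:

```python
def encode_transitions(transitions, alphabet):
    """Two edge encodings for an NBW's transitions (sketch).  For RGCN,
    each transition symbol becomes a relation index; for GCNConv, it
    becomes a one-hot edge feature vector of length |alphabet| whose
    i-th bit is set iff the transition reads the i-th symbol."""
    idx = {s: i for i, s in enumerate(alphabet)}
    edge_index, edge_type, edge_attr = [], [], []
    for (src, sym, dst) in transitions:
        edge_index.append((src, dst))
        edge_type.append(idx[sym])        # RGCN: relation id per edge
        one_hot = [0] * len(alphabet)
        one_hot[idx[sym]] = 1
        edge_attr.append(one_hot)         # GCNConv: one-hot edge feature
    return edge_index, edge_type, edge_attr
```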

C. CLASSIFICATION EXPERIMENTS
With the parameters of the learning process fixed, we can now present the classification results, with an immediate analysis of each experiment, for the following combinations of training and test sets:
• Training on TRF automata and testing on TRF automata.
• Training on TRF automata and testing on ER automata.
• Training on ER automata and testing on TRF and ER datasets.
• Training on the merged set and testing on TRF and ER datasets.
This analysis allows us to show the impact the choice of training set (and the data creation strategy behind it) has on the capabilities of a model to recognize patterns behind the (non-)universality of a given input NBW.
To start, let us look at the classification results of models trained on the different transformed datasets. The first results, for test sets containing transformed datasets, can be consulted in Table 6.
There are a few things to note from these results:
• Solid performance when testing on other transformed datasets. Larger training sets lead to better classification accuracy over most test sets.
• Good generalization to larger transformed automata. The models learning on small NBW are able to generalize the structures needed to derive (non-)universality well to automata containing more nodes.
Table 7 shows the classification results when tasking models trained on the various transformed automata datasets with classifying Erdös-Rényi automata.
We can now add the following observations:
• A drop-off in classification accuracy on randomly generated NBW compared to the transformed automata test sets, especially for larger training datasets. The structurally less ''predictable'' randomly generated automata seem harder to classify, suggesting the conjecture that these models recognize the transformation patterns more than the patterns connected to (non-)universality.
• When removing the segregation into datasets containing small, medium and large automata and vastly augmenting the number of data elements available to the model for learning the automaton structures (i.e. training on 300k TRFA automata), we can see that the model strongly minimizes classification errors on the transformed test sets, but does not produce significantly better results on the randomly generated NBW.
Another observation in line with the data generation can be made by looking at the distribution of the classification errors (i.e. the distribution of misclassifications into false positives and false negatives), which can be consulted in Table 8.
Analysing these distributions shows that the models trained on transformed automata exhibit a bias towards non-universality when the automata are small, thus increasing the percentage of false negatives when tested on ER automata, which are exclusively small automata. This bias is a logical consequence of the size distribution seen earlier in Fig. 7, which showed that, due to the generation process, small transformed automata are more likely to be non-universal.
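The misclassification distribution reported in Table 8 amounts to splitting the errors into false positives and false negatives; a minimal sketch, with label 1 encoding "universal":

```python
def misclassification_distribution(y_true, y_pred):
    """Split classification errors into false positives (non-universal NBW
    predicted universal) and false negatives (universal NBW predicted
    non-universal), as percentages of all misclassifications.
    Labels: 1 = universal, 0 = non-universal."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = fp + fn
    if total == 0:
        return 0.0, 0.0
    return 100.0 * fp / total, 100.0 * fn / total
```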
To continue, we look at the performance of a model trained on the randomly generated NBW when tasked with classifying the various test sets. The results are presented in Table 9.
Here we see that the models trained on randomly generated and algorithmically classified Erdös-Rényi automata achieve a strong classification accuracy on the test set containing different randomly generated automata, with a weaker classification of the small transformed automata and a substantial decline for the medium and large transformed automata, showing a lack of generalization towards larger automata.
The final results show the classification accuracy of the models trained on the merged dataset when tasked with classifying the various test datasets. In Table 10, we can see that the training set containing a variety of differently constructed data elements leads to the best classification accuracies for the transformed automata, while also reaching very strong classification results on the 2500 randomly generated ER automata. This shows that the most promising way to infer (non-)universality from the general structure of a given NBW is to train on sets containing a large variety of automata with respect to their generation.

VI. CONCLUSION
This section recaps the results acquired previously, gives a concluding overview of the contents of this article, and presents various ideas for future research.

A. RESULTS AND CONCLUSION
Based on the achieved results, we can assert that our contribution is twofold and can be summarized as follows:
• The dataset generation approach based on transformations allows us to quickly generate datasets for GNN training that lead the models to a basic understanding of the universality problem, with solid generalization results when applied to larger automata generated using the same methodology. The drawbacks in classification accuracy on randomly generated automata can be mitigated by creating larger training datasets incorporating a mix of transformed automata and randomly generated (algorithmically classified) ones, which significantly improves classification accuracy.
• We showed that the RGCN architecture for classifying graphs can be applied to Büchi automaton structures, a special case of graphs. The experimental results show that the models, when trained on variously constructed datasets of NBW with the goal of learning about (non-)universality, behave as expected when different parameters are changed (both in the dataset construction and in the models themselves) and reach surprisingly strong classification accuracies on randomly generated Büchi automata when trained on a training set combining randomly generated automata with transformed automata.
The presented results and the preceding experiments also exposed a few limitations of the current approach:
• The computational cost associated with randomly generating labelled automata, i.e. our ER datasets, has led to a limitation wherein these datasets exclusively comprise automata with a maximum of 9 nodes. As a result, the findings based on these ER automata, utilized as ground truths, only permit conclusions in the context of samples containing relatively small automata.
• The strategy of generating data using transformations inherently introduces a bias into the generated data, which may lead to overfitting of the GNN to specific transformation patterns. Nevertheless, the classification accuracies remained solid when tested on randomly generated ER automata, indicating that the proposed transformations did not lead to strong overfitting.

B. FUTURE WORK
The results shown in this article are mainly a proof of concept, presenting GNN models trained on graph structures representing Büchi automata for a classification task that is known to be computationally expensive. With this work being, to our knowledge, the first in the literature to classify a given Büchi automaton with respect to universality using GNNs, there is a lot of potential for follow-up work to improve both the data generation and the training process. In this section, we propose several ideas to serve as starting points for future research endeavours.
Let us start with a few ideas regarding data generation:
• Improve the algorithmic classification of randomly generated automata to create a bigger dataset containing more possible NBW as a baseline test dataset.
• Analyse the effect on classification of manipulating the weights of the transformations, the impact of including the transformations that preserve only universality or only non-universality, or the removal of the pruning procedure.
• As more classification errors occur on universal automata in the presented results, experiment with training models on unbalanced datasets containing more universal automata in order to reach structural conclusions regarding universality.
The process of training GNN also presents different opportunities for further experimentation:
• Use easily checkable features of nodes (e.g. is contained in an SCC, has a self-loop, . . . ) to add more structural information to the base node features and facilitate the model's learning.
• Analyse in more detail the node features after the readout to learn from the information that the model is gathering during the message passing.
• Due to the rapid development of GNN architecture research, experiment with different models and compare results, e.g. transformers using self-attention.
The procedure for classifying NBW presented in this article could also be the subject of comparative research with results from formal-verification applications requiring a universality check. Generally speaking, we encourage subsequent experimentation on this topic, both with regard to the dataset generation and to the training and testing of the models, to enhance the performance of the models or the scope of the dataset generation.
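As an illustration of the structural node features suggested above (SCC membership, self-loops), a minimal sketch; the feature set and the simple quadratic reachability check are our own choices:

```python
def structural_node_features(states, edges):
    """Cheap structural features per node: whether the node has a
    self-loop, and whether it lies in a non-trivial SCC, i.e. can reach
    itself along at least one edge.  Illustrative feature set; the
    reachability check is quadratic for simplicity."""
    succ = {q: set() for q in states}
    for (src, dst) in edges:
        succ[src].add(dst)

    def reaches(src, target):
        # is target reachable from src via at least one edge?
        seen, stack = set(), [src]
        while stack:
            q = stack.pop()
            for q2 in succ[q]:
                if q2 == target:
                    return True
                if q2 not in seen:
                    seen.add(q2)
                    stack.append(q2)
        return False

    return {q: {"self_loop": q in succ[q], "in_scc": reaches(q, q)}
            for q in states}
```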

FIGURE 2. With Σ = {a, b}, A on the top and A_l = T_l(A, {q_1}) on the bottom.

FIGURE 3. With Σ = {a, b}, A on the left and A_x = T_x(A, {q_c, 2, δ+}) on the right.

FIGURE 7. Distribution charts of automaton sizes (per class: blue are non-universal, red are universal) for the 3 transformed datasets containing 2500 small, medium or large automata.

FIGURE 8. Merged dataset automaton size distribution (per class: blue are non-universal, red are universal).

FIGURE 9. Averaged accuracy comparison over 3 models, each for various learning rates.

FIGURE 10. Example of training set and validation set classification accuracy and loss function evolution during training over 1k TRFS or 10k TRFL automata.

TABLE 4. Average classification accuracies (in %, tested on the validation set) over 3 models with 4 layers over various TRF training sets and numbers of node features.

TABLE 5. Average classification accuracies (in %, tested on the validation set) over 3 models trained on two different architectures.

TABLE 8. Distribution (in %) of the misclassifications for various training sets when tested on 2500 ER automata.