SeqNAS: Neural Architecture Search for Event Sequence Classification

Neural Architecture Search (NAS) methods are widely used in various industries to obtain high-quality, task-specific solutions with minimal human intervention. Event Sequences (EvS) find widespread use in various industrial applications, including churn prediction, customer segmentation, fraud detection, and fault diagnosis, among others. Such data consist of categorical and real-valued components with irregular timestamps. Despite the usefulness of NAS methods, previous approaches have only been applied to other domains: images, texts, or time series. Our work addresses this limitation by introducing a novel NAS algorithm, SeqNAS, specifically designed for event sequence classification. We develop a simple yet expressive search space that leverages commonly used building blocks for event sequence classification, including multi-head self-attention, convolutions, and recurrent cells. To perform the search, we adopt sequential Bayesian Optimization and utilize previously trained models as an ensemble of teachers to augment knowledge distillation. As a result, we demonstrate that our method surpasses state-of-the-art NAS methods and popular architectures suitable for sequence classification and holds great potential for various industrial applications.


Introduction
Motivation. Event Sequences (EvS) with marker and timing information are very common in real-world applications such as medicine [1], biology [2], social media analysis [3], fault diagnosis [4,5], churn prediction [6,7], customer segmentation [8], fraud detection [9], and more. Consequently, there is a demand to model such data.

Figure 1: An example of the marked temporal event process. Event i occurs at time t_i and is characterized (marked) by the feature vector x_i.
In [5], the authors focus on predicting failures from telemetry data as an EvS classification task. They emphasize the significant benefits of predicting even a small fraction of these incidents, including improved availability, cost reduction, and avoidance of reactive maintenance.
Based on the results presented in [1], applications of Automated Machine Learning (AutoML) in healthcare require additional research and development. The utilization of automated methods in medicine holds great potential for substantially enhancing accuracy, with even minor improvements carrying significant weight in the healthcare sector.
In the field of biology, representing biological datasets as sequences is common. In [2], the authors develop an end-to-end automated machine learning tool specifically designed for explaining and designing biological sequences.
In [10], the authors explore a wide range of machine learning applications for enhancing efficiency in the financial sector. These technologies have the potential to automate processes, improve risk assessment and management, and enable more accurate and efficient decision-making. It is important to note that the financial sector, particularly banking, accumulates a significant amount of sequential data based on customer behavior and market events.
EvS classification methods are used in various fields; however, the successful utilization of machine learning still requires substantial effort from human experts, as no algorithm can achieve optimal performance on all possible problems.
Most machine learning research and applications have been centered around core domains: text, images, time series, and speech. However, EvS differ from these well-studied domains in several ways: (1) events are usually described by both categorical and numerical features; (2) events in an arbitrary sequence are not uniformly spaced in time (an example can be seen in Figure 1), whereas data in core domains are usually uniformly distributed spatially (image pixels) or temporally (speech signals); and (3) in core domains, elements that are close together in space or time share a context, and valuable information can be inferred from their neighbors, but this principle does not necessarily apply to EvS, since related events may not occur close to each other in time.
The properties mentioned above can vary significantly across different datasets. As a result, effectively modeling EvS requires the development of task-specific deep learning architectures. This process involves leveraging domain knowledge and can be labor-intensive due to the iterative nature of trial and error.
Our work aims to develop a NAS procedure specifically designed to effectively handle diverse event sequences.We refer to this approach as SeqNAS (Sequence NAS).
Our contributions and results:
• Performance. Our simple yet efficient method SeqNAS shows superior performance compared to existing NAS methods and popular architectures used for EvS classification.
• Search Space Design. We design a novel search space of ~5 × 10^6 possible architectures. The search space contains multi-head self-attention, convolutions, and recurrent cells and is tailored to event sequence data. To the best of our knowledge, we are the first to develop and analyze such a search space for event sequence datasets.
• Ensemble Of Teachers. Typically, intermediate (suboptimal) architectures obtained during architecture search are thrown away. We propose to utilize them as an ensemble of teachers for subsequent models via knowledge distillation [11].
• Benchmark Datasets. To advance the development of event sequence classification methods, we have initiated a benchmark for EvS classification by comparing various models and methods. Our study employs six event sequence datasets sourced from past online competitions or used by other authors. Our work is the first to compare EvS classification methods across such a diverse list of datasets. We make these pre-processed datasets openly available.
• NAS-Bench Event Sequences. We present a novel neural architecture dataset comprising 3200 trained architectures, each accompanied by its corresponding scores. This dataset can further facilitate the development of predictor-based NAS methods.
We provide the source code for the experiments conducted on publicly available datasets, as described in this paper, along with the datasets themselves.

Related work

Search Space
The performance of NAS algorithms heavily relies on the search space design, which should exhibit a reasonable degree of flexibility and accommodate established high-performing solutions. A well-designed search space can deliver a satisfactory outcome even with a random search strategy [12]. Therefore, search space design is a primary focus of our work.
Although event sequence data are very common in various applications, the majority of NAS algorithms are designed to solve image classification problems and only rarely extend their search procedures to other core domains.
Closest to our domain are works exploring text classification and multivariate time-series classification. To evaluate the performance of our procedure for EvS classification, we included methods from both domains in our benchmark and studied their transferability to EvS. These methods are discussed in Sections 2.2 and 2.3.

NAS for text classification
The most frequently used building blocks in modern deep neural network (DNN) architectures are: 1) Convolutional; 2) Recurrent; 3) Multi-Head Self-Attention; 4) Pooling Layers; and 5) Identity layers. AutoAttend [13] and TextNas [14] have both explored different methods of combining these building blocks as graph nodes to construct a search space.
In TextNas [14], each node is selected from the pool of blocks described above, and each incoming feature (edge) is selected from a pool of nodes from previous layers. This work is the closest one to ours.
AutoAttend [13] introduces a NAS procedure to search for attention representations. The main idea is that keys (K), values (V), and queries (Q) may originate as features from distinct layers, and that different searchable blocks are used to project each incoming feature. Then, incoming features are aggregated in an attention or an addition layer.
In NAS-Bench-NLP [15], the authors search for a Recurrent Neural Network (RNN) structure. The RNN cell is represented as a directed graph in which nodes correspond to specific operations and edges encode their inputs. The authors use four different types of operations in their study: linear layers, element-wise weighted summation, element-wise product, and various activation functions. The operations associated with the nodes and edges are selected during the search process.

DNN for Multivariate Time Series classification
Gated-Transformer-"GTN" [16] uses two attention blocks instead of one, one to model step-wise and another channel-wise correlation between components of Multivariate Time Series.The performance of this approach for EvS is evaluated in Section 4.3.
In ROCKET [17], the authors focus on Univariate Time Series (UTS) classification. ROCKET generates feature representations using several randomly initialized convolutional kernels and then trains a linear layer on top of these features. A comparison of SeqNAS and ROCKET can be found in Table 4.

Search Methods
The architecture of a deep neural network (DNN) can be modeled as a directed acyclic graph (DAG), where the nodes represent operations and the edges denote incoming or outgoing features. In this view, neural architecture search (NAS) is an algorithm for discovering a task-specific DAG.
There are various search strategies available, including Bayesian Optimization (BO), Evolutionary Methods, Reinforcement Learning, and Differentiable NAS (DNAS). These methods aim to find the best-suited architecture in the vast space of neural architectures with significantly fewer resources than an exhaustive search requires.
In ENAS [18], reinforcement learning is utilized to train a super-net, an over-parameterized architecture that allows efficient weight sharing among sub-models. This eliminates the need to train each candidate sub-model from scratch, resulting in a significant reduction in search time. DNAS [19,20,21,22,23,24] builds upon this idea by representing the over-parameterized super-net as a DAG and assigning differentiable importance weights to each edge. The highest edge values determine the selected sub-graph or path. This approach minimizes the need to evaluate multiple models, thereby accelerating the search process. Knowledge distillation without an ensemble of teachers was applied to NAS for image classification in [25].
However, [26] and [23] demonstrated that DNAS does not always select the optimal architecture and that the procedure requires various modifications to perform well. On the other hand, [15,27] have demonstrated that various BO-based methods perform effectively across varied search spaces, including text classification. One such method, highlighted by the authors, is BANANAS [28], which relies on BO and a neural Predictor-model, a model designed to predict architecture performance while bypassing the full train-and-validation cycle. The Predictor-model is trained on previously queried architectures to score new candidates before training them. Unlike DNAS, BANANAS requires training multiple models. However, we show that these previously trained models can be turned into a practical advantage, as discussed in Section 3.4.
To train a Predictor-model on a set of architecture-score tuples, it is necessary to have a procedure for architecture encoding. The study [29] proposes eight architecture encoding schemes categorized into two groups: adjacency matrix-based and path-based. The authors evaluate the performance of each encoding scheme for different NAS subroutines: Predictor-model training, architecture perturbation, and random architecture sampling. They show that no encoding scheme performs well across all subroutines, but path-based encodings outperform adjacency matrix-based ones on the task of training the Predictor-model.

NAS Benchmarks
The NAS-Bench series of benchmarks [15,30,31,32,33,34] have made significant contributions to the advancement of scientific research in neural architecture search (NAS). These benchmarks aim to establish a standardized measurement procedure and provide datasets for easy comparison and reproducibility in NAS research. They include datasets of trained architectures and their corresponding scores, along with detailed discussions of the characteristics and performance of various NAS algorithms. However, these benchmarks have not yet explored the domain of event sequences. In our work, we extend the NAS-Bench series with NAS-Bench Event Sequences, a dataset of architectures specifically designed to model EvS.

Temporal Point Processes modeling
Recently, different neural architectures and approaches have been used to model EvS as Temporal Point Processes (TPP) [3]. These data exhibit complex short-term and long-term temporal dependencies. Existing methods heavily rely on Recurrent Neural Networks (RNNs) due to the sequential nature of event sequences [35]. However, RNN units are not effective at capturing long-term dependencies. On the other hand, transformer- and convolution-based models are capable of handling long-term dependencies, but they assume a uniform temporal distribution. To address these challenges, the authors in [36] propose a transformer-based architecture that models the dynamics of temporal point processes using a continuous conditional event intensity function. Additionally, in [37], a long convolutional kernel whose weights are conditioned on the event sample intensity is proposed. This parameterization enables the handling of non-uniformly and irregularly sampled datasets.

Methodology

Search Space design
The general layout of all the blocks in our search space is illustrated in Figure 2. There are four main blocks in the search space:
• Stem (always present) has a searchable structure depicted in Figure 3.
• Encoder (optional) is searchable, with multiple layers whose structure is depicted in Figure 5.
• Decoder (optional) has a searchable number of layers.
• Head (always present) has a searchable structure depicted in Figure 4.
Here, the term optional denotes that the presence of a particular block is determined during the search procedure.
For instance, the minimal architecture would consist only of the Stem and Head blocks. We now describe each architecture block in more detail.

Stem
The Stem fuses categorical and numerical features from input data into one vector as depicted in Figure 3.
The Stem pipeline is threefold:
• Categorical features are encoded using an embedding layer, and the size of the embedding is automatically determined by the formula min(600, round(1.6 × N^0.56)), where N is the sequence length.
• Numerical features are processed using BatchNorm [38]; afterwards, a convolution with a searchable kernel size is applied to each numerical input along the temporal dimension, and optionally, dropout may be applied after the convolution.
• Finally, all embeddings of all types are concatenated along the feature dimension to obtain the input sequence.
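For concreteness, the embedding-size rule above can be written as a one-line helper. This is a minimal sketch; the function name is ours rather than taken from the released code.

```python
def embedding_dim(n: int) -> int:
    """Embedding size heuristic from the Stem description:
    min(600, round(1.6 * n ** 0.56)), with n as defined in the text."""
    return min(600, round(1.6 * n ** 0.56))

print(embedding_dim(200))  # -> 31
```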

Encoder
The Encoder brings the most variability into the search space. Common operations are available in the encoder: Multi-Head Self-Attention (MHA) [39], Gated Recurrent Unit (GRU) [40], and Convolution. Each of these operations entails different assumptions about the nature of the data: RNN units assume a sequential nature of the data, convolutional layers are effective at capturing temporally local correlations, and transformers excel at capturing long-term dependencies throughout the entire sequence. The Encoder has both a searchable number of layers and searchable operations within each layer. The input of each Encoder layer is divided into one to three blocks along the feature dimension, each of which can be processed using one of six potential operations, such as MHA, GRU, or Convolution. The number of heads in MHA is searchable and is chosen from the set {1, 2, 4, 8}. In total, this provides up to 19 variations for a single Encoder layer. It is worth noting that each layer has a distinct set of operations. The outputs from each block are concatenated and sent to the next layer within the Encoder. This structure is illustrated in Figure 5. The Encoder may have three possible values for the number of layers (1, 2, or 4), resulting in approximately 130 × 10^3 possible Encoder variations.
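A simplified PyTorch sketch of such a layer is given below: the input is split along the feature dimension and each chunk is routed through one sampled operation. This is an illustration under our own naming and default hyper-parameters, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Sketch of a searchable Encoder layer: the input is split along the
    feature dimension into 1-3 chunks, each processed by one sampled
    operation (MHA, GRU, or Conv); outputs are concatenated again."""

    def __init__(self, dims, ops, n_heads=4, kernel_size=3):
        super().__init__()
        assert len(dims) == len(ops) and 1 <= len(ops) <= 3
        self.dims = dims
        self.branches = nn.ModuleList()
        for d, op in zip(dims, ops):
            if op == "mha":    # number of heads is searched over {1, 2, 4, 8}
                self.branches.append(nn.MultiheadAttention(d, n_heads, batch_first=True))
            elif op == "gru":
                self.branches.append(nn.GRU(d, d, batch_first=True))
            elif op == "conv":  # length-preserving temporal convolution
                self.branches.append(nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2))

    def forward(self, x):  # x: (batch, time, features)
        outs = []
        for chunk, branch in zip(torch.split(x, self.dims, dim=-1), self.branches):
            if isinstance(branch, nn.MultiheadAttention):
                out, _ = branch(chunk, chunk, chunk)
            elif isinstance(branch, nn.GRU):
                out, _ = branch(chunk)
            else:
                out = branch(chunk.transpose(1, 2)).transpose(1, 2)
            outs.append(out)
        return torch.cat(outs, dim=-1)

layer = EncoderLayerSketch(dims=[32, 32, 64], ops=["mha", "gru", "conv"])
y = layer(torch.randn(8, 100, 128))  # -> (8, 100, 128)
```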

Temporal encoding
The traditional positional encoding uses the token position in the sequence to obtain the embedding [39]. However, in the case of non-uniformly spaced event sequences, this traditional positional embedding is inadequate for capturing the relative arrangement of events. Therefore, in line with the approach proposed in the Transformer Hawkes Process [36], we utilize temporal encoding to address this limitation.
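A minimal sketch of such a timestamp-driven sinusoidal encoding is shown below; we follow the standard sin/cos layout and base constant 10000, which is an assumption about the exact parameterization.

```python
import torch

def temporal_encoding(t: torch.Tensor, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding driven by event timestamps instead of integer
    positions, in the spirit of the Transformer Hawkes Process [36].
    t: (batch, seq_len) real-valued event times -> (batch, seq_len, d_model).
    Assumes an even d_model."""
    i = torch.arange(0, d_model, 2, dtype=torch.float32)   # even dimensions
    denom = torch.pow(10000.0, i / d_model)                # (d_model/2,)
    angles = t.unsqueeze(-1) / denom                       # (B, L, d_model/2)
    enc = torch.zeros(*t.shape, d_model)
    enc[..., 0::2] = torch.sin(angles)
    enc[..., 1::2] = torch.cos(angles)
    return enc
```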

Decoder
In the Decoder block, we utilize a standard transformer architecture with a searchable number of heads and layers, without the recurrent or convolutional layers used in the Encoder. We adopt the transformer variant proposed in [41] with Sub-LayerNorm and the same weight initialization.

Head
The Head aggregates the sequence of feature vectors to produce the final classification, as depicted in Figure 4. First, an optional spatial dropout operation is performed [42]. Next, the sequence of tokens is aggregated into a single feature vector; the aggregation is also a searchable operation, chosen from the maximum operation, averaging, or a combination of both. Finally, the feature vector is projected to obtain the final logits.
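A compact sketch of this block might look as follows; the module name, the dropout rate, and the use of nn.Dropout1d for spatial dropout are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class HeadSketch(nn.Module):
    """Sketch of the Head block: optional spatial dropout, then a searchable
    aggregation (max, mean, or their concatenation), then a linear projection."""

    def __init__(self, d_model, n_classes, agg="both", p_drop=0.3):
        super().__init__()
        self.agg = agg
        self.drop = nn.Dropout1d(p_drop)  # zeroes whole feature channels
        in_dim = 2 * d_model if agg == "both" else d_model
        self.proj = nn.Linear(in_dim, n_classes)

    def forward(self, x):  # x: (batch, time, features)
        x = self.drop(x.transpose(1, 2)).transpose(1, 2)
        pooled = []
        if self.agg in ("max", "both"):
            pooled.append(x.max(dim=1).values)
        if self.agg in ("mean", "both"):
            pooled.append(x.mean(dim=1))
        return self.proj(torch.cat(pooled, dim=-1))
```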

The Architecture Vectorizer
To extract the architecture features for a Predictor-model (the model designed to predict architecture performance), we focus on the group of path-based encoders described in [29]. According to [29], path-based encoders outperform adjacency matrix-based ones in the Predictor-model setting. Our encoding is done as follows. A binary variable is assigned to each block, each layer, and each particular operation in each layer. If a block or layer is not involved in the architecture, the corresponding binary variable, along with all variables responsible for the operations inside that block or layer, is set to zero. Finally, all variables are concatenated to obtain the feature vector. For simplicity, we call our architecture encoding scheme AVec.
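A toy sketch of this binary encoding is shown below; the data structures (`arch`, `search_space`) are our own simplification of the real search-space description.

```python
def avec(arch, search_space):
    """Toy sketch of AVec: one binary flag per block, per layer, and per
    operation; if a block or layer is absent, its flag and all flags for
    operations inside it are zero. `arch` maps block -> list of per-layer
    operation sets; `search_space` fixes the flag layout."""
    vec = []
    for block, layers in search_space.items():
        chosen_layers = arch.get(block)          # None if block is absent
        vec.append(int(chosen_layers is not None))
        for i, ops in enumerate(layers):
            active = chosen_layers is not None and i < len(chosen_layers)
            vec.append(int(active))
            vec.extend(int(active and op in chosen_layers[i]) for op in ops)
    return vec

space = {"encoder": [["mha", "gru", "conv"]] * 2, "decoder": [["mha"]]}
arch = {"encoder": [{"mha", "conv"}]}            # one encoder layer, no decoder
print(avec(arch, space))  # [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```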

The search procedure
Following the reasoning presented in Section 2.4, we have opted for Bayesian Optimization similar to [28]. However, instead of an ensemble of DNNs, we employ CatBoost [43] to obtain predictions and corresponding uncertainty estimates. This alteration from [28] enables us to leverage the benefit that CatBoost offers over an ensemble of DNNs: more precise uncertainty estimation following a theoretically justified approach [43]. We analyze this choice in Section 4.4.2. The search procedure is outlined in Algorithm 1. There are three main components of the search process: 1) architecture vectorization with AVec; 2) score prediction and uncertainty estimation with the CatBoost Predictor-model; 3) candidate selection via Thompson sampling.
Initially, a set of N_init architectures is randomly sampled from the search space A. After training all of them, actual performance scores are obtained for each architecture. Next, the Predictor-model is trained on the architecture features and actual scores, where the architecture features are obtained with our architecture vectorizer AVec. Then, the trained Predictor-model is used to estimate scores and associated uncertainties for N_iter newly sampled random architectures. It is crucial to balance the exploration-exploitation trade-off during the search process. To achieve this balance, we use Thompson sampling over the estimated scores and uncertainties. L_candidates architectures are then selected for further training. These steps are repeated until the allocated budget is exhausted, as described in Algorithm 1.
For further technical details on each parameter, Section A provides a detailed explanation.
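The following sketch illustrates one candidate-selection step. Note that the paper relies on CatBoost's built-in uncertainty estimates; to keep the sketch self-contained, we approximate uncertainty with a seed ensemble of CatBoost regressors, which is a stand-in rather than the exact method used in SeqNAS.

```python
import numpy as np
from catboost import CatBoostRegressor

def thompson_select(train_feats, train_scores, cand_feats, n_select, n_models=5):
    """One search iteration, sketched: fit predictor(s) on already-trained
    architectures (AVec features -> validation scores), then pick candidates
    by Thompson sampling. Assumes higher scores are better."""
    preds = []
    for seed in range(n_models):
        model = CatBoostRegressor(iterations=300, random_seed=seed, verbose=False)
        model.fit(train_feats, train_scores)
        preds.append(model.predict(cand_feats))
    mean, std = np.mean(preds, axis=0), np.std(preds, axis=0)
    # Thompson sampling: draw one plausible score per candidate, keep the best.
    draws = np.random.normal(mean, std)
    return np.argsort(draws)[-n_select:]  # indices of candidates to train next
```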

Ensemble Of Teachers
Our search procedure involves training many models. By combining these models into an ensemble of teachers, we are able to leverage the benefits of this approach. In [44], the authors demonstrated that a weak teacher ensemble can lead to improved student performance. This observation allows us to construct an ensemble of the best-performing models at the current search step. We use the average of the different models' predictions as the teacher. Before enabling the distillation loss, we train a total of 30 architectures. At each search iteration, all models' predictions for all training examples are cached to avoid redundant computation. At each new iteration, we update the members of the ensemble by selecting the topK best-performing models from our cached predictions. We use ensembles of three models in most of our experiments.
To compute the model-to-model loss, we opted for Mean Squared Error (MSE) instead of Kullback-Leibler divergence. This choice allows us to avoid the additional temperature hyper-parameter [45].
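A minimal sketch of the resulting training loss is given below; the mixing weight `alpha` and the exact way the model-to-model term is combined with the task loss are our assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, targets, alpha=0.5):
    """Sketch of training with an ensemble of teachers: cross-entropy on the
    targets plus an MSE model-to-model term against the averaged (cached)
    teacher predictions; MSE avoids a temperature hyper-parameter."""
    ce = F.cross_entropy(student_logits, targets)
    teacher = torch.stack(teacher_logits_list).mean(dim=0)  # ensemble average
    kd = F.mse_loss(student_logits, teacher.detach())
    return alpha * ce + (1 - alpha) * kd
```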

Datasets
We utilize six publicly available datasets consisting of event sequences sourced from different data science competitions and prior studies. These sequential datasets were carefully selected to include both categorical and real-valued features. In each dataset, a sequence of events is provided as input to predict a categorical target, making it a classification task. Detailed statistics and the prediction target for each dataset can be found in Table 1.
Algorithm 1. Notation: Predictor-model is the score predictor with parameters θ; N_init is the initial number of architectures to train; N_iter is the number of architectures to sample during each iteration; L_candidates is the number of architectures to train during each iteration; Ensemble is the ensembling function; M is the number of iterations; X and S are the predicted scores and corresponding uncertainties.
1: K_1 ← Sample(N_init): sample random architectures from the search space.
2: TrainedArches ← Train(K_1): train all architectures in K_1 and obtain their scores; X is the set of scores from all trained models, and X_i are the scores for the current iteration.
3: ArchFeatures ← AVec(TrainedArches): encode architectures into features.
4: for i = 1 to M do
5: K_{1+i} ← Sample(N_iter): sample random architectures from the search space such that K_{1+i} ∩ TrainedArches = ∅.
6: Estimate scores X and uncertainties S for K_{1+i} with Predictor-model(AVec(K_{1+i}); θ).
7: Select L_candidates architectures from K_{1+i} with Thompson sampling using the obtained uncertainties S.
8: T ← topK(TrainedArches): select the best-performing teacher models from the already trained models and obtain an ensemble of teachers Ensemble(T).
9: Train all models in L_candidates with the distillation loss and Ensemble(T) and obtain actual scores X_i.
10: Add the newly trained architectures to TrainedArches and update ArchFeatures and X accordingly.
11: Update the score predictor: θ ← argmin_θ L(θ), where L(θ) = MAE(Predictor-model(ArchFeatures; θ), X).
12: end for
13: Select the best architecture from TrainedArches according to the chosen performance metric.
Bank transaction data: The VBank, AmEx, AGE, RBchurn, and ABank datasets consist of card transactions, financial records, and other user-related data. Leveraging these datasets, we utilize event sequences to predict specific targets such as default events, churn, higher education, and age. Each transaction is mainly characterized by its date, type, amount, and Merchant Category Code.
Taobao: The Taobao dataset is a subset of the Taobao APP user behavior data, comprising millions of items recorded over one month. The dataset is organized in a user-item interaction format, consisting of user ID, item ID, category ID, behavior type, and timestamp.
To suit the context of our task, we preprocess the dataset by excluding the item ID for simplicity. Additionally, we merge all categories that appear fewer than 500 times in the dataset into a single category. This preprocessing step reduces the number of unique categories from 8,900 to 1,900. We focus on the client's behavior within a 7-day window to predict whether they will make a payment in the following 7 days.
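This rare-category merging step can be sketched as follows; the column name and the sentinel bucket id are ours, and the sketch assumes numeric category IDs.

```python
import pandas as pd

def merge_rare_categories(df: pd.DataFrame, col: str, min_count: int = 500):
    """Sketch of the Taobao preprocessing step: categories that appear fewer
    than `min_count` times are collapsed into a single bucket (-1 here)."""
    counts = df[col].value_counts()
    rare = counts[counts < min_count].index
    df[col] = df[col].where(~df[col].isin(rare), other=-1)
    return df
```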
No manual feature generation or preprocessing was conducted on most of the datasets, except for the Taobao and AmEx datasets. In the case of the AmEx dataset, we utilized a cleaner version obtained from the Kaggle competition platform.
To create the train, test, and validation sets, we performed a random split for each dataset. The split ratios were set to 0.6 for the training set, 0.2 for the test set, and 0.2 for the validation set, based on the total sample size. Sequences shorter than specified in Table 1 were padded with zeros; for longer sequences, we took the N last events, where N is the specified sequence length.
Table 2: Comparison of our method with two NAS procedures, 1) AutoAttend [13] and 2) TextNAS [14], and four fixed architectures: 3) Gated Transformer Networks [16] and baseline models such as 4) a fixed Transformer, 5) GRU, and 6) LSTM. We report the MEAN and STD of the 3 best models found, for both HPO and NAS procedures. The first- and second-best performing models are marked.
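The length normalization described above can be sketched as follows; the choice of left-padding is our assumption, since the text only specifies zero padding.

```python
import numpy as np

def pad_or_truncate(seq: np.ndarray, n: int) -> np.ndarray:
    """Keep the n last events of longer sequences; zero-pad shorter ones."""
    if len(seq) >= n:
        return seq[-n:]
    pad = np.zeros((n - len(seq),) + seq.shape[1:], dtype=seq.dtype)
    return np.concatenate([pad, seq], axis=0)
```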

Methods
We compared our results with two NAS approaches, namely AutoAttend [13] and TextNAS [14]. However, we had to make some modifications to adapt these methods for the EvS domain. These modifications are described in Appendix A. Furthermore, we also compared our approach with fixed architectures such as GRU, LSTM, and Transformer. To ensure optimal performance, we performed Hyper-Parameter Optimization (HPO) for all fixed architectures. The details of the HPO can be found in Appendix A.

Results
Our main results are presented in Table 2. SeqNAS outperforms all other methods. Second place is shared by the fixed architectures Transformer and GRU. We further analyze the importance of MHA and RNN as building blocks of our search space in Section 4.4.3 and show that all blocks are complementary to each other.
Our experiments show that search spaces from related domains, such as text, do not always transfer well to EvS and sometimes underperform even simple models such as RNNs.
A potential disadvantage of SeqNAS is its longer training time compared to other approaches, as shown in Table 5. However, as discussed in Section 1, this is still a reasonable time complexity given the significant gains for various applications.

Ensemble Of Teachers
In Figures 6 and 7, we demonstrate the results of the search procedure with and without Knowledge Distillation (KD) on two datasets, AmEx and RBchurn, respectively. We show the average scores of the top 3 models over the search steps. It can be seen that KD significantly improves the performance metrics for both datasets. The same observations hold for the other datasets in Table 3, where we report the final metrics for the single best architecture found using a random search procedure.
We experimented with different approaches to (1) select diverse teachers and (2) combine their predictions into an ensemble. We found that averaging the predictions of the 3 best-performing models resulted in the best performance.
During training, we had access to hundreds of trained models, and we believe there are further opportunities for improvement and potential applications in this direction.

CatBoost vs. DNN predictor
We evaluate two types of models for architecture scoring and uncertainty estimation (the Predictor-model) in Figure 8.
The results demonstrate that the model based on CatBoost slightly outperforms the one based on an ensemble of DNNs. We used an ensemble of eight models.
To better understand the roles of MHA, GRU, and Convolutions in the encoder layers, we conducted a search procedure where we removed one of these blocks at a time. As shown in Figure 9, models without the GRU block exhibited a significant drop in performance compared to those with all three blocks present. Unsurprisingly, these results can be explained by the good standalone performance of the GRU model presented in Table 2. Nonetheless, it is worth noting that the relative importance of each block varies depending on the dataset used.

Random search vs SeqNAS
In Figures 10 and 7, we compare SeqNAS and the Random search procedure on two datasets, AmEx and RBchurn, respectively. In our settings, we first train 100 randomly sampled architectures and then fit the Predictor-model to score new candidates. It can be seen that SeqNAS starts to outperform Random search approximately after 100 steps for both datasets.

NAS-Bench Event Sequences
Analogous to the precomputed NAS benchmarks [15,31,32], we release a dataset of trained and evaluated architectures. The details can be found in Appendix C.

Discussion
• The search space of TextNAS contains the same operations as SeqNAS. However, TextNAS differs from SeqNAS in terms of graph topology and search procedure. TextNAS performs worse than both SeqNAS and fixed architectures with HPO, raising questions about the relative impact of the search space design versus the search method, such as ENAS.
• It is important to note that SeqNAS does not outperform ROCKET for univariate time series classification. Many recent methods that perform well on UCR datasets utilize specific convolutional operations or fixed feature generation. Thus, our search space could benefit from incorporating new searchable operations proposed in various works for UTS classification.
• There are potential improvements that can be made based on TPP modeling works, such as better estimation of event densities [36] or the use of temporally parameterized long convolutions [37].
• While our results highlight the importance of the GRU unit in EvS modeling, this observation may be specific to our datasets. It is possible that with larger datasets, Transformer-based blocks may offer better performance.
Different training strategies, like applying causal masks, may also help improve Transformer performance.
• Currently, our method may produce over-parameterized models without considering hardware constraints. Further work could develop a more computationally efficient approach.
• The literature currently lacks a NAS procedure that features a broad search space suitable for a wide variety of tasks, referred to as a Universal Search Space. Moreover, we see that the generalization ability of existing search methods is limited even across similar domains.

Conclusions
In this paper, we introduce SeqNAS, a novel method for automatically searching neural architectures specifically designed for EvS data. Our approach outperforms other NAS methods and standard architectures with hyper-parameter optimization in the EvS domain. We demonstrate the versatility of our method by applying it to various datasets. To the best of our knowledge, our work represents the first extensive exploration of NAS for EvS.
We show that in our search space, different types of operations complement each other, leading to the discovery of improved architectures. No architecture performs best when any one of the operations (MHA, RNN unit, or convolution) is excluded.
Our approach combines knowledge distillation with sequential Bayesian Optimization to achieve significant performance improvements in a computationally efficient way.
Additionally, we establish a benchmark for EvS classification by comparing different models and techniques. This benchmark can serve as a valuable resource for researchers looking to advance the field of EvS classification.
We release the NAS-BENCH Event Sequences dataset, which includes architectures and corresponding scores, to support research on predictor-based NAS methods.

A Technical details
In SeqNAS, for all of our datasets, we used the following hyper-parameters: N_init = 100, N_iter = 100, M = 40, and L_candidates = 15. For more details regarding each dataset, please refer to our repository.
To evaluate various NAS methods, including TextNas, AutoAttend, and GTN, we used the hyper-parameters and search procedures outlined in their original papers.
For TextNas and AutoAttend, we incorporated our Stem blocks to combine real-valued and categorical features. Similarly, for GTN, we added embedding layers for categorical features but chose not to utilize convolutional layers for real-valued features, as they were not used in the original paper.
For the fixed Transformer architecture, we used a simple model with two MHA layers in both the Encoder and the Decoder, with 8 heads each. The LSTM and GRU models consisted of only one RNN layer, with Stem and Head blocks. Fixed models were tuned using hyper-parameter optimization.
For all models, we used the identical fixed structure for the Stem and Head blocks described in our architecture. In the Stem block, we did not use dropout and set the fixed convolutional layer kernel size to 3 × 3. For the Head, we fixed the spatial dropout rate at 0.3 and used Max pooling.

B Additional experiments with time-series classification
Additionally, we assessed the performance of SeqNAS on UTS (Univariate Time Series) classification using two datasets from the UCR archive [48]: InsectSound and ElectricDevices. Comprehensive descriptions of these datasets are publicly available through the UCR archive. Our results, displayed in Table 4, demonstrate that SeqNAS produces reasonable results for UTS classification due to its flexible search space. However, it should be noted that SeqNAS's performance suffers on small-size datasets from the UCR archive.

C NAS-Bench Event Sequences
Our dataset consists of 3200 architectures obtained on six different datasets. Out of the total architecture pool, 800 architectures were randomly queried from our search space, while 2400 architectures were queried using our search procedure. The distribution of these architectures across datasets and methods can be found in Table 6. For each architecture, we provide the best score achieved across all epochs, along with its feature vector encoded using AVec. The corresponding metric for each score is listed in Table 2. All architectures within a dataset were trained for an equal number of epochs.

Figure 2: The general layout of our search space. Dotted borders indicate that blocks contain searchable operations. Dashed lines indicate that connections between nodes are searchable. The solid line is an example of a selected architecture.

Figure 3: The searchable part of the Stem block is depicted with dashed and dotted lines. Convolutional layers with different kernels and the presence of dropout are selected at each search step. A solid line is an example of a selected path.

Figure 4: There are two searchable pooling layers in the Head block: Max pooling and Average pooling. The type of pooling layer and the presence or absence of spatial dropout are determined by the search procedure. A solid line is an example of a selected path.

Figure 5: Encoder layer with searchable MHA, GRU, and conv operations. A combination of one, two, or three operations can be selected during each search step. Different combinations are selected on different layers. Incoming features are divided among the selected operations. An example combination with MHA, GRU, and conv operations is depicted with solid lines, and an example combination with MHA and conv operations is depicted with dashed lines. The dotted border around MHA indicates that it has a searchable number of heads.

Table 1: Statistics of the sequential datasets used for our analysis.

Table 3: Effect of knowledge distillation on the search score using random search.

Figure 6: Search performance of SeqNAS with and without KD on the AmEx dataset over the number of trained architectures, for 200 architectures in total. Results are averaged over the 3 best-performing models as a sliding window. Performance is measured with the metric specified in Table 2 for the AmEx dataset. KD is employed after 30 models were trained. The lines start to diverge approximately after the 60th iteration.

Figure 7: Comparison of SeqNAS, Random search with KD, and Random search without KD on the RBchurn dataset over the number of trained architectures, for 400 architectures in total. Results are averaged over the 3 best-performing models as a sliding window. Performance is measured with the metric specified in Table 2 for the RBchurn dataset. KD is employed after 30 models were trained.

Figure 8: Comparison of different predictors on the AmEx dataset: CatBoost [46] and an ensemble of DNNs originally proposed in BANANAS [28]. Results are averaged over the 3 best-performing models as a sliding window. Performance is measured with the metric specified in Table 2 for the AmEx dataset.

Figure 9: Search performance of SeqNAS without different blocks in the Encoder layer and with all blocks included (ALL) on the AmEx dataset. Results are averaged over the 3 best-performing models as a sliding window. Performance is measured with the metric specified in Table 2 for the AmEx dataset. We see that different types of operations complement each other.

Figure 10: Search performance of SeqNAS and Random search with KD. Results are averaged over the 3 best-performing models as a sliding window. Performance is measured with the metric specified in Table 2 for the ABank dataset. Both SeqNAS and Random search use the ensemble of teachers.

Table 4: Comparison of UTS classification on datasets from the UCR archive with ROCKET; ROC-AUC is computed for both datasets.

Table 5: Approximate search cost in GPU hours for SeqNAS, TextNAS, and GRU with HPO. SeqNAS used 400 iterations. TextNAS uses ENAS for architecture search, which is significantly faster. For HPO details, we refer to Appendix A.

Table 6: The distribution of architectures among the various datasets and methods. Architectures found by our final search procedure are listed alongside randomly queried ones.