NeSyChair: Automatic Conference Scheduling Combining Neuro-Symbolic Representations and Constrained Clustering

Creating the schedule for an academic conference is a time-consuming task. A typical conference schedule consists of sessions containing papers that address the same research topic. To construct a schedule, the conference papers must be grouped according to their research topics, and the obtained groups should fit the assigned time slots. This paper proposes an approach to automating the schedule-creation process. We use multilingual, neuro-symbolic paper representations and a novel constrained-clustering algorithm to group papers into topically coherent clusters of predetermined sizes that fit the schedule structure. In the process, we combine machine learning, natural language processing, network analysis, and combinatorial optimization. We tested the components of the proposed approach on a newly created database of papers from six machine-learning conferences, manually labeled with their research topics. The entire system was tested on two real-world conferences in a multilingual setting. The developed methodology is incorporated into NeSyChair (Neuro-Symbolic Conference Chair), an interactive automatic conference-scheduling system that can be used to create and improve conference schedules.


I. INTRODUCTION
Organizing a scientific conference is a time-consuming task, so any automation would be welcomed by conference organizers. While several systems assist organizers in managing the review process, the scheduling of paper presentations remains largely a manual task. A typical conference schedule consists of multiple plenary and parallel sessions, each comprising paper presentations covering the same or similar research topics. To construct a conference schedule, papers are usually first grouped according to their topics and then assigned to the available time slots. Doing this manually is time-consuming, as large conferences feature presentations of hundreds of papers that address many research topics, while authors, organizers, and venues can impose additional constraints.
Along with the papers themselves, which consist of text and metadata (e.g., authors and keywords), conference organizers might have access to additional metadata, often available in the form of networks. Examples of such data are citation networks, where two papers are connected if one of them cites the other, or co-bidding networks, where two papers are connected if the same reviewer bids to review both. This additional information can be useful for finding papers with similar research topics, provided that it is first converted into a vector form suitable for automatic processing by machine-learning (ML) algorithms, such as clustering.
We propose an automatic approach for the assignment of papers into conference slots using ML, natural language processing (NLP), network analysis, and combinatorial optimization. Our goal is to construct a schedule where the papers from the same research topic are not presented at the same time, enabling participants to attend the presentations of all the papers from their research field. To achieve this goal, we test several semantic-similarity techniques to identify the papers with similar research topics and propose a novel constrained clustering algorithm that assigns papers to a predefined schedule structure so that each time slot contains similar papers.
While several conferences require the authors to specify the research field during submission, these fields are often imprecise and do not accurately reflect all the topics, especially in large research fields with many areas of research (e.g., deep learning). Therefore, an automatic neuro-symbolic approach, combining numerical features obtained by the neural embedding of paper texts with symbolic features obtained from paper meta-descriptors and the accompanying metadata, might be better at grouping papers into relevant topics. Our approach first extracts the relevant information from the papers and the accompanying metadata, present in the form of a graph. We use neural embeddings to transform a paper's text into a numerical vector format and extract additional symbolic features describing the useful properties of the paper's content. Network-based metadata are also represented as numerical vectors using network-analysis methods. Text- and network-based data are then fused into a joint representation, which is used to find similar papers with the proposed constrained-clustering algorithm.
To detect similar papers, we tested several modern neural word-, sentence-, and document-embedding approaches. Embeddings transform text into numerical vectors that capture its semantic information. Recent word-embedding approaches, e.g., ELMo [1] and BERT [2], are known to perform well in a variety of NLP tasks. As these embeddings are computed on large unlabeled text corpora, they are suitable for our task, where large labeled datasets do not exist. In our experiments we use three embedding approaches, i.e., doc2vec [3], BERT [4], and the Universal Sentence Encoder [5], to obtain embeddings for either entire papers or n-character-long chunks. The proposed representation is cross-lingual and can work in both monolingual and multilingual conference scenarios. We show that the proposed embeddings-based approach is suitable for clustering and classifying papers according to their topics.
The embedded papers are assigned to conference slots using a novel variant of the k-means clustering algorithm that uses combinatorial optimization to ensure the resulting clusters fit the schedule structure and minimize the overlap of paper topics. We evaluate several variations of the clustering algorithm on a synthetic dataset and show that it is capable of generating meaningful clusters under various constraints in the conference schedule. We evaluate the final NeSyChair (Neuro-Symbolic Conference Chair) system on a novel, manually labeled dataset consisting of papers from several machine-learning conferences, and on another, multilingual conference from the area of natural language processing. To the best of our knowledge, no system comparable to NeSyChair currently exists.
The contributions of this work are as follows.
1) We developed a unified, multilingual, end-to-end approach to automated conference scheduling using a combination of ML, NLP, and network analysis.
2) We developed an approach that uses combinatorial-optimization methods to generate clusters of similar papers, followed by a specialized optimization algorithm that applies size constraints to the produced clusters to ensure that the returned sessions match the structure of the conference.
3) We made available a manually topic-labeled English dataset containing papers from several ML conferences, data from a large, real-world ML conference, and a similar smaller dataset in Slovene (a low-resource language).
4) We evaluated our system on a synthetic dataset as well as on two real-world scenarios and show that our approach works for conferences in languages other than English.
The paper is structured into four subsequent sections. The related work is presented in Section II, followed by the proposed methodology for automatic conference scheduling in Section III. In Section IV, we present the evaluation setting, the analysis of the proposed components, and the evaluation of the entire system. Section V draws the conclusions and presents ideas for future work.

II. RELATED WORK
There are several conference-management systems that assist conference organizers in organizing and managing conferences. These include EasyChair [6], OpenConf [7], the Microsoft Conference Management Toolkit [8], the IAPR Conference Management System [9], and EDAS: Editor's Assistant [10]. Such systems assist organizers in the paper submission and reviewing process, but, in contrast to our system, they do not facilitate automatic construction of conference schedules.
In this section, we briefly review the three main components of the proposed conference-scheduling approach: paper similarity measures, graph-based features for paper similarity, and constrained-clustering methods that can be used to assign papers to a conference schedule.

A. PAPER SIMILARITY
In the area of information retrieval, text similarity has been researched extensively [11]. A classic approach is to use a sparse text representation, such as bag-of-words [12], for a given collection of documents and to apply a similarity measure, e.g., cosine similarity, to compute pairwise document similarities. Hurtado et al. [13] use language models generated on abstracts to find similar papers. Besides the full text of abstracts and papers, some authors use additional metadata, such as keywords, to improve the results [14].
Another approach, well-suited to determine the similarity of research papers, is based on terminology extraction. Research papers contain a large amount of scientific terminology specific to their field. Since the terminology is closely linked to the papers' topics, focusing on the terminology instead of words from the general vocabulary can be useful. Milios et al. [15] show that using automatically extracted terminology works well for finding similar papers. Jiang et al. [16] describe an approach for terminology extraction from research papers using keywords and title words as a basis for the terminology and extend their approach by finding similar words. In the work described in this paper, we use the extracted terminology in combination with other features and improve the extraction process with rule-based filters that remove terminology that is either too specific or too general to identify the papers' topics.
Recently, approaches based on dense word embeddings have prevailed. Embeddings such as the continuous bag-of-words (CBOW) and skip-gram models implemented in the word2vec tool [17] are capable of extracting semantic information from words and mapping words with similar meanings to similar vectors, which can help in determining text similarity. More recently, contextual word embeddings, such as BERT [2], ELMo [1], and other models based on the transformer neural architecture [18], were shown to achieve state-of-the-art results on a variety of NLP tasks. However, most of these approaches require large training datasets to achieve competitive performance. Despite this, Turc et al. [4] show that a model pre-trained on a large unlabeled dataset can achieve good results on a variety of tasks, even when using a small amount of task-specific labeled data. Nevertheless, such approaches are designed to produce embeddings for smaller units of text, such as words or sentences, which makes them ill-suited for computing the similarity between entire documents.
Several authors explored ways of obtaining embeddings for larger sequences of text, such as sentences or entire documents. Sitikhu et al. [19] present a variety of approaches for embedding larger units of text. Such embeddings can be directly used for detecting similar papers. In our work, we show that both word embeddings (BERT, word2vec) and document embeddings (doc2vec [3], universal sentence encoder [5]) can be used to find similar papers.
An additional benefit of text embeddings is their ability to handle text in multiple languages using cross-lingual text representations. Lample et al. [20] show that it is possible to align text embeddings from multiple languages into the same vector space so that words with similar meanings from multiple languages are grouped together. This allows downstream approaches to handle multiple languages without the use of an explicit translation. We exploit this property to build and test a system that can construct schedules in multiple languages.

B. NETWORK ANALYSIS FOR PAPER SIMILARITY
Frequently, pairwise similarities of documents, computed, e.g., by the cosine similarity measure, are used to construct a network of similar papers, followed by their analysis using network analytic techniques. In citation networks, used by Price et al. [21], the nodes are individual papers, and each paper is connected to every paper it cites. In networks of bibliographic coupling [22], two papers are connected if they cite a common paper. In networks of co-citations [23], two papers are connected if they are both cited by the same paper. Giles et al. [24] present an algorithm called CCIDF (Common Citation × Inverse Document Frequency), which improves citation networks by weighting citations with the inverse frequency of citations in the entire database. In addition to networks constructed from citations, other networks can be used to find similar papers. Hamasaki et al. [25] show that interpersonal networks can be useful for this task.
For analyzing conference papers in the context of constructing a conference schedule, citation-network approaches cannot be used, as the papers to be presented at a conference have usually not been publicly available before the event and are therefore unlikely to cite each other directly. Nevertheless, several works [26,27,28] show that social information expressed by conference participants can be used. In contrast to our approach, these works use the networks independently and do not explore how they can be combined with features extracted from the text to better find similar papers.
In our previous work [29] we used a network constructed from paper-bidding preferences expressed by the reviewers in the paper-bidding phase, where each reviewer marks the papers that he/she would like to review. In the resulting network, two papers are connected if the same reviewer expressed a preference to review them. We presented an automatic conference scheduler using this network in combination with the TF-IDF (term frequency-inverse document frequency) weighted vectors of paper abstracts. In this paper, we use several additional NLP methods such as terminology extraction and neural text embeddings to extract useful information from the text and additional graph-based data such as bibliographic coupling networks. We transform graph-based data into the vector form using both Personalized PageRank [30] and node2vec [31] algorithms and compare the results. Furthermore, we select the best features using feature-subset selection methods and extensively evaluate the individual components, as well as the final automatic scheduling algorithm as opposed to relying on expert evaluation and silhouette score used in our initial work [29].

C. DOCUMENT CLUSTERING
Several authors have shown that standard document-clustering methods, such as k-means [32] and mean shift [33], can be improved with feature-extraction approaches that select the optimal features to be used during clustering. Abualigah [34] shows that feature selection with a modified Krill Herd algorithm can improve k-means document clustering. Abualigah and Khader [35] and Abualigah et al. [36] present a similar approach using particle-swarm optimization. The authors show that selecting a limited subset of features can improve the results compared to using all the available features; we show that the same applies in our case. However, the above approaches do not take into account limitations on cluster sizes. When clustering articles to generate a conference schedule, the cluster sizes have to match the sizes of the plenary and parallel sessions at the conference. Several authors propose constrained-clustering methods for conference-schedule generation that aim to solve this issue. Vallejo et al. [37] present two algorithms, Clustering Algorithm with Size Constraints and Linear Programming (CSCLP) and K-MedoidsSC, which produce clusters that match a predefined conference schedule. CSCLP treats clustering as a global optimization problem, while our approach uses local optimization to ensure good clustering while satisfying the size constraints. K-MedoidsSC starts with a clustering returned by the K-Medoids algorithm and then rearranges the clusters to satisfy the size constraints; in our approach, we start with the results of the k-means algorithm, modified to satisfy the size constraints, and use local optimization to improve the results after the size constraints have been satisfied. Kalmukov et al. [38] present a similar approach using hierarchical clustering, which does not produce clusters matching the conference schedule, but is specifically designed to cluster conference papers. Kudo et al. [39] address the problem of additional constraints that can be present when constructing conference schedules, e.g., that two paper presentations given by the same author cannot occur at the same time.
To the best of our knowledge, no existing work presents a complete system for automating conference scheduling. We present an end-to-end approach that combines feature extraction with a constrained-clustering algorithm to generate a complete conference schedule. The system is available as a web application. We present a comprehensive evaluation including synthetic and real-world datasets, real-world conference schedules, and multiple languages.

III. PROPOSED AUTOMATIC CONFERENCE SCHEDULING
A conference program usually consists of workshops, tutorials, invited talks, and conference tracks. We focus on automatically scheduling the conference tracks, where the accepted research papers are presented. The presentations are grouped into sessions, where each session contains presentations on the same or similar topics. In larger conferences, multiple sessions run simultaneously and cover different topics to allow attendees to choose a topic of interest. An entire schedule consists of multiple sessions, usually spread across several days. In our approach we assume that the number of sessions, their duration, and their distribution in the schedule are defined in advance. The schedules depend on external factors, such as the number of rooms available at a venue. Our software supports creating a new schedule or reusing previously defined schedules from the same conference series. The automatic assignment of papers to sessions has to take into account that the number of papers per topic varies and might not correspond well to the available sessions. Additionally, many papers address more than one research topic.
In this section we present our approach to the automatic construction of conference schedules. In Subsection A we describe methods to find similar papers: classic approaches based on the extraction of textual features, feature weighting, and terminology extraction, as well as neural text embeddings at the word, sentence, and document levels. In Subsection B we describe how we extract information from metadata, such as citation and co-bidding networks, using the Personalized PageRank and node2vec vectorizations. In Subsection C we describe a novel, k-means-based clustering algorithm that ensures that the returned clusters fit the predefined schedule structure and maximizes the distance between parallel sessions. Our approach takes into account that papers might belong to more than one topic and minimizes the misplacement error using a local search.

A. PAPER-SIMILARITY APPROACHES AND TEXT EMBEDDINGS
The classic approaches to text similarity describe each text with a variety of features. In our work we extract features from two types of data sources: the text content of the papers and the networks constructed from the metadata, such as citations. Fig. 1 gives an overview of the process of obtaining vector representations of papers. Although modern, neural-network-based text representations are significantly more successful at the word and sentence levels, classic document-level approaches are still competitive [40]. We first briefly outline the bag-of-words (BoW) representation with TF-IDF weights, followed by terminology-based approaches that capture highly relevant information for domain-specific documents such as scientific papers.

FIGURE 1. Overview of our NeSyChair automatic conference-scheduling system, consisting of text and graph representation construction, feature selection, and constrained clustering. The four main components of the system are indicated with different colors. We extract features from the text and network metadata, concatenating them into a single vector. Most of the features in the vector are text-based, as shown in Table 4. The final set of features is selected using feature-selection techniques.

Bag-of-words and TF-IDF Weights.
A standard approach to presenting text-based information in a sparse vector form is to use the BoW document representation with TF-IDF weights. We first construct a BoW vector for each document, with a dimension equal to the vocabulary size (i.e., the vector is sparse). Each weight consists of two parts: the term frequency TF(t, d), which equals the frequency of the word t in the document d, and the inverse document frequency IDF(t) of the word t, calculated as

IDF(t) = log(N / n_t),

where N is the number of documents in the corpus and n_t is the number of documents containing the word t. The final value is calculated as TF-IDF(t, d) = TF(t, d) · IDF(t). TF-IDF gives larger weights to words that appear in only a small number of documents and lower weights to words appearing in many documents. This is based on the intuition that rare words are characteristic of a certain topic. We use the TF-IDF representation of documents as a baseline approach, as it usually produces reasonably good results.
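For illustration, this baseline can be sketched with scikit-learn (the example abstracts are placeholders standing in for the full paper texts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder abstracts standing in for the full paper texts.
abstracts = [
    "We propose a deep learning method for graph classification.",
    "A kernel method for the classification of structured graph data.",
]

vectorizer = TfidfVectorizer()             # sparse BoW with TF-IDF weights
X = vectorizer.fit_transform(abstracts)    # document-term matrix
print(cosine_similarity(X))                # pairwise paper similarities
```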
A disadvantage of the BoW representation is that words are represented as independent dimensions, which means that the semantic similarities between words are not captured. For example, in the BoW representation, the words run and running are treated as different vector components even though they are semantically similar. Such information can be retained with dense word embeddings, which transform words into dense vectors in such a way that the distances between the vectors describe the semantic relations between the words. We present dense word, sentence, paragraph, and document embeddings below.

Word and paragraph vectors.
Some of the most popular word-embedding methods are the continuous bag-of-words (CBOW) and skip-gram models implemented in the word2vec tool [17]. Both approaches train a shallow neural network and extract network weights as word representation vectors. CBOW trains a neural network model to predict a word based on its preceding and succeeding words (the context), while the skip-gram model predicts the context based on the word. The weights of the trained neural network connected with a particular input word are used as its word embedding. As semantically similar words appear in similar contexts, they are assigned similar vectors that contain information about the semantic relationships between words. For example, if trained on a large enough dataset, the vector operation vector('Paris') − vector('France') + vector('Italy') results in a vector that is close to vector('Rome'). The result of the operation vector('Paris') − vector('France') represents the relation between a country and its capital city.
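For illustration, the analogy can be reproduced with the gensim implementation of word2vec (the toy corpus and parameter values are placeholders; meaningful results require a large training corpus):

```python
from gensim.models import Word2Vec

# Toy corpus: each entry is a tokenized document (placeholder data).
sentences = [["paris", "is", "the", "capital", "of", "france"],
             ["rome", "is", "the", "capital", "of", "italy"]] * 100

# sg=1 selects the skip-gram model; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# vector('paris') - vector('france') + vector('italy') ~ vector('rome'),
# given enough training data.
print(model.wv.most_similar(positive=["paris", "italy"], negative=["france"]))
```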
While word embeddings are useful, we cannot use them directly to measure the similarity between documents (i.e., papers), since the resulting vectors represent individual words and not entire documents. We can represent a document by taking the average of all its word embeddings as the document embedding. In our case this method did not produce good results concerning document similarity, as the resulting paper vectors were all close to each other. Another approach is to produce embeddings of larger text units. One such approach is doc2vec [3], which extends the CBOW and skip-gram models to paragraphs, sentences, or entire documents (instead of words). Instead of training a model to predict a word based on its immediate context, doc2vec trains a model to predict a word based on a unit of text of variable length, which allows it to create embeddings for larger units of text, such as sentences or paragraphs. We used this approach to construct document vectors from the full text of each paper.

Universal sentence embeddings.
In addition to the doc2vec embeddings [3], we also used the Universal Sentence Encoder (USE) [5] to obtain text representations. USE uses the state-of-the-art transformer neural network architecture [18], which is well-suited to obtaining contextual information from texts. Cer et al. [5] compute embeddings not only for sentences, but also for paragraphs and larger units of text. First, word embeddings are obtained from the embedding layer of the transformer architecture. Then, sentence or paragraph embeddings are obtained by computing the element-wise sum of the word representations at each word position. This approach was not designed for long texts such as scientific papers and gives poor results when generating a single embedding vector from an entire paper. To remedy this, we split the papers into smaller, n-character-long chunks. After calculating the vectors for these chunks, we obtain the vectors for full papers using one of the following two approaches: 1) the vectors for full papers are obtained by averaging the chunk vectors, and 2) in the text-classification setting, each chunk vector is classified individually and the median prediction is taken as the final result.
The first (averaging) approach was previously used to obtain vectors for larger text chunks with word2vec [41]. The second approach is less common, but is motivated by the fact that scientific papers can contain sections of text that are hard to categorize, such as equations and tables. By averaging the vectors, such sections might negatively impact the final representation vector. Since such sections rarely make up the majority of a paper, classifying each chunk individually and using the median of the predictions as the final result can be considered a de-noising method. A downside of this approach is that it is not suitable for clustering, so it cannot be used when constructing a conference schedule; it is therefore only useful to check whether the embeddings contain useful semantic information. We also use the second approach to classify papers using BERT embeddings.
To find the optimal length of the text chunks, we tested the performance for chunks of 1024, 2048, 4096, and 8192 characters using an internal 10-fold cross-validation on the training set (see the description of our datasets in Section IV-B). We used the embeddings calculated on the chunks to classify the papers into their research topics using a manually labeled dataset. The classification accuracy of this evaluation is presented in Table 1. The results show that we obtain better results by splitting the texts into smaller chunks, and that averaging the vectors works better than using the median of predictions for every chunk size. We obtained the best results with chunks of 2048 characters.
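A minimal sketch of the chunk-and-average strategy, assuming the TensorFlow Hub release of USE (the module URL names one published version; any release with the same interface works):

```python
import numpy as np
import tensorflow_hub as hub

# One published USE release; chosen here only for illustration.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def paper_vector(text: str, chunk_size: int = 2048) -> np.ndarray:
    """Split a paper into fixed-length character chunks, embed each
    chunk with USE, and average the chunk embeddings (the
    best-performing variant in our tests, with 2048-character chunks)."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    vectors = embed(chunks).numpy()   # one 512-dimensional row per chunk
    return vectors.mean(axis=0)       # document vector usable for clustering
```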

Fine-tuned BERT embeddings.
There are several embedding approaches based on transformer masked language models, such as BERT [2]. These language models are first pre-trained on large amounts of unlabeled text and then fine-tuned on a specific task. This approach produces state-of-the-art results on a variety of tasks, even with only a small amount of task-specific data [4]. In the classification setting, we follow the approach presented by Turc et al. [4]: we start with the original pre-trained BERT model [2] and append to it a single dense layer with the number of hidden neurons equal to the number of class values. We fine-tune this model on our training data using the AdamW optimizer [42]. Due to the large size of the BERT model and the memory limitations of GPUs, we cannot produce BERT embeddings for an entire text at once. Therefore, we split each document into chunks of 512 tokens, which matches the input length the model was pre-trained on.
We fine-tune the pre-trained model on our manually labeled paper datasets to classify the papers into their research topics. This updates the weights used in embeddings, capturing more semantic information relevant to the task. The paper datasets are described in Section IV.
We obtain the final document classification by splitting a document into chunks of 512 tokens, classifying each chunk, and taking the median of the chunk predictions. Unlike with USE, representing the entire document with the mean vector of all the chunks produced worse results for BERT than taking the median of the predictions.
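A sketch of the chunked classification with the Hugging Face transformers library; the checkpoint name stands in for our fine-tuned model, and the helper function is illustrative:

```python
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=39)  # 39 research topics in our dataset
model.eval()

def classify_paper(text: str, chunk_len: int = 512) -> int:
    """Split a paper into 512-token chunks, classify each chunk, and
    return the median of the per-chunk predicted class indices."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = chunk_len - 2                  # leave room for [CLS] and [SEP]
    preds = []
    for i in range(0, len(ids), step):
        chunk = tokenizer.build_inputs_with_special_tokens(ids[i:i + step])
        with torch.no_grad():
            logits = model(torch.tensor([chunk])).logits
        preds.append(int(logits.argmax(dim=-1)))
    return int(np.median(preds))         # median over chunk predictions
```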
Terminology extraction.
One of the approaches to identifying the topics of a research paper is to use terminology extraction, which identifies phrases in a given text that are specific to a certain field. Ignoring words found in a general vocabulary and focusing on domain-specific words might better determine the topics of a paper. A standard terminology-extraction approach is to use language-specific rules to identify terminology candidates and statistical approaches to select the best ones [43]. However, this requires language-specific rules and does not generalize well across languages.
Terminological expressions can be extracted using contextual word embeddings in an unsupervised manner [44,45]. Pre-trained contextual embeddings are available for many languages and can improve the performance in many NLP tasks [2]. As they can be used for unsupervised keyword extraction, additional task-specific training is not required, making such approaches language-independent.
We use the approach of Bennani-Smires et al. [45], which computes contextual embeddings for individual words and multi-word phrases. It identifies phrases relevant to the topic of a document by comparing the phrase embeddings to the embedding of the entire document; phrases whose embeddings are similar to the document embedding are selected as keywords. Additionally, the procedure ensures that the selected keywords are sufficiently diverse by measuring the distances between their vectors.
We adapted the approach for the specific task of automatic scheduling. As our goal is to form small groups of related papers that match the sessions in the conference schedule, terms that appear in too few or too many papers are not helpful. We removed the candidates that appeared in fewer than three papers or in over 15% of all papers. We also filtered out some frequent errors, such as candidates containing common journal or conference names, terms containing single-letter tokens (for example, graph G or matrix M), and terms shorter than five letters, which appeared due to errors when converting papers from PDF to raw text.
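A sketch of these filters (the thresholds match the ones stated above; the candidate data structure is our illustrative choice):

```python
def filter_candidates(candidates, n_papers):
    """candidates: dict mapping a term to the set of papers it occurs in."""
    kept = {}
    for term, papers in candidates.items():
        if len(papers) < 3:                              # too specific
            continue
        if len(papers) > 0.15 * n_papers:                # too general
            continue
        if any(len(tok) == 1 for tok in term.split()):   # 'graph G', 'matrix M'
            continue
        if len(term) < 5:                                # PDF-conversion artifacts
            continue
        kept[term] = papers
    return kept
```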
We also checked whether limiting the number of extracted terms would improve the overall performance. Extracted keywords were ranked using the statistical approach described by Penas et al. [43], who grade the terms based on:
• the relative frequency F_r(t) of the candidate term in research papers,
• the relative frequency F_g(t) of the candidate term in a general corpus, and
• the number of papers D(t) the candidate term appears in.
The terminology score S(t) of a candidate term t decreases with the relative frequency F_g(t) of the candidate in a general corpus, and increases with its relative frequency F_r(t) in research papers and with the number of papers D(t) it appears in. The terminology candidates are ranked by S(t), and we select the top-ranked candidates as terminology features. Our analysis, presented in Section IV, shows that using a small number of top-rated terms (i.e., 100) determines the similarity of papers better than using a large number of terms.
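The exact scoring function is given by Penas et al. [43]; purely as an illustration, one simple score with the monotonic behavior described above (our own assumption, not the formula from [43]) could be written as:

```latex
% Illustrative only: a score that grows with F_r(t) and D(t) and
% shrinks with F_g(t); the exact form follows Penas et al. [43].
S(t) = \frac{F_r(t)\, D(t)}{F_g(t)}
```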

B. GRAPH-BASED FEATURES
Useful information about paper similarity is contained in the papers' metadata, which can be extracted by representing the papers as graphs and applying network-analytic approaches. Commonly used metadata are citation networks, where nodes represent papers and paper u is connected to paper v if u cites v. Such graphs are useful when searching for similar papers [46]. Since citations between papers presented at the same conference are rare, we construct a graph based on bibliographic coupling [22], where papers u and v are connected if they cite a common paper. We weight the connections in this graph by the number of citations the connected papers share. The assumption behind these connections is that papers that share citations are more similar to each other than papers that do not.

Conference organizers have access to additional metadata that can be useful in determining the similarity of papers. An example of such metadata are the preferences expressed by reviewers in the bidding phase of the paper-evaluation period, when reviewers are asked to bid for the submitted papers they would prefer to review. Since reviewers prefer reviewing papers from their own field of research, the papers that the same reviewer bids for are likely to be similar. We captured the reviewers' preferences by constructing a co-bidding graph [29], where two papers are connected if the same reviewer expressed a preference to review them. We weighted the connections between the papers by the number of reviewers who expressed a preference to review both connected papers.
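A minimal sketch of building the weighted bibliographic-coupling graph with networkx (the mapping from papers to cited works is a placeholder; the co-bidding graph is built analogously from reviewer bids):

```python
from itertools import combinations

import networkx as nx

def bibliographic_coupling(references):
    """references: dict mapping a paper id to the set of works it cites."""
    G = nx.Graph()
    G.add_nodes_from(references)
    for u, v in combinations(references, 2):
        shared = len(references[u] & references[v])
        if shared > 0:                  # edge weight = number of shared citations
            G.add_edge(u, v, weight=shared)
    return G
```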
For the graph-based similarity information to be used together with the other extracted features, it has to be embedded in a vector form. We used two embedding methods, Personalized PageRank [30] and node2vec [31], and applied them to the bibliographic-coupling graph and the co-bidding graph.

Personalized PageRank (PPR).
This algorithm was originally designed to rank nodes by importance in a network of web pages. It constructs node representations using a random-surfer model, where a web surfer moves through the nodes by randomly following hyperlinks, returning to the starting set of nodes with a small probability. Using this model, the PPR algorithm calculates the probability of the surfer reaching every other node when starting from a specific set of nodes. These probability distributions, calculated separately for each node in a network, represent the nodes' embeddings. The Personalized PageRank score of every node satisfies the following equation:

R(u) = c · Σ_{v ∈ B_u} R(v) / N_v + (1 − c) · I(u),

where R(u) is the probability of reaching node u, B_u is the set of all nodes pointing to node u, N_v is the number of outgoing links from a neighboring node v, and 1 − c is the probability of returning to the starting set of nodes. I(u) maps a node u to the probability that the random surfer returns to it when it returns to the starting set of nodes. For example, if we define the starting set to consist of a single node s, then we can set I(s) = 1 and I(x) = 0 for x ≠ s. This approach can be applied to a citation network or a network of bibliographic coupling, as shown by Grčar et al. [47]. We compute the PPR vector R for each paper p by setting I(p) to 1 and all the other components of I to 0. This produces a vector over all the papers, where the papers close to p have higher values than the papers farther away from p. Two nodes in a network are similar if the PPR algorithm computes similar vectors for them. This is useful in the search for similar papers, as shown in our previous work [29].
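A sketch of computing the PPR vector of a single paper with networkx (the damping value c is illustrative):

```python
import networkx as nx

def ppr_vector(G, paper, c=0.85):
    """Personalized PageRank vector of one paper: the restart mass is
    concentrated on `paper`, i.e., I(paper) = 1 and I(x) = 0 otherwise."""
    personalization = {node: 0.0 for node in G}
    personalization[paper] = 1.0
    scores = nx.pagerank(G, alpha=c, personalization=personalization)
    return [scores[node] for node in sorted(G)]  # fixed node order (sortable ids)
```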

Node2vec.
A disadvantage of the Personalized PageRank vectorization is that it only considers the distances between the nodes of a network, while useful information can also be present in the shape of the network. Node2vec attempts to capture such information with its vectorization approach. It first defines the node neighborhood using a random walk. Let u be the starting node and c the sequence generated by a random walk, with c_0 = u. The random walk is generated using the following transition probabilities:

P(c_i = x | c_{i−1} = v) = π_vx / Z if (v, x) ∈ E, and 0 otherwise,

where E is the set of edges, π_vx is the unnormalized transition probability between v and x, and Z is the normalization constant such that the probabilities π_vx sum to 1. The probabilities π_vx are defined as π_vx = α_pq(t, x) · w_vx, where v is the current node of the walk, x the next node, and t the previous node. In weighted networks, w_vx is the weight of the edge connecting v and x; in unweighted networks, w_vx = 1. The factor α_pq(t, x), called the walk bias, is defined as:

α_pq(t, x) = 1/p if d_tx = 0,
α_pq(t, x) = 1 if d_tx = 1,
α_pq(t, x) = 1/q if d_tx = 2,

where d_tx is the unweighted distance between t and x, and p and q are parameters that influence the behavior of the search. Fig. 2 shows a graphical representation of the search bias. When q > 1, the search prefers nodes close to the start; if q < 1, nodes farther from the starting node are preferred; p determines how likely the search is to backtrack to already-visited nodes. With p > max(q, 1), the search will rarely backtrack. The parameters p and q allow for different search strategies and make node2vec more general than Personalized PageRank.

FIGURE 2. The node2vec walk bias. Node v is the current node and t is the previous node of the walk. Edges to t have a bias α_pq(t, x) = 1/p. Edges to neighbors of t have a bias α_pq(t, x) = 1. Other edges have a bias α_pq(t, x) = 1/q.
Using the neighborhoods generated by the random walks, node2vec trains a shallow neural network to predict the neighborhood of a given node. As with word2vec, the weights of the trained neural network are used as the vector representations of the nodes.
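A sketch using the open-source node2vec Python package, assuming its usual interface (graph and parameter values are illustrative):

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec

G = nx.karate_club_graph()  # stand-in for the bibliographic-coupling graph

# q < 1 favors exploring away from the start; p > 1 discourages backtracking.
n2v = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, p=2, q=0.5)
model = n2v.fit(window=10, min_count=1)  # trains word2vec on the random walks

vec = model.wv["0"]  # embedding of node 0 (this package stores ids as strings)
```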

C. CONSTRAINED CLUSTERING
After the extraction of features from the text and metadata, we use the obtained numerical representation to cluster similar papers into groups that match a predefined conference schedule. Commonly used clustering algorithms, such as k-means clustering [32] and mean shift [48], are not suitable for this task, since we cannot specify both the number and the size of the clusters. Constrained-clustering algorithms like COP k-means [49] are also not suitable, since they define constraints at the level of individual instances, e.g., that two instances cannot appear in the same cluster. Our approach is similar to the work of Ganganath et al. [50] (see Section II for differences). Our algorithm consists of two steps: i) a modification of the k-means clustering algorithm that allows constraints on the size of each cluster, and ii) an additional optimization that takes into account the structure of the schedule and reduces the overlap of papers with similar topics in parallel sessions. The details of the two steps are described below.
Initial constrained clustering.
Let C_1, C_2, ..., C_n be the n clusters we want the algorithm to return, and s_1, s_2, ..., s_n the desired sizes of these clusters. The clusters are initially empty. As in the k-means algorithm, each cluster C_i is assigned a mean µ_i. We select the initial means using the k-means++ algorithm [51], which modifies the initial center selection by sampling points with a probability proportional to the squared distance to the nearest already-chosen cluster center (as opposed to sampling uniformly at random, as in standard k-means). This initialization leads to better initial cluster centers and reduces the final error.
In the next step, we compute the potential error e_i: for each paper representation x_i, we compute the difference between the distance to its second-closest mean µ_i^2 and the distance to its closest mean µ_i^1:

e_i = D(x_i, µ_i^2) − D(x_i, µ_i^1).

The potential error e_i is incurred if we fail to assign a paper to its closest cluster (represented by µ_i^1) and have to settle for the second-best choice (represented by µ_i^2). This quantity is a realistic assessment of the error, and we use it in the subsequent optimization. The idea for this computation comes from the margin-based loss used in some classification approaches, e.g., random forests [52]. Using the potential error to guide the paper-placement process and the later local optimization is a novelty of our algorithm.
We sort the papers x_i according to their potential error e_i in descending order. Wrongly assigning papers with a large e_i produces a large error, so we prioritize their placement. We greedily assign the papers to their closest clusters in order of decreasing e_i, starting with the paper with the largest e_i. If we attempt to assign a paper to a full cluster, we mark the cluster as full and recalculate e_i for all unassigned papers that use that cluster in the calculation of their e_i. We re-sort the unassigned papers and continue the assignment process as long as any unassigned papers remain.
We improve the initial cluster assignment of the papers by local optimization, i.e., we greedily swap papers to reduce the overall clustering error. We define the clustering error ε as the sum of the distances between the papers and the mean µ_i of their cluster C_i:

ε = Σ_i Σ_{x_j ∈ C_i} D(x_j, µ_i).

To reduce ε, we iteratively perform the following steps:
1) Calculate new means. Each mean µ_i is recalculated as the centroid of its cluster:

   µ_i = (1 / |C_i|) Σ_{x_j ∈ C_i} x_j.

2) Calculate distances to means. For each paper x_j, we calculate the distance D(x_j, µ_i) to every mean µ_i.
3) Sort the papers. We sort the papers according to their potential error, i.e., the difference between the distance to their current cluster mean D(x_j, µ_i) and the distance to the closest mean outside their cluster D(x_j, µ_k). If D(x_j, µ_i) − D(x_j, µ_k) is larger than 0, the paper is closer to the mean of cluster k than to that of its current cluster i, and the error would decrease if the paper were reassigned to cluster k. Papers for which D(x_j, µ_i) − D(x_j, µ_k) is largest would benefit the most from being assigned to another cluster, so we attempt to reassign these papers first.
4) Swap the papers. Since we are constrained by the cluster sizes, we cannot simply reassign a paper to the better cluster k. Instead, we check whether a paper from cluster k would benefit from being assigned to cluster i; if such a paper exists, we swap the two papers. We process the papers in the order computed in step 3.

We repeat steps 1-4 until no swap improves the clustering error ε. Since we only perform a swap if it reduces the distances of both papers to their cluster means, the method is guaranteed to terminate. The pseudocode of the paper-swapping step is presented in Algorithm 1.

Algorithm 1 Pseudocode of paper swapping.
 1: X ← list of papers sorted by potential error
 2: for all x ∈ X do
 3:    if x is assigned to its closest cluster then
 4:        Remove x from X
 5:    end if
 6: end for
 7: for all x ∈ X do
 8:    µ ← mean of the cluster of x
 9:    k ← second-closest cluster of x
10:    L ← all papers in cluster k
11:    if L ≠ [] then
12:        for all x′ ∈ L do
13:            µ′ ← mean of the cluster of x′
14:            if (D(x, µ′) < D(x, µ)) and (D(x′, µ) < D(x′, µ′)) then
15:                swap the cluster memberships of x and x′
16:            end if
17:        end for
18:    end if
19: end for
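A compact Python sketch of the two steps (greedy size-constrained assignment ordered by potential error, followed by the swap-based local search of Algorithm 1); this is a simplified illustration under our own naming, not the exact NeSyChair implementation:

```python
import numpy as np

def constrained_kmeans(X, sizes, n_rounds=100, seed=0):
    """Size-constrained k-means sketch. Assumes sum(sizes) == len(X)."""
    rng = np.random.default_rng(seed)
    k = len(sizes)

    # k-means++-style init: sample centers proportionally to the squared
    # distance to the nearest already-chosen center.
    mu = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in mu], axis=0)
        mu.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    mu = np.array(mu)

    def greedy_assign(mu):
        D = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = np.full(len(X), -1)
        free = np.array(sizes, dtype=float)      # remaining cluster capacity
        unassigned = list(range(len(X)))
        while unassigned:
            Dm = np.where(free > 0, D[unassigned], np.inf)
            two = np.sort(Dm, axis=1)[:, :2]
            e = two[:, 1] - two[:, 0]            # potential error per paper
            j = unassigned[int(np.argmax(e))]    # riskiest paper first
            c = int(np.argmin(np.where(free > 0, D[j], np.inf)))
            labels[j], free[c] = c, free[c] - 1
            unassigned.remove(j)
        return labels

    labels = greedy_assign(mu)
    for _ in range(n_rounds):
        mu = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        D = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        swapped = False
        for a in range(len(X)):
            for b in range(a + 1, len(X)):
                ca, cb = labels[a], labels[b]
                # Swap only if both papers get strictly closer to a mean.
                if ca != cb and D[a, cb] < D[a, ca] and D[b, ca] < D[b, cb]:
                    labels[a], labels[b] = cb, ca
                    swapped = True
        if not swapped:
            break
    return labels, mu
```

Calling, e.g., constrained_kmeans(X, [5, 5, 3]) returns a clustering whose cluster sizes match three sessions of five, five, and three papers.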

Schedule-based constrained clustering.

In the next step we present a clustering method that takes into account the schedule of a conference. The approach described above produces clusters whose sizes match the slots of the conference, but it does not take into account which of these slots represent plenary or parallel sessions. Our goal is to generate a schedule where papers from the same topic do not occur in parallel sessions running simultaneously. We define another error function to evaluate the quality of the generated schedule with regard to that goal. If a conference has no parallel sessions, this step is skipped.
Ideally, the error function would be the total number of same-topic paper pairs assigned to two simultaneous parallel sessions. However, during clustering we do not know the exact topics of the papers. Instead, we assume that the ideal cluster of a paper is the one whose mean is closest to the paper; we therefore define the error as the number of paper pairs that belong to the same ideal cluster but are clustered into different simultaneous parallel sessions. To calculate this error, we count the number of overlaps between each pair of clusters.
We reduce this error in two steps. First, we perform a greedy slot assignment: we count the number of overlaps between each pair of clusters and assign the clusters to slots as follows:
1) Assign the clusters with the largest overlaps to plenary sessions, ensuring that no other session occurs at the same time.
2) Assign the clusters with the largest remaining overlaps to the first slot of each parallel session, ensuring that they do not overlap.
3) Assign the remaining clusters to the parallel session where they have the lowest overlap with the clusters already in the session.
The pseudocode of the procedure is presented in Algorithm 2.
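A sketch of the overlap count and this greedy assignment (the schedule encoding is our illustrative choice, not the listing of Algorithm 2 itself):

```python
import numpy as np

def overlap_matrix(labels, ideal):
    """overlap[i, j]: number of paper pairs sharing the same ideal
    (nearest-mean) cluster but assigned to clusters i and j."""
    k = max(labels) + 1
    ov = np.zeros((k, k), dtype=int)
    for a in range(len(labels)):
        for b in range(a + 1, len(labels)):
            if ideal[a] == ideal[b] and labels[a] != labels[b]:
                ov[labels[a], labels[b]] += 1
                ov[labels[b], labels[a]] += 1
    return ov

def greedy_slots(ov, n_plenary, n_blocks, rooms):
    """ov: overlap matrix; n_plenary plenary slots; n_blocks parallel
    blocks with `rooms` simultaneous sessions each (capacity assumed
    sufficient for all clusters)."""
    order = list(np.argsort(-ov.sum(axis=1)))     # most conflicting first
    plenary = [order.pop(0) for _ in range(n_plenary)]
    blocks = [[] for _ in range(n_blocks)]
    for block in blocks:                          # seed each parallel block
        if order:
            block.append(order.pop(0))
    for c in order:                               # place remaining clusters
        best = min((b for b in blocks if len(b) < rooms),
                   key=lambda b: sum(ov[c, other] for other in b))
        best.append(c)
    return plenary, blocks
```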
The approach is not guaranteed to minimize the error function. The error can be further reduced by swapping papers between clusters to reduce the overlap in parallel sessions. We treat this as another optimization problem and use a genetic algorithm to further reduce the error function. The initial clustering and greedy slot assignment are not strictly necessary; instead, we could have optimized the final schedule directly. This schedule-based optimization returns better schedules than the other two approaches as, besides the schedule structure, it also takes into account the parallel sessions. However, the error function used during the optimization is still an approximation of the real error function, which cannot be obtained without knowing the exact topics of the papers.
Several more complex algorithms could be used to perform the optimization (e.g., the approaches presented by Abualigah et al. [53] and Abualigah et al. [54]). However, our preliminary experiments show that the genetic algorithm used reduces the error function to almost zero in all the test cases, making the use of more advanced approaches unnecessary.
Another way to calculate the error function would be to perform a topic assignment beforehand (e.g., using the LDA algorithm [55]). In principle, this would better approximate the true paper topics. However, such an approach can be problematic for broad topics with multiple sub-fields (e.g., deep learning). Such topics would need to be split into properly sized sub-topics that fit the conference schedule, and the problem would translate from constrained clustering to constrained topic assignment. We leave the testing of this approach for future work.

IV. RESULTS AND DISCUSSION
An objective evaluation of the created schedules is difficult. A comparison of automatically generated schedules with the ones used at actual conferences could be misleading, since humans can produce many different acceptable schedules, even when semantically equivalent permutations are taken into account. Besides, a dataset of actual schedules and conference papers does not exist. To overcome these problems, we first present a newly created database of research papers from the field of ML in Section IV-A.
We use this database to evaluate the two main components of our approach: finding similar papers (Section IV-B) and constrained clustering (Section IV-C). We test the approach as a whole on datasets from actual conferences in Section IV-D. Our approach is based on contextual cross-lingual representations of text and terminology; Section IV-E contains the evaluation in a multilingual setting. Finally, in Section IV-F, we comment on the implementation and reuse of the NeSyChair system.

A. DEVELOPMENT DATABASE OF MACHINE-LEARNING PAPERS
To evaluate how well we can detect papers belonging to similar research topics, we constructed a database of research papers. From this database we extracted several types of features and evaluated them in Section IV-B. We used papers from six ML conferences. Table 2 presents the number of papers from each conference. ECML-PKDD 2015 and ECML-PKDD 2016 contained more papers than those listed, but we only selected those that are publicly available.
We labeled each paper with its field of research so that we could evaluate how good our approach is at finding similar papers. For the papers from the ECML-PKDD 2015, ECML-PKDD 2016, and ICML 2016 conferences, we used the labels of the sessions the papers were presented at during the conference. In cases where a label did not appear in all three conferences, we labeled the paper with the closest field from the ICML 2016 conference. For fields with direct counterparts, we renamed the field to the name used in the ICML 2016 conference (e.g., we renamed bandits to bandit learning). If a field did not have a direct counterpart (e.g., Rich Data from ECML-PKDD 2015), we discarded the field and the papers that appeared in it. Table 3 shows the session labels of those conferences. In addition, to increase the variability of the content, we manually labeled a selection of publicly available papers from the SIGKDD, AISTATS, and NIPS conferences (from 2014-2016) by using the same set of labels. We only labeled the papers with clearly defined research areas. Specifically, we selected 20 papers from each of the fields deep learning, networks and graphs, kernel methods, optimization, probabilistic methods, reinforcement learning, time series analysis, supervised learning, topic models, and unsupervised learning.

B. EVALUATING PAPER SIMILARITY WITH FEATURE SELECTION AND CLASSIFICATION
To evaluate our approach for finding semantically similar papers, we treat it as a classification problem, using the database of scientific papers described in Section IV-A. As the database contains 615 papers belonging to various ML fields, it is large enough to cover most real, large conferences. We use it to compare several combinations of the proposed text and metadata representations and feature-selection methods. In our classification dataset, we use the session labels from the ICML 2016 conference as class values. We extracted several types of features and evaluated them in two ways:
1) By using feature-selection algorithms to identify the best features. We used three methods: analysis of variance (ANOVA), ReliefF, and mutual information (MI). ANOVA [56] measures how features differ statistically between the classes. ReliefF [57] weights features according to their ability to distinguish between near instances with different class values. MI [58] is an information-theoretic function measuring the mutual information between the feature and the class random variables. In our tests, ANOVA returned the best results.
2) By training a classifier. We trained several classifiers (logistic regression (LR), random forests (RF), and a fine-tuned BERT neural network model) on different subsets of features. The classification accuracy served as the criterion to determine the most beneficial feature subset for finding similar papers.
The best features chosen by each feature-selection method were evaluated using the classification accuracy of the trained models. We evaluated the following feature groups:
• components of BoW vectors weighted with TF-IDF,
• the bibliographic-coupling network converted to vector form using node2vec,
• the bibliographic-coupling network converted to vector form using PPR,
• occurrence vectors of the extracted terminology,
• the co-bidding graph of reviewers converted to vector form using node2vec,
• the co-bidding graph of reviewers converted to vector form using PPR,
• sentence, paragraph, and document embeddings obtained with four different models (skip-gram, doc2vec, USE, and BERT),
• all the extracted features, and
• the best x features selected using feature-selection algorithms, for various values of x.
Table 4 shows how many features of each type were included in the top 1000 features selected by the different feature-selection methods. All methods selected the majority of the features from the BoW representation.
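For example, ANOVA-based selection of a fixed number of features can be sketched with scikit-learn (the feature matrix X and topic labels y are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# X: concatenated text/graph features, y: research-topic labels (placeholders).
pipeline = make_pipeline(
    SelectKBest(f_classif, k=1500),                 # ANOVA F-test selection
    RandomForestClassifier(n_estimators=500, max_depth=None),
)
scores = cross_val_score(pipeline, X, y, cv=10)     # 10-fold CV accuracy
print(scores.mean())
```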
We evaluated each group of features and several combinations of the best features using several classifiers. The database of research papers served as a training set to predict the research field of a given paper. We used different combinations of features and three classification algorithms: LR [59] with l2 norm, RF [52] with 500 trees and an unlimited maximum depth, and support vector machines (SVM) [60] with the linear kernel. Note that LR functions exactly the same as the last softmax layer of neural networks. Additionally, we train an English BERT model following the implementation presented by Turc et al. [4]. We report these results separately at the end of the section.
The classification accuracy estimated with the 10-fold cross-validation is presented in Table 5. In total, the database contains papers from 39 research fields, with the default classifier achieving the classification accuracy of 6.1% (the proportion of the most frequent class). We obtained the best results with the RF classifier on the best 1500 features selected by ANOVA. Among the individual groups of features, the extracted terminology features and bibliographic coupling network features produced good results when using LR. The BoW vectors weighted with TF-IDF worked well in combination with other features and individually when using RF. The doc2vec representation performed poorly with all the algorithms. The vectors obtained with USE performed better and gave the best results when splitting the papers into smaller chunks of text while using RF.
The co-bidding features are not directly comparable with others, as we could only test them on 136 papers from the ECML-PKDD 2017 conference (we did not have access to the reviewers' biddings from the other conferences).
By using the top 1500 features, we achieved a classification accuracy of 57.3%. The classifiers were most accurate when classifying distinct research fields, such as kernel methods and topic models. Papers from sub-fields were often wrongly assigned to a broader field (e.g., papers from deep learning and vision and from deep learning computations were often assigned to deep learning). Another problem is that papers often combine approaches from different fields, which can lead to misclassifications; for example, due to the popularity of deep learning, papers from various areas use deep-learning methods. The misclassifications are illustrated in Fig. 3, which shows the confusion matrix of the RF classifier trained on the top 1500 features. The classifier was trained on 70% of the dataset and tested on the remaining 30%. The numbers off the diagonal represent misclassifications; we can observe that many fall into similar topics.
When constructing a conference schedule, misclassifications into similar topics are often unproblematic. A session containing papers from various sub-areas of deep learning could be acceptable, since all the papers relate to deep learning. A paper that primarily deals with some other field (e.g., time-series analysis) but uses deep-learning methods could be placed in the same session because it is related to deep learning. To account for this, we also evaluated our approach with the top-N classification accuracy, where a prediction is considered correct if the ground-truth class is within the top N predicted classes. In this test we used the same RF classifier as for the construction of the confusion matrix. The results are shown in Table 6.

Fine-tuning BERT model
Unlike for the other types of features, for the BERT-based representation we perform the classification using an additional softmax layer of the neural network [4]. We train the model for 10 epochs using a batch size of 32. Due to the relatively small size of our dataset compared to the large number of classes (645 papers split into 39 research topics), longer training leads to overfitting. We achieve a classification accuracy of 40%, which is not competitive with the best classifiers using selected subsets of features.

C. CLUSTERING EVALUATION ON SYNTHETIC DATA
The performance of the paper clustering depends on the features used. Ideally, papers with similar topics would have similar features. Unsurprisingly, preliminary results showed that our features cannot produce perfect clusters. Errors in clustering can result from imperfect features or from shortcomings of the clustering algorithm. To separate the two, we first evaluate the clustering algorithms in a controlled environment, using a synthetic dataset that mimics the distribution of features in the real datasets. For each of the M groups, representing topics in our dataset, we generate N two-dimensional points, representing papers. Each group is sampled from a two-dimensional Gaussian distribution with a mean vector µ and a covariance matrix Σ = σ²I, where I is the identity matrix. We vary N, M, and σ of the Gaussian distributions to monitor the algorithms' behavior with different input data. The combinations (N, M, σ) used in our tests are listed in Table 7. The parameters were chosen so that the generated groups overlap; this makes correct clustering challenging, but reflects the multiple topics present in the actual papers. We compare several variations of the clustering algorithms described in Section III-C to determine which is the most suitable for our task. We evaluate the following variants:
1) random clustering, used as a baseline,
2) initial constrained clustering,
3) greedy slot assignment, and
4) schedule-based optimization.
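A sketch of generating one synthetic dataset under these assumptions (isotropic Gaussians; drawing the means uniformly from the unit square is our illustrative choice):

```python
import numpy as np

def synthetic_dataset(N, M, sigma, seed=0):
    """Generate M topic groups of N two-dimensional points each,
    sampled from isotropic Gaussians with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    points, topics = [], []
    for m in range(M):
        mu = rng.uniform(0, 1, size=2)          # illustrative mean placement
        points.append(rng.normal(mu, sigma, size=(N, 2)))
        topics.append(np.full(N, m))
    return np.vstack(points), np.concatenate(topics)

X, ground_truth = synthetic_dataset(N=20, M=10, sigma=0.2)  # the (20, 10, 0.2) setting
```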
To evaluate the performance of the different variants, we compute an error function that counts the number of overlaps (i.e., papers with the same topic) between each pair of clusters, as described in Section III. The goal is to reduce the number of papers from the same topic occurring simultaneously in parallel sessions of the schedule, as such collisions prevent conference participants from attending all the presentations from their field of interest.
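One plausible reading of this error function is sketched below: for every pair of clusters, count the papers that share a topic across the pair. Counting each shared topic min-wise is our assumption; the precise definition is given in Section III. In the schedule-based variant, the sum runs only over pairs of sessions that are scheduled in parallel.

    from collections import Counter
    from itertools import combinations

    def overlap_error(clusters, topic_of):
        """Sum, over all pairs of clusters, of the same-topic overlaps.
        clusters: list of lists of paper ids
        topic_of: mapping from paper id to (estimated or true) topic."""
        error = 0
        for a, b in combinations(clusters, 2):
            ca = Counter(topic_of[p] for p in a)
            cb = Counter(topic_of[p] for p in b)
            # topics present in both clusters contribute their overlap
            error += sum(min(ca[t], cb[t]) for t in ca.keys() & cb.keys())
        return error

    # toy example: the two sessions share one 'deep learning' paper each
    sessions = [["p1", "p2"], ["p3", "p4"]]
    topic_of = {"p1": "dl", "p2": "kernels", "p3": "dl", "p4": "vision"}
    print(overlap_error(sessions, topic_of))  # -> 1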
In the synthetic datasets, the ground-truth research areas are the Gaussian distributions from which the points (i.e., papers) are generated. Table 8 shows the errors of several clustering variants on four synthetic datasets. We report two values: i) the estimated error err_est (i.e., using clusters to approximate topics), which is computed in the optimization algorithm as a proxy for the actual error, and ii) the actual error err_act, computed from the ground-truth Gaussians that generated the points (i.e., topics). The actual error err_act is not used during the optimization, as in practice it is inaccessible; we use err_est as its proxy so that there is no need to manually assign exact sub-topics to papers. For the synthetic data we can report err_act and show how well the estimated error approximates the actual one.
The results are encouraging and show that each successive step of the proposed clustering approach in almost all cases reduces both the estimated error used in the optimization and the actual error calculated from the ground-truth generating distributions. The estimated and actual errors are strongly correlated, although the actual error is larger. In some cases, e.g., the (20, 10, 0.2) triplet in Fig. 8, the final optimization step increases the actual error despite the estimated error decreasing. This happens when one or more Gaussian distributions overlap, making it difficult to determine the actual generating distribution of the points; in such cases, the estimated error is an inadequate estimate of the actual error. Similar situations can occur in real-world data with papers from similar or overlapping research fields (e.g., 'machine learning' and 'deep learning'), and they are problematic for any clustering approach.

D. ENTIRE NESYCHAIR SYSTEM EVALUATION
We evaluated the NeSyChair conference-scheduling system on a real-world use case, i.e., the papers accepted and presented at the ECML-PKDD 2017 conference. Because of our involvement in the organization of this conference, we have all the data and metadata available for this experimental evaluation. We used the predefined structure of the conference's schedule, consisting of 136 paper presentations split into 29 sessions. The structure of the schedule is presented in Table 9. All the sessions, except four 60-minute sessions on the second day, were 100 minutes long. Each paper presentation took 20 minutes, so sessions contained either five or three presentations each.

Following the procedure described in Section III, we first construct the features, learn the papers' vector representations by concatenating the representations of the different feature types, and run the clustering algorithms. We use the session labels from the ECML-PKDD 2017 conference as the ground truth instead of the Gaussian distribution data. The results are shown in Table 10.

TABLE 10. Results on the actual ECML-PKDD 2017 data.

    Method                            err_act   err_est
    Random clustering                     376       115
    Initial constrained clustering        854       346
    Greedy slot assignment                 89        92
    Schedule-based optimization            72         2
As in the previous component testing, both the actual and estimated errors are substantially reduced by the greedy slot assignment and further by the schedule-based optimization. As before, the actual error is larger than the estimated error, but the two closely correlate; again, this is due to the overlap between research fields and the multiple topics contained in many papers. The low estimated error indicates that the selected features do not provide much additional information that would enable further improvements in topic prediction. Manual inspection of the produced schedule shows that if the NeSyChair scheduler were used to produce a draft schedule, the draft would already represent a sensible schedule. Using the provided user interface, conference organizers could easily modify it into a production schedule, or interactively freeze certain parts and let the system fill in the rest.
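To illustrate the greedy step, the sketch below places each session into the parallel slot where it adds the least estimated overlap with the sessions already assigned there. This is a simplified rendering of the algorithm from Section III-C, under our assumptions: topic_of holds estimated topics (e.g., initial cluster labels), since the true topics are unavailable at scheduling time, and there are enough slots for all sessions.

    from collections import Counter

    def pairwise_overlap(c1, c2, topic_of):
        """Same-topic overlap between two sessions (min-wise count)."""
        a = Counter(topic_of[p] for p in c1)
        b = Counter(topic_of[p] for p in c2)
        return sum(min(a[t], b[t]) for t in a.keys() & b.keys())

    def greedy_slot_assignment(sessions, n_slots, rooms_per_slot, topic_of):
        """Place each session into the parallel slot where it adds the
        least overlap; largest sessions first, so the hardest placements
        happen while most slots are still open. Assumes
        n_slots * rooms_per_slot >= len(sessions)."""
        slots = [[] for _ in range(n_slots)]
        for session in sorted(sessions, key=len, reverse=True):
            best = min((s for s in slots if len(s) < rooms_per_slot),
                       key=lambda s: sum(pairwise_overlap(session, c, topic_of)
                                         for c in s))
            best.append(session)
        return slots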

E. MULTILINGUAL EVALUATION
Our system does not need a labeled dataset and uses only cross-lingual and language-independent components, which makes it suitable for languages other than English.
To evaluate how well NeSyChair works on languages other than English, we applied it to a real-world conference with parallel sessions and publicly accessible papers: the Language Technologies & Digital Humanities 2016 (JTDH 2016) conference [61]. This Slovenian conference comprises 32 Slovene research papers split into 8 sessions. The conference also included a number of student papers and English papers; since the English papers were presented separately from the Slovene papers in the original schedule, we excluded both groups from the clustering. The schedule contains 4 parallel slots of Slovene presentations, each containing two simultaneous sessions, and each session contained between three and five papers.
Due to its small size, JTDH 2016 does not assign specific topic names to sessions.¹ All the papers are from the field of natural language processing, and no specific sub-fields are assigned to any of the sessions. However, each session still contains related papers (e.g., one session contains papers related to speech processing, while another contains papers related to digital humanities), so we use the sessions as the ground-truth labels.
In assigning the papers to the schedule, we followed the approach presented in Section IV-D, but used the actual schedule structure of JTDH 2016. The results of the evaluation are presented in Table 11. As with the English conference, each step of our approach reduced both the estimated error and the ground-truth error. This shows that our approach works on non-English articles as well. Both the estimated and the ground-truth errors are lower than for ECML-PKDD 2017 due to the smaller number of papers.

¹ http://www.sdjt.si/wp/dogodki/konference/jtdh-2016/urnik-2016/

F. IMPLEMENTATION AND REUSE
We implemented our approach as a web application. NeSyChair allows conference organizers to define or reuse a conference-schedule structure and use the automatic scheduler to assign accepted papers to the schedule. The NeSyChair application is publicly available under a permissive license².

V. CONCLUSIONS
The paper presents the NeSyChair system, which automatically constructs conference-schedule drafts using a combination of machine learning, natural language processing, network analysis, and combinatorial optimization. The system assigns papers to a predefined schedule structure by minimizing the number of papers on the same topic occurring simultaneously in parallel sessions. To implement the automatic scheduler, we solved two problems. First, we applied methods from NLP and ML to identify papers that address similar research topics, and network analysis to extract information from the available metadata. Second, we assigned papers to the conference schedule using a modified constrained k-means clustering algorithm that takes into account the size and the number of clusters, and further optimized the initial schedules by taking into account the structure of the schedule and the overlap between papers with similar topics. We evaluated the components of our approach on newly constructed synthetic data and on a database of research papers from six artificial intelligence and ML conferences. The entire system was used to reconstruct the schedule of the ECML-PKDD 2017 conference, using the accepted papers, the metadata, and the actual structure of the conference. The multilingual capability of the system was demonstrated on the Slovene JTDH 2016 conference. The developed datasets are available on request and can be used for future research.
In practice, the NeSyChair application is not entirely automatic due to additional unexpressed preferences, e.g., participants arriving late or leaving before the end of the conference. In such cases, the produced schedule represents an excellent starting point for the organizers. The provided user interface allows for iterative schedule construction and takes manual entries into account when automatically filling in the remaining slots.
The proposed approach can be improved in several ways. It might benefit from additional features from the fields of semantic analysis and network analysis; further useful features, such as keywords manually selected by the papers' authors, could be extracted from the submitted papers. Conference organizers would also benefit if the proposed automatic scheduling were integrated into existing conference-management tools, such as EasyChair.
For predicting a paper's research field, we currently predict a single field per paper, while research papers often integrate approaches from several fields. Multi-label classification, where each paper can be labeled with an arbitrary number of research fields, might be a suitable alternative, providing useful information to conference organizers and additional features for clustering papers into sessions. In future work, we plan a more extensive evaluation that takes multi-label classification into account, and we plan to explore whether similar approaches can further improve the schedule-based constrained clustering.