Multi-Label Classification With Hyperdimensional Representations

Hyperdimensional computing (HDC) is a computational paradigm that leverages the mathematical properties of high-dimensional vector spaces to manipulate data as symbolic entities using a set of neurally plausible operations. Although HDC has demonstrated remarkable success in cognitive tasks, its potential in complex applications such as multi-label classification has yet to be explored. In this paper, we introduce three approaches to multi-label classification that strike a balance between computational efficiency and accuracy, based on the complexity of the problem. The first approach we propose is Power Set HD, a transformation method that is ideal for small-scale multi-label classification with label cardinality less than four and label set size less than ten. The second approach, One-vs-All HD, is another transformation method that is suitable for slightly more complex tasks with higher label cardinality, providing a better efficiency-accuracy trade-off than Power Set HD. However, due to the expensive linear complexity scaling of One-vs-All HD, we propose a novel neural approach called TinyXML HD for extreme-scale tasks. This method learns hyperdimensional representations by decomposing the learning problem into multiple sub-problems, which are solved neurally through gradient-based optimization. Importantly, TinyXML HD fixes the output size of the model to the dimensionality of the hypervector, regardless of the label size, thereby scaling only by a small constant when evaluated on datasets with extremely large label spaces. Our approaches offer a valuable trade-off between computational efficiency and accuracy. We show that our methods provide a speedup of 16-60x on state-of-the-art datasets, while maintaining comparable accuracy. Furthermore, our methods yield models that are 56x smaller on medium-scale tasks and up to 836x smaller on extreme-scale datasets, which is a significant reduction in model size while still achieving high accuracy.


I. INTRODUCTION
Hyperdimensional computing (HDC) is an emerging paradigm of computing that offers a promising alternative to traditional machine learning approaches. In recent years, HDC has garnered significant interest due to its low computational overhead and hardware-friendly nature [1], [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Julien Le Kernec.
HDC employs low-precision sparse representations and simple arithmetic operations to manipulate high-dimensional vectors, making it amenable to hardware acceleration. The independent and identically distributed (iid) nature of these representations further enables efficient parallelization, leading to improved computational efficiency. There is a large body of work showing the benefits of HDC acceleration in hardware for various applications in IoT [3], [4], [5] and machine learning [6], [7], [8]. As a result, HDC has gained popularity in various domains, including natural language processing [9], [10], biomedical applications such as DNA pattern matching [11] and protein alignment [12], and robotics [13]. Despite its success, the applicability of HDC to complex tasks like multi-label classification, which has real-world applications in recommender systems and document classification, has not been explored.
Our research presents the first comprehensive exploration of multi-label learning problems utilizing HDC representations. We introduce three novel approaches that strike an optimal balance between computational efficiency and accuracy. Our first approach, Power Set HD, is a transformation method that achieves exceptional accuracy and efficiency on datasets with small label spaces and a limited subset of possible label combinations. For small datasets with larger label cardinality (≥4), we propose One-vs-All HD, which reduces the exponential complexity scaling of Power Set HD to a linear scale on the label set size, making it ideal for datasets with label sizes up to 30. In addition, we present TinyXML HD, a neural approach to learning mappings between hypervectors by decomposing the problem into multiple sub-problems. TinyXML HD fixes the output dimensionality of the model independent of the label size or cardinality, making it an ideal candidate for extreme multi-label classification problems.
HDC leverages neurally plausible representations of data and associates abstract concepts with high-dimensional vectors to perform complex cognitive tasks. The two fundamental operations of HDC are "bundling" and "binding" [2]. Bundling (denoted by ⊕) is used to represent multiple symbolic entities (hypervectors) using a single hypervector, while binding (denoted by ⊗) associates one entity with another.
We leverage the Multiply-Add-Permute (MAP) architecture proposed by Gayler [14], which uses bipolar representations for HDC. MAP represents data using high-dimensional vectors $X \in \{+1, -1\}^D$ called hypervectors. Gayler demonstrated that by assigning hypervectors a conceptual meaning, we can represent conceptual relationships using these operators. For example, the sentence "Yoda is a Jedi and Leia is a princess" can be represented as H = Yoda ⊗ Jedi ⊕ Leia ⊗ princess. HDC also allows us to query and reason about expressions. For instance, to find who is a Jedi, we bind H with the inverse of Jedi, which for bipolar hypervectors is Jedi itself: H ⊗ Jedi ≈ Yoda. The result is a hypervector that is approximately equal to Yoda.
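The query above can be illustrated with a minimal sketch. It assumes random bipolar hypervectors, a dimensionality of D = 10,000 and a cosine readout, all chosen for illustration rather than taken from the paper; the helper names (`random_hv`, `bind`, `bundle`, `cosine`) are hypothetical.

```python
import random

D = 10_000  # dimensionality chosen for this sketch, not taken from the paper
rng = random.Random(0)

def random_hv():
    """Random bipolar hypervector in {+1, -1}^D."""
    return [rng.choice((-1, 1)) for _ in range(D)]

def bind(x, y):
    """MAP binding: element-wise product."""
    return [a * b for a, b in zip(x, y)]

def bundle(x, y):
    """MAP bundling: element-wise sum (left unquantized in this sketch)."""
    return [a + b for a, b in zip(x, y)]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = (sum(a * a for a in x) * sum(b * b for b in y)) ** 0.5
    return dot / norm if norm else 0.0

names = ["Yoda", "Jedi", "Leia", "princess"]
hv = {name: random_hv() for name in names}

# H = Yoda (x) Jedi (+) Leia (x) princess
H = bundle(bind(hv["Yoda"], hv["Jedi"]), bind(hv["Leia"], hv["princess"]))

# "Who is a Jedi?" Bipolar binding is its own inverse, so bind H with Jedi again.
answer = bind(H, hv["Jedi"])
scores = {name: cosine(answer, hv[name]) for name in names}
best = max(scores, key=scores.get)
print(best, {name: round(s, 2) for name, s in scores.items()})
```

The unbound result equals Yoda plus a noise term (Leia ⊗ princess ⊗ Jedi), so its cosine similarity to Yoda is roughly 0.7, while the similarity to every other stored hypervector stays near zero.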
Gradient-based neural methods have demonstrated tremendous success in various learning tasks [15]. They provide a systematic approach to finding the minima of a function [16] and can be efficiently computed provided the function is differentiable. The MAP framework uses element-wise products or additions that are themselves differentiable; however, the quantization step that follows is not. Other HDC models, such as Holographic Reduced Representations (HRR) [17], have fully differentiable operations and have been studied with a neural gradient-based approach in the context of multi-label classification [18]. However, the HRR framework requires Fast Fourier Transform (FFT) operations, which increase complexity. The MAP model, in contrast, uses simple operations that can be accelerated using efficient bit-wise operations.
In [18], the authors utilized symbolic hypervectors to represent the labels and employed neural methods to learn a mapping from the instance space to the label space. To simplify the learning problem, we instead embed both inputs and labels in the same high-dimensional vector space, which can be learned more easily than mapping across different vector spaces [19], [20], [21]. This is because the model does not need to learn complex transformations to map between spaces, reducing the complexity of the learning problem. After embedding inputs and labels as hypervectors in the same vector space, TinyXML HD learns a mapping between the two using a 1-D convolutional neural network designed for processing hypervectors.
In this work we introduce three methods that show the potential of HDC to solve multi-label classification problems across the entire spectrum of complexity (small, medium, and large), as described below:
• The first approach, Powerset HD, is suitable for small-scale multi-label classification, where each possible label combination is instantiated as a separate binary learning problem, resulting in exponential scaling over label size. This approach yields high-accuracy models that scale well for datasets with a few label combinations.
• The second approach, One-vs-All HD, is another transformation method that relaxes the exponential scaling of Powerset HD to linear scaling over label size, resulting in models that are efficient and accurate for datasets with a label set size of up to 30. Beyond this limit, the training time increases significantly, making this method less suitable.
• For extreme-scale multi-label problems, we propose TinyXML HD, which utilizes a 1-D convolutional neural network to learn hypervector representations. By having a fixed output dimensionality independent of the label complexity, TinyXML HD achieves remarkable speedups in training. However, due to the relatively expensive convolution operations, the first two approaches provide a better trade-off between computational efficiency and accuracy for smaller datasets.
• Through rigorous evaluations on real-world datasets, we demonstrate the superiority of our proposed methods. Powerset HD and One-vs-All HD offer up to 60x speedup on small-scale datasets, while TinyXML HD yields models that are 56x smaller than the state of the art on medium-scale datasets and up to 836x smaller on extreme-scale datasets, all while maintaining comparable accuracy.
The rest of the article is organized as follows. We first review related work in Section II, highlighting the differences between our approach and existing methods. In Section III, we provide an overview of HDC helpful for understanding the rest of the article. We split the problem into two variants: micro multi-label classification and extreme multi-label classification. We first tackle micro multi-label classification in Section IV, where we discuss two simple problem transformation techniques, and examine their performance on trivial learning problems in Section VI-A. We provide an overview of extreme multi-label classification in Section V, followed by Section V-A, where we detail a new encoding method for representing text data as hypervectors. We then present our novel HDC convolution operator and neuro-symbolic approach in Section V-B, detailing its formulation and demonstrating its effectiveness in Section VI-B.

II. RELATED WORK
Hyperdimensional computing (HDC) is an emerging field that aims to address the limitations of traditional computing paradigms by leveraging high-dimensional vector representations to perform complex cognitive and machine learning tasks. This section presents a brief summary of various HDC works, highlighting their contributions to both cognitive tasks and machine learning tasks. We also present a brief survey of gradient-based algorithms and multi-label classification for the reader's benefit.

A. HDC FOR COGNITIVE AND LEARNING TASKS
Kanerva introduced the foundational concept of a "hyperdimensional computer", which efficiently stores and retrieves information using large, sparse binary vectors [2]. This model exhibited robustness and efficiency in cognitive tasks, inspiring further research in HDC. Gallant et al. explored HDC in natural language understanding and reasoning, successfully capturing semantic relations in text and showcasing its potential for large-scale knowledge representation [22]. Rachkovskij expanded HDC's application to image processing and recognition, demonstrating pattern recognition capabilities with high accuracy and noise robustness [23]. In [24], Anthony et al. develop a theoretical framework for HDC and detail the mathematical properties of HDC encoding methods.
In recent years, hyperdimensional computing (HDC) has emerged as a promising paradigm for machine learning tasks such as classification, regression, and reinforcement learning. Lai et al. employed HDC for classification tasks, developing a high-dimensional classifier that achieved competitive performance with reduced computational complexity [25]. Imani et al. applied HDC to regression problems, proposing a high-dimensional computing framework that provided accurate and efficient regression models with minimized computational overhead [26]. Goudarzi et al. explored HDC in reinforcement learning, developing a state representation and policy learning approach that demonstrated effectiveness in various environments [27]. Imani et al. proposed HDCluster, an accurate clustering algorithm for high-dimensional datasets using hyperdimensional computing [28]. In GENERIC [3], Khaleghi et al. proposed a novel and efficient method for learning on edge devices using hyperdimensional computing for a wide range of applications. The method utilizes hardware-friendly hyperdimensional vector representations and an optimized training algorithm to reduce computation and storage requirements while maintaining high accuracy. Guo et al. [7] proposed using hypervectors to represent users and items and performing a set of associative and distributive operations on these vectors to compute recommendations. The paper presents three different methods for generating recommendations, including one that combines hyperdimensional computing with matrix factorization. Asgarinejad et al. [29] developed a method for epilepsy detection using EEG signals. They demonstrate using real-world data that HDC approaches outperform state-of-the-art methods like Support Vector Machines (SVM) [30] and Convolutional Neural Networks (CNN) [31].

B. GRADIENT BASED HDC METHODS
Gradient-based methods in hyperdimensional computing (HDC) have garnered interest due to their potential for addressing optimization challenges in high-dimensional spaces. These methods extend the capabilities of HDC by incorporating gradient information to guide the learning process. For instance, Frady and Sommer introduced a gradient-based HDC framework, which allowed the use of optimization algorithms such as gradient descent and backpropagation in HDC settings [32]. In a subsequent work, Frady et al. proposed a method for gradient-based learning in HDC that utilized iterative projections and local linearizations to facilitate learning in high-dimensional spaces [33]. Building upon these developments, Wang et al. presented a gradient-based HDC algorithm for clustering and classification tasks, which employed a convex optimization formulation to enhance HDC's performance in these applications [34]. Moreover, Su et al. developed a gradient-based HDC algorithm for deep learning, illustrating the potential of gradient-based methods in improving the robustness and expressiveness of HDC models [35]. Recently, Zhou et al. presented a gradient-based HDC framework for unsupervised learning, focusing on clustering and dimensionality reduction tasks [36]. These studies highlight the increasing importance of gradient-based methods in HDC and their potential in addressing various learning tasks in high-dimensional spaces.
While these methods have shown promise in addressing optimization challenges in high-dimensional spaces, they introduce additional complexity in order to facilitate backpropagation through the HDC operations. For example, Frady and Sommer's work involves iterative projections and local linearizations, which can be expensive. Similarly, Wang's convex optimization formulation for clustering and classification tasks can result in increased computational overhead [34].
Prior works have also explored the use of Holographic Reduced Representations (HRR), a family of models, for gradient-based learning tasks. A notable attempt to capitalize on the symbolic properties of HRR was made by Nickel et al. [37], who utilized binding operations to link elements within a knowledge graph. Their approach served as an embedding mechanism that merged two vectors of information without increasing the dimensionality of the representation, as opposed to concatenation, which doubles the dimension. In a more recent study, Liao and Yuan [38] employed circular convolution as a substitute for standard convolution to decrease model size and inference time, albeit without leveraging the symbolic properties inherent to HRRs. Although Danihelka et al. [39] claimed to incorporate HRR into an LSTM, their methodology simply augmented an LSTM with complex weights and activations, and did not genuinely implement HRR due to the absence of circular convolution.

C. MULTI-LABEL CLASSIFICATION
The seminal works of multi-label classification emerged in the early 2000s with the introduction of the problem and initial approaches [40]. Since then, a wide range of algorithms and techniques have been proposed to tackle this problem, such as problem transformation methods [41] and ensemble methods [42], [43].
Among the various methods for multi-label classification, problem transformation methods have gained considerable attention. These techniques transform the multi-label problem into one or more single-label problems, which can then be addressed using traditional machine learning classifiers. One popular approach is the Binary Relevance (BR) method [44], which independently trains a binary classifier for each label. Another problem transformation method is the Label Powerset (LP) method [44], which treats each unique combination of labels as a single class in a multi-class problem. To address the shortcomings of BR and LP, researchers have proposed various ensemble and hybrid techniques. These include the Random k-Labelsets (RAkEL) method [45], which constructs multiple LP classifiers on random label subsets, and the Classifier Chains (CC) method [46], which constructs a chain of binary classifiers while preserving label correlations. Apart from problem transformation methods, other multi-label classification techniques include algorithm adaptation methods, which modify single-label algorithms to handle multi-label data directly. Examples of such methods are the Multi-Label k-Nearest Neighbors (ML-kNN) algorithm [47] and Multi-Label Decision Trees (MLDT) [48].
Extreme multi-label classification (XML) is a specialized form of multi-label classification, characterized by a large number of labels and instances. XML has attracted significant research attention due to its relevance in numerous real-world applications, such as large-scale document classification [49], image annotation [50], and gene function prediction [51]. Early approaches for XML include the FastXML algorithm [52], the PfastreXML algorithm [53], and the Parabel algorithm [54]. Embedding-based methods, such as the SLEEC algorithm [49] and the AnnexML algorithm [55], have also been proposed for XML.
Deep learning approaches have shown considerable promise in XML tasks. Convolutional Neural Networks (CNNs) [56], Recurrent Neural Networks (RNNs) [57], and Transformer models [58] have been adapted for XML problems, demonstrating improved performance compared to traditional methods. Specifically, BERT [58] and its variants have been successfully applied to large-scale text classification tasks.

D. MOTIVATION AND OUR CONTRIBUTIONS
In the existing literature on hyperdimensional computing (HDC), the majority of studies have focused on small-scale learning problems. Ganesan et al. [18] examined the extreme multi-label text classification task, but other works have yet to explore the scalability of HDC techniques in addressing large-scale machine learning problems in real-world applications. Our research aims to bridge this gap by investigating the application of HDC to a demanding, industrial-scale learning problem.
State-of-the-art deep learning models for multi-label classification, such as X-Transformer [59] and LightXML [60], comprising millions of parameters, necessitate days of training to achieve optimal performance. Our objective in this work is to examine HDC's potential for reducing this training time, thereby offering a more balanced trade-off between computational efficiency and accuracy.
Ganesan et al. [18] proposed a method for extreme multi-label text classification that replaces the final classification layer of AttentionXML [61] and XML-CNN [62] with a fully connected layer that outputs a hypervector encoding the relevant label information for an instance. While they demonstrated that their proposed method achieves accuracy similar to the baseline implementations of AttentionXML and XML-CNN, there are two key areas to improve upon. First, their method uses the HRR binding operation, which is a circular convolution requiring the Fast Fourier Transform [63], an expensive operation. Second, their method learns a mapping from the instance space (represented as a one-hot encoding) to the label space (represented as HRR hypervectors), which is a harder learning problem because it requires learning the projection across the vector spaces.
In contrast, our approach is based on the Multiply-Add-Permute (MAP) model introduced by Gayler [14], which uses bipolar representations with simple element-wise arithmetic operations that can be easily accelerated and parallelized on hardware. Additionally, we embed both the inputs and the labels in the same high-dimensional vector space, thereby avoiding the need to learn complex transformations across vector spaces. Our proposed neural approach for learning high-dimensional representations not only avoids increased computational complexity but also reduces the compute cost by a factor of 200.

III. HYPERDIMENSIONAL COMPUTING BACKGROUND
Hyperdimensional computing (HDC) is an emerging paradigm of computing that describes a family of representations and operations using high-dimensional vectors called hypervectors [2], [14], [17]. The basic idea behind HDC is to represent structured or symbolic data using hypervectors and then provide a set of mathematical operations to manipulate these vectors like symbolic objects. These operations are associative, commutative, and distributive [64]. They also operate element-wise, which allows them to be performed in parallel and makes HDC an attractive approach for implementing hardware-accelerated, energy-efficient computing.
Hypervectors are typically represented as binary or bipolar vectors in a high-dimensional space. Mathematically, a hypervector is represented by a vector $X \in \{+1, -1\}^D$, where $D$ is the dimensionality of the vector space. The dimensionality of the hypervector is often much larger than the number of dimensions required to represent the data, enabling the vector to encode many concepts or attributes in a single representation. For instance, a hypervector representing an object might contain attributes such as color, shape, texture, and position.
A. HDC OPERATIONS
HDC provides three fundamental operations: bundling, binding, and similarity check. These operations are implemented differently in various HDC models, and we will briefly explain their usage under the Multiply-Add-Permute (MAP) model. In the MAP framework, hypervectors are bipolar and can be represented as $X \in \{+1, -1\}^D$.

1) BUNDLING
The bundling operation is used to represent multiple symbolic entities using a single hypervector. This operation is denoted by the ⊕ symbol and can be expressed as:
$$\mathrm{bundle}(X_1, X_2, \ldots, X_n) = X_1 \oplus X_2 \oplus \cdots \oplus X_n$$
where $X_1, X_2, \ldots, X_n$ are hypervectors representing the symbolic entities. The result of the bundling operation is a new hypervector that represents the combination of all the input entities. For MAP, the bundling operation is a simple element-wise sum of the hypervectors. Under the similarity check metric defined below, the resultant hypervector is similar to its constituent hypervectors.

2) BINDING
The binding operation is used to associate one entity with another and is denoted by the ⊗ symbol. The binding operation is defined as the element-wise multiplication of two hypervectors and can be expressed as:
$$\mathrm{bind}(X, Y) = X \otimes Y = X \odot Y$$
where $X$ and $Y$ are the hypervectors representing the two entities to be associated and $\odot$ denotes the element-wise product. The result of the binding operation is a new hypervector that encodes the relationship between the two input entities. By the similarity metric, the resultant hypervector is nearly orthogonal to the input hypervectors.

3) SIMILARITY CHECK
Finally, the similarity check operation is used to determine the degree of similarity between two hypervectors. The similarity check operation is defined as the dot product between two hypervectors and can be expressed as:
$$\mathrm{sim}(X, Y) = X \cdot Y$$
where $X$ and $Y$ are the two hypervectors to be compared. The result of the similarity check operation is a scalar value that represents the degree of similarity between the two hypervectors, with higher values indicating greater similarity.
Together, these operations provide a powerful and flexible approach to representing and manipulating symbolic data in a distributed and parallel fashion, enabling the development of novel machine learning algorithms and cognitive models.
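The properties claimed for the three operations can be checked numerically. The following is a minimal sketch under assumed settings (random bipolar hypervectors, D = 10,000, and a similarity divided by D so that a hypervector has similarity 1 with itself); the function names are illustrative.

```python
import random

D = 10_000  # assumed dimensionality for this sketch
rng = random.Random(1)

def random_hv():
    return [rng.choice((-1, 1)) for _ in range(D)]

def bundle(*hvs):
    """Element-wise sum of any number of hypervectors."""
    return [sum(col) for col in zip(*hvs)]

def bind(x, y):
    """Element-wise product of two hypervectors."""
    return [a * b for a, b in zip(x, y)]

def sim(x, y):
    """Dot product divided by D, so sim(X, X) = 1 for bipolar X (scaling is an assumption)."""
    return sum(a * b for a, b in zip(x, y)) / len(x)

X1, X2, X3, Y = (random_hv() for _ in range(4))

S = bundle(X1, X2, X3)
B = bind(X1, Y)

sim_bundle_member = sim(S, X1)  # close to 1: the bundle resembles its constituents
sim_bundle_other = sim(S, Y)    # close to 0: Y was not bundled
sim_bound_input = sim(B, X1)    # close to 0: the bound vector is nearly orthogonal to its inputs
unbound = bind(B, Y)            # binding with Y again recovers X1 exactly

print(round(sim_bundle_member, 2), round(sim_bundle_other, 2), round(sim_bound_input, 2))
print(unbound == X1)
```

Orthogonality here is statistical rather than exact: for random hypervectors the similarity concentrates around zero with a spread on the order of 1/√D.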

B. HDC LEARNING
The HDC learning process involves the encoding of data and its inherent relationships within hypervectors. These vectors are then subjected to a set of mathematical operations, enabling the extraction of useful patterns and relationships within the data. Learning in HDC involves three steps: encoding the data, learning on the encoded data, and inference.

1) ENCODING
The first step in learning with HDC involves encoding the input data into high-dimensional hypervectors. The goal of this step is to create a distributed representation of the input data that can capture its semantic properties. Prior work has proposed various techniques for encoding data into hypervectors, each with different mathematical properties. One key distinction among these techniques is the way in which distance metrics are preserved within the encoded data when mapping to hypervectors. Here we provide a brief overview of our chosen encoding scheme, namely Random Projection Encoding (RPE) [24], which uses a Gaussian distribution to generate the encoding vectors.
Given an input vector $X \in \mathbb{R}^d$, we generate a set of random projection vectors $R_1, R_2, \ldots, R_K \in \mathbb{R}^d$, where the number of projections $K$ equals the hypervector dimensionality $D$ and $D \gg d$. The projection of $X$ onto the $k$-th random projection vector is given by the dot product $X \cdot R_k$. The resulting set of $K$ projections can be represented as a hypervector $H \in \{+1, -1\}^K$, where $H_i = \mathrm{sign}(X \cdot R_i)$. Random projection encoding in HDC has been shown to preserve Euclidean distance in the original vector space, mapping it to angular distance in the high-dimensional space [24]. This similarity-preserving nature makes it suitable for encoding data while retaining complex relationships between data points.
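A minimal sketch of this encoding step follows. The sizes (d = 4, D = 4096), the convention that a zero projection maps to +1, and the function names are assumptions made for illustration.

```python
import math
import random

def make_projections(d, D, seed=0):
    """D random projection vectors in R^d with i.i.d. standard Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]

def encode(x, R):
    """H_k = sign(x . R_k); a projection of exactly zero maps to +1 (assumption)."""
    return [1 if sum(a * b for a, b in zip(x, r)) >= 0 else -1 for r in R]

def similarity(h1, h2):
    """Normalized dot product of two bipolar hypervectors, in [-1, 1]."""
    return sum(a * b for a, b in zip(h1, h2)) / len(h1)

d, D = 4, 4096
R = make_projections(d, D)

x = [1.0, 0.0, 0.0, 0.0]
x_near = [1.0, 0.1, 0.0, 0.0]  # small angle to x
x_far = [0.0, 0.0, 1.0, 0.0]   # orthogonal to x

h, h_near, h_far = (encode(v, R) for v in (x, x_near, x_far))
sim_near = similarity(h, h_near)
sim_far = similarity(h, h_far)

# For sign random projections with Gaussian vectors, the expected similarity
# is 1 - 2*theta/pi, where theta is the angle between the two inputs.
theta = math.atan2(0.1, 1.0)
expected_near = 1 - 2 * theta / math.pi
print(round(sim_near, 3), round(expected_near, 3), round(sim_far, 3))
```

Inputs separated by a small angle map to hypervectors that agree in most positions, while orthogonal inputs map to hypervectors with similarity near zero.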

2) TRAINING
In HDC, training typically involves two steps: one-shot training followed by iterative retraining. One-shot training involves representing each class with a centroid hypervector that is the average of hypervectors representing the training examples for that class. Retraining involves updating the centroid hypervectors using a simple perceptron-style algorithm [65] in an iterative process that runs until convergence. During retraining, the centroid hypervectors are updated when a sample is mispredicted, with updates applied to both the correct label and the mispredicted label.
VOLUME 11, 2023

3) INFERENCE
The centroid vectors can be used to classify new data by measuring the similarity of the new data to each centroid vector. The class with the highest similarity measure is chosen as the predicted class for the new data point. Hamming distance similarity is used for binary hypervectors, while cosine similarity can be used for any other type of data.
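The training and inference steps can be sketched end to end on toy data. This is an illustration under stated assumptions, not the authors' implementation: the retraining rule adds a mispredicted sample to the correct centroid and subtracts it from the wrongly predicted one (a common perceptron-style choice), centroids are kept as sums rather than averages because cosine similarity ignores scale, and the "encoded samples" are noisy copies of random class prototypes.

```python
import random

rng = random.Random(0)
D = 2_000  # hypervector dimensionality for this toy example

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = (sum(a * a for a in x) * sum(b * b for b in y)) ** 0.5
    return dot / norm if norm else 0.0

def predict(h, centroids):
    """Inference: index of the most similar centroid."""
    return max(range(len(centroids)), key=lambda c: cosine(h, centroids[c]))

def one_shot(encoded, labels, num_classes):
    """One-shot training: accumulate each class's encoded samples into its centroid."""
    dim = len(encoded[0])
    centroids = [[0] * dim for _ in range(num_classes)]
    for h, c in zip(encoded, labels):
        centroids[c] = [a + b for a, b in zip(centroids[c], h)]
    return centroids

def retrain(centroids, encoded, labels, max_epochs=10):
    """Perceptron-style retraining until no training sample is mispredicted."""
    for _ in range(max_epochs):
        mistakes = 0
        for h, c in zip(encoded, labels):
            p = predict(h, centroids)
            if p != c:
                mistakes += 1
                # Assumed update rule: pull the correct centroid toward the
                # sample and push the wrongly predicted centroid away from it.
                centroids[c] = [a + b for a, b in zip(centroids[c], h)]
                centroids[p] = [a - b for a, b in zip(centroids[p], h)]
        if mistakes == 0:
            break
    return centroids

def noisy_copy(proto, flip=0.2):
    """Stand-in for an encoded sample: a prototype with a fraction of signs flipped."""
    return [-v if rng.random() < flip else v for v in proto]

protos = [[rng.choice((-1, 1)) for _ in range(D)] for _ in range(3)]
train_y = [c for c in range(3) for _ in range(10)]
train_x = [noisy_copy(protos[c]) for c in train_y]
test_y = [c for c in range(3) for _ in range(5)]
test_x = [noisy_copy(protos[c]) for c in test_y]

centroids = retrain(one_shot(train_x, train_y, 3), train_x, train_y)
accuracy = sum(predict(h, centroids) == c for h, c in zip(test_x, test_y)) / len(test_y)
print(accuracy)
```

On data this well separated, one-shot training already classifies every sample correctly, so the retraining loop exits after its first pass; it matters on harder datasets where centroids overlap.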

IV. MULTI-LABEL OvA AND POWERSET HD
Multi-label classification is a machine learning problem where an instance can belong to multiple classes simultaneously. Mathematically, it can be defined as follows: let $x$ be a feature vector representing an instance and $y$ be a binary vector indicating the presence or absence of $L$ possible class labels, where $y_i = 1$ indicates that the instance belongs to the $i$-th class and $y_i = 0$ indicates otherwise. The goal of multi-label classification is to learn a mapping function $f(\cdot)$ that takes as input an instance $x$ and outputs a binary vector of length $L$ indicating the classes the instance belongs to.
The complexity of this task can be influenced by various factors, such as the number of labels, the label dependencies, and the label cardinality. To distinctly refer to the class of problems with relatively small label spaces, we define a small-scale variant of multi-label learning. This variant deals with datasets where the label space is simple and the number of instances is relatively small, resulting in label set sizes of less than 100. In Section V we discuss the characteristics of the most challenging variant of multi-label classification, which focuses on very large problem sizes. Our prior work [66] laid the foundations of hyperdimensional multi-label classification by combining HDC with two well-studied problem transformation techniques, One-vs-All [67] and Label Powerset [68], to solve micro-size problems.
Problem transformation methods [44], [69], [70] have been proposed where the original multi-label problem is transformed into multiple single-label problems. Each transformed problem corresponds to one of the $L$ class labels and involves training a binary classifier to distinguish instances that belong to that class from those that do not. The output of each binary classifier is then combined to obtain the final multi-label prediction. These methods can be further classified into three categories: 1) One-vs-All [67], 2) Label Powerset [68], and 3) Classifier Chains [69], each with its own advantages and disadvantages. We consider the first two methods due to their simpler nature, which is appropriate for micro-size problems.
PowerSet HD and OvA HD both involve learning multiple binary classifiers and hence share a common implementation strategy. The difference between the two approaches lies in the way the class hypervectors are set up. We begin by encoding each instance in the dataset into a symbolic hypervector using Random Projection Encoding [24], as explained in Section III. We then perform one-shot learning, which involves learning the centroid hypervectors, followed by iterative fine-tuning, as detailed in Section III. The specific differences between the powerset and OvA approaches in this implementation are explained below.

A. POWERSET HD
The label powerset transformation method defines each unique combination of labels as a distinct class, represented by a binary class vector. This makes it possible to use standard single-label classification algorithms to train models on multi-label data. Formally, given an instance $x$ with $L$ possible class labels and original label vector $y$, this method creates a new binary class vector $y'$ of length $2^L$, representing all possible label combinations. For each unique combination of labels $C_j$, a binary label is assigned based on whether the combination is a subset of the original class labels of the instance $x$, as shown in Equation 1:
$$y'_j = \begin{cases} 1 & \text{if } C_j \subseteq y \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
where $C_j \subseteq y$ indicates that the class combination $C_j$ is a subset of the original class labels of the instance $x$. For example, if an instance has three possible class labels A, B, and C, then there are $2^3 = 8$ possible combinations of labels: {∅, A, B, C, AB, AC, BC, ABC}.
While the label powerset method is simple and easy to understand, it suffers from class imbalance and from poor scaling with the number of class labels, making it less practical for problems with large numbers of class labels. Nevertheless, it is still widely used as a baseline method for evaluating the performance of other, more advanced multi-label learning methods.
To implement the power set transformation with HDC, we create a centroid hypervector for every label combination, resulting in $2^L$ centroid hypervectors. Retraining is done on each centroid hypervector individually as an independent binary classifier. During inference, we encode the test instance and compare it with each of the centroid hypervectors using a similarity check function. The closest centroid indicates the relevant label combination.
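The centroid setup and inference described above can be sketched as follows. As a simplification that is an assumption of this sketch, centroids are instantiated only for label combinations that actually occur in the training data rather than for all $2^L$ combinations, only the one-shot step is shown, and the encoded instances are toy noisy copies of a random prototype per combination.

```python
import random

rng = random.Random(0)
D = 2_000  # assumed hypervector dimensionality

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = (sum(a * a for a in x) * sum(b * b for b in y)) ** 0.5
    return dot / norm if norm else 0.0

def train_powerset(encoded, label_sets):
    """One centroid per label combination seen in training (one-shot step only)."""
    centroids = {}
    for h, labels in zip(encoded, label_sets):
        key = frozenset(labels)
        acc = centroids.get(key, [0] * len(h))
        centroids[key] = [a + b for a, b in zip(acc, h)]
    return centroids

def predict_powerset(h, centroids):
    """Return the label combination whose centroid is most similar to h."""
    return max(centroids, key=lambda key: cosine(h, centroids[key]))

def noisy_copy(proto, flip=0.2):
    """Toy stand-in for an encoded instance."""
    return [-v if rng.random() < flip else v for v in proto]

# Toy data over labels A, B, C: four observed combinations, eight samples each.
combos = [frozenset(), frozenset("A"), frozenset("AB"), frozenset("BC")]
protos = {c: [rng.choice((-1, 1)) for _ in range(D)] for c in combos}
train = [(noisy_copy(protos[c]), c) for c in combos for _ in range(8)]

centroids = train_powerset([h for h, _ in train], [c for _, c in train])
predicted = predict_powerset(noisy_copy(protos[frozenset("AB")]), centroids)
predicted_empty = predict_powerset(noisy_copy(protos[frozenset()]), centroids)
print(sorted(predicted), sorted(predicted_empty))
```

Keying the centroids by `frozenset` makes each label combination a single class, which is exactly what lets a standard nearest-centroid classifier handle multi-label data.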

Compute Realization Cost of PowerSetHD:
To estimate the storage size of the HDC model, we need to calculate the total number of hypervectors required to represent all possible label combinations, and then multiply that by the size of each hypervector. The number of possible label combinations for a dataset with $L$ labels is $2^L$, since each label can either be present or absent in a given combination. Consider the Delicious dataset, where the number of labels is $L = 983$; the number of possible label combinations is then $2^{983}$. Representing each hypervector element as a 16-bit integer requires 2 bytes per element. Since each hypervector has 1024 elements, the size of each hypervector is 2 × 1024 = 2048 bytes. Multiplying the number of hypervectors by the size of each hypervector gives the total storage required for the HDC model: $2^{983}$ hypervectors × 2048 bytes/hypervector $= 2^{994}$ bytes, which is approximately $1.7 \times 10^{287}$ terabytes. This is an enormous amount of storage, far beyond what is currently feasible with modern computing technology. It highlights the scalability issues of the label powerset method, which becomes impractical for problems with large numbers of class labels. In addition to these high RAM requirements, obtaining the full ranking would require evaluating $2^{983}$ classifiers, which would take an impractical number of CPU cycles for even a single data point.
B. ONE-VS-ALL (OvA) HD
One-vs-all is a problem transformation method used in multi-label classification where the problem is transformed into multiple binary classification problems. In this method, a separate binary classifier is trained for each label, where each classifier predicts whether the instance belongs to the corresponding label or not. Formally, given an instance $x$ with $L$ possible class labels, the one-vs-all method creates $L$ separate binary class vectors $y_1, y_2, \ldots, y_L$ of length 2 that represent the presence or absence of each class label. For each binary classification problem $i$, a binary label is assigned as follows: $y_i = 1$ if label $i$ is present in $y_j$, and $y_i = 0$ otherwise, where $y_j$ is the label vector of the instance $x$. In other words, $y_i$ indicates whether the instance belongs to the $i$-th class or not. The $i$-th classifier is trained using the binary class vector $y_i$ as the target variable, and the output of the $i$-th classifier is interpreted as the probability of the instance belonging to the $i$-th class.
The one-vs-all method is computationally efficient and scales well with the number of class labels, making it suitable for larger-scale multi-label classification problems. However, it suffers from the issue of label correlation: it treats each label independently, ignoring any correlations that may exist between labels.
The OvA HD approach creates two centroid hypervectors for each label, resulting in $2L$ hypervectors. The two hypervectors for each label denote the positive and negative associations of that label. Together, the pair of hypervectors represents the binary classifier for a single label. During inference, we encode the test instance and evaluate it using our $L$ binary classifiers, each of which predicts the relevance of its corresponding label. The predictions of all classifiers are then combined to give the final inferred label vector.
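A minimal sketch of this scheme follows, under stated assumptions: the encoder (`make_encoder`), the cosine similarity, the rule "predict a label when the instance is closer to its positive centroid than to its negative one", and the toy data are illustrative choices, not details confirmed by the paper.

```python
import random

def make_encoder(num_features, dim, seed=0):
    """Hypothetical encoder: sign of a fixed Gaussian random projection."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0.0, 1.0) for _ in range(num_features)] for _ in range(dim)]

    def encode(x):
        return [1 if sum(w * v for w, v in zip(row, x)) >= 0 else -1 for row in proj]

    return encode

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = (sum(p * p for p in a) * sum(q * q for q in b)) ** 0.5
    return dot / norm if norm else 0.0

def train_ova(samples, label_names, encode, dim):
    """Keep a positive and a negative centroid per label (2L hypervectors)."""
    pos = {name: [0] * dim for name in label_names}
    neg = {name: [0] * dim for name in label_names}
    for features, present in samples:
        hv = encode(features)
        for name in label_names:
            target = pos[name] if name in present else neg[name]
            for i, value in enumerate(hv):
                target[i] += value
    return pos, neg

def predict_ova(features, pos, neg, encode):
    """Run the L binary classifiers and collect the labels judged relevant."""
    hv = encode(features)
    return {name for name in pos if cosine(hv, pos[name]) > cosine(hv, neg[name])}

samples = [
    ([1.0, 1.0, 0.0, 0.0], {"a"}),
    ([0.0, 0.0, 1.0, 1.0], {"b"}),
    ([1.0, 1.0, 1.0, 1.0], {"a", "b"}),
]
dim = 512
encode = make_encoder(num_features=4, dim=dim)
pos, neg = train_ova(samples, ["a", "b"], encode, dim)
only_a = predict_ova([1.0, 0.9, 0.1, 0.0], pos, neg, encode)
both = predict_ova([1.0, 1.0, 0.9, 1.0], pos, neg, encode)
print(sorted(only_a), sorted(both))
```

The inner loop over `label_names` makes the linear dependence on the number of labels explicit, both in storage and in per-instance work.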
Compute Realization Cost of OvA HD:
For the OvA HD approach, we need to create two centroid hypervectors for each label, resulting in $2L$ hypervectors. For the example of the Delicious dataset, there are 983 labels, so we need to create 1966 hypervectors in total. For a hypervector dimensionality of 1024, each hypervector will require 256 bytes of storage. Therefore, the total storage required for loading the HDC model is 1966 hypervectors × 256 bytes, or 491.5 kilobytes, which is vastly smaller than the PowerSet HD requirement.
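For reference, the arithmetic behind this figure can be written out as follows. The 256 bytes per hypervector is taken from the text as given; reading a kilobyte as 1024 bytes is an inference from the stated total rather than something the text specifies.

```latex
2L \times 256\ \text{B}
  = 2 \times 983 \times 256\ \text{B}
  = 1966 \times 256\ \text{B}
  = 503{,}296\ \text{B}
  = 491.5 \times 1024\ \text{B}
```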
The complexity analysis for classifying a single data point using HDC depends on the number of labels and the dimensionality of the hypervectors. Since we are using the OvA HD approach with 983 labels and a hypervector dimensionality of 1024, the time complexity for classifying a single data point can be expressed as $O(LD)$, where $L$ is the number of labels and $D$ is the hypervector dimensionality. In practice, the cost may be higher due to the need to compute distances between the test instance and all hypervectors, as well as the need to combine the predictions of all binary classifiers. However, the OvA HD approach is computationally efficient and scales well with the number of class labels, making it more suitable than PowerSet HD for larger-scale multi-label classification problems.

V. TinyXML HD: EXTREME MULTI-LABEL CLASSIFICATION
Extreme multi-label classification (XMLC) [71], [72] represents a challenging variant of multi-label classification, where the task involves predicting a large number of labels for each instance in a dataset. The scale of the label space in XMLC can range from thousands to millions, making it extremely challenging for traditional multi-label classifiers to handle efficiently. This presents significant scalability and computational challenges, particularly compared to small multi-label classification problems, such as those described in the previous section, where the label set is relatively small. One specific variant of XMLC that has gained traction in various real-world applications, such as text categorization [40] and recommendation systems [73], is Extreme Multi-Label Text Classification (XMTC) [52]. The goal of XMTC is to classify documents into a potentially large number of labels. In the rest of this paper, we describe TinyXML HD, which solves the XMTC problem by leveraging hyperdimensional representations.

A. HYPERVECTORS FOR TEXTUAL DATA
The XMTC datasets offer text data in two forms: bag-of-words representation or raw text. Bag-of-words (BoW) is a widely used text representation approach in natural language processing (NLP) that represents a document as a collection of words with the frequency of their occurrences, disregarding the order of the words. In TinyXML HD, if raw-text data is available, we leverage Word2Vec [74], [75] embeddings for representing text; otherwise, we use bag-of-words. The TinyXML HD BoW encoding projects the BoW feature vector into a hypervector using Random Projection Encoding [24], as described in Section III.
Raw text data poses a challenge. A simple and meaningful strategy is to consider the compositional distributional semantics approach. Compositional distributional semantics is a method of representing the meaning of a sentence as a function of the meanings of its constituent words. This approach is based on the distributional hypothesis, which posits that words that appear in similar contexts tend to have similar meanings [76]. Given a sentence $S$ consisting of $n$ words, represented as $d$-dimensional vectors $w_1, w_2, \ldots, w_n$, we can combine these vectors using a composition function $f$ to obtain a sentence vector $s$: $s = f(w_1, w_2, \ldots, w_n)$. The composition function $f$ takes the word vectors as input and returns a single vector representing the meaning of the sentence. There are various ways to define the composition function, such as averaging the word vectors or concatenating them [77], [78].
One approach for representing a sentence as a composition of words is to assign random symbolic hypervectors to each word in the dataset and then use compositional distributional semantics to obtain a sentence vector. Previous studies [1], [10] have employed this approach with varying degrees of success in various NLP tasks. However, a key issue with this approach is that it ignores the structural relationships between words. Models like Word2Vec [74], [75] address such issues by generating vector representations that capture semantic relationships between words in a meaningful way.

1) Word2Vec EMBEDDINGS FOR TinyXML HD
We leverage Word2Vec with hyperdimensional encoding for learning in TinyXML HD. Word2Vec is a powerful method for creating distributed vector representations of words that capture semantic and syntactic aspects of natural language [74], [75]. Traditional approaches to representing words rely heavily on sparse one-hot vector representations, which are high-dimensional and lack the ability to capture the subtle nuances of word meanings. In contrast, Word2Vec's distributed vector representations encode semantic relationships between words by placing words with similar meanings closer together in the vector space [74], [75].
To encode an instance in TinyXML HD, we first obtain the Word2Vec embeddings of all its constituent words. We then project these embeddings into hypervectors using the Random Projection Encoding described in Section III. Finally, to employ compositional distributional semantics, we bundle (⊕) the resultant hypervectors into a single hypervector that represents the instance. As mentioned in Section III, Random Projection Encoding preserves the Euclidean distance, which enables the generation of symbolic hypervectors that capture semantic information between words through their cosine similarity scores, resulting in expressive high-dimensional representations. Consequently, hypervectors for similar words will be proportionally similar, and those of semantically dissimilar words will be dissimilar. In this way, we are able to capture the complex relationships between words and obtain a richer representation that conveys more information about their semantic context.
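The three steps above (embed, project, bundle) can be illustrated as follows. This is a sketch under assumptions: the three-dimensional `toy_embeddings` stand in for real Word2Vec vectors, the projection is a Gaussian random matrix followed by a sign function, and bundling is an element-wise sum; the exact form of the encoding in Section III may differ.

```python
import random

def random_projection(in_dim, out_dim, seed=0):
    """Fixed Gaussian projection matrix (out_dim rows, in_dim columns)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def project(matrix, vec):
    """Map an embedding to a bipolar hypervector via sign(matrix @ vec)."""
    return [1 if sum(w * v for w, v in zip(row, vec)) >= 0 else -1 for row in matrix]

def bundle(hypervectors):
    """Bundling as element-wise addition of hypervectors."""
    return [sum(column) for column in zip(*hypervectors)]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = (sum(p * p for p in a) * sum(q * q for q in b)) ** 0.5
    return dot / norm if norm else 0.0

# Hypothetical stand-ins for Word2Vec embeddings.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
matrix = random_projection(in_dim=3, out_dim=1024)
word_hv = {word: project(matrix, emb) for word, emb in toy_embeddings.items()}
sentence_hv = bundle([word_hv["cat"], word_hv["kitten"]])

similar = cosine(word_hv["cat"], word_hv["kitten"])
dissimilar = cosine(word_hv["cat"], word_hv["car"])
print(f"cat~kitten: {similar:.2f}, cat~car: {dissimilar:.2f}")
```

Words whose embeddings point in nearly the same direction keep a high cosine similarity after projection, while nearly orthogonal embeddings map to nearly orthogonal hypervectors.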

2) TinyXML HD LABEL REPRESENTATION
We leverage HDC algebra to map and combine multiple labels in hyperdimensional space. Let $L$ be the number of labels or symbols in the dataset, and $\mathcal{H}^D$ be a $D$-dimensional hyperdimensional space with $\mathcal{H} = \{+1, -1\}$ for the MAP HDC model. We map each label to a hypervector in this space. The initialization of the label space $Y_{1 \ldots L}$ involves assigning a random hypervector from a Binomial distribution to each label. Specifically, we initialize each label $Y_i$ by sampling from $B(0.5) \cdot 2 - 1$. To obtain a high-dimensional representation of the labels of an instance, we take the hypervectors of the labels present for that instance, denoted by $Y_p$. We then combine these hypervectors using the hyperdimensional bundling operator $\oplus$ to obtain a single hypervector representation for the instance $x_i$, given by Eq. (V-A2). This operator is commutative, which allows us to bundle the hypervectors in any order without affecting the final result.
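A small sketch of this label representation is given below. The dimensions, the label indices, and the retrieval step (ranking labels by cosine similarity to the bundle) are assumptions made for illustration; the text above only specifies the random bipolar initialization and the bundling.

```python
import random

def random_bipolar(dim, rng):
    """Sample 2 * Bernoulli(0.5) - 1 per coordinate, giving entries in {-1, +1}."""
    return [2 * rng.randint(0, 1) - 1 for _ in range(dim)]

def bundle(hypervectors):
    """Bundling as element-wise addition; the order of the inputs is irrelevant."""
    return [sum(column) for column in zip(*hypervectors)]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = (sum(p * p for p in a) * sum(q * q for q in b)) ** 0.5
    return dot / norm if norm else 0.0

rng = random.Random(0)
D, L = 1024, 50                      # illustrative sizes
label_hv = [random_bipolar(D, rng) for _ in range(L)]

present = [3, 17, 42]                # hypothetical labels of one instance
target = bundle([label_hv[i] for i in present])
reordered = bundle([label_hv[i] for i in reversed(present)])

# Labels contained in the bundle stay similar to it; the others are near-orthogonal.
scores = [cosine(target, hv) for hv in label_hv]
top3 = sorted(range(L), key=lambda i: scores[i], reverse=True)[:3]
print(sorted(top3))
```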

B. LEARNING WITH TinyXML HD
With both the inputs and outputs embedded in the same high-dimensional space, the next step is to learn a mapping $f : x_i \in \mathcal{H} \rightarrow y_i \in \mathcal{H}$, where $x_i$ is the input hypervector and $f$ outputs the hypervector that represents all the labels present for that instance. In this section, we present our proposed neural-network-based approach to learn this mapping function $f$. Our approach uses a linear formulation over the HDC operators of binding and bundling, which allows for effective and efficient optimization using gradient-based methods.

1) OBJECTIVE FORMULATION
We decompose the original learning problem into multiple sub-problems to enhance its learnability. Consider an example where we break down the learning problem into two sub-problems, which we rewrite as $f(x_i) = f_2(f_1(x_i))$ (5), where $f_1$ maps the input instance to an intermediate hypervector $H_x$, and $f_2$ maps $H_x$ to the output label hypervector $H_2$. We define $f_1$ and $f_2$ using the HDC arithmetic operations of binding and bundling. In particular, $f_1$ is parameterized by a hypervector $H_{conv}$ that is to be learned, and $f_2$ is defined similarly. This approach considers the mapping between two hypervectors as a series of geometric transformations where the input hypervector is bound sequentially with the intermediate hypervectors induced by the sub-problems. The number of sub-problems to induce and the dimensions of the learned hypervectors are hyperparameters chosen based on the complexity of the dataset.

2) 1D CONVOLUTIONS AS HYPERVECTOR OPERATORS
In developing a neural architecture, it is crucial to adhere to the principles of hyperdimensional computing (HDC), which dictate that representations should be distributed, with individual coordinates devoid of semantic information. Consequently, our neural architecture should interpret inputs as distributed representations rather than feature vectors containing discrete semantic entities. To achieve this, we use one-dimensional convolutional operators as hypervector operators. A one-dimensional convolution applies a filter across an input, using a single set of weights to process the entire hypervector. This operation treats each vector region independently with the filter, synthesizing a vector that encapsulates information from the input. In contrast, a fully connected (FC) network uses a dense set of connections to process all coordinates of the input vector, treating the coordinates as dependent entities, which contravenes the principles of HDC representations.
Figure 1 details the architecture and operations of our proposed ConvHD block. The block is parameterized by $C$, the expansion factor, and $F$, the filter size. It consists of three convolutional layers: the first layer ($X_1$) takes input hypervectors and generates $C$ hypervectors, the second layer ($X_2$) processes these $C$ hypervectors to produce $C/2$ hypervectors, and the third layer ($X_3$) combines these $C/2$ hypervectors into a single hypervector. The ConvHD block is thus the composition of these three convolutional units, and a model $f$ using two ConvHD blocks is the composition of two such blocks. We formulate the sub-problems as a single learning problem, where we optimize the parameters $X_1$, $X_2$, $X_3$ using gradient-based methods.
To enhance our architecture, we incorporate dilated convolutions [79] to increase the receptive field of the convolutional layers [80]. We also set the filter size $F$ to be large, approximately a quarter of the hypervector dimensionality $D$. These details are crucial, as they increase the effective receptive field with every 1-D convolution operation, that is, the number of hypervector coordinates in the input that influence the synthesis of a single coordinate in the output of the last convolution layer. By using a large filter size, we increase the number of coordinates in the input hypervector that are considered to produce a single coordinate in the resultant hypervector. Similarly, dilation increases the receptive field by allowing deeper layers to infer coordinates based on a larger area of the input hypervector. Since the ConvHD operator uses three 1-D convolutional layers, the receptive field increases progressively, with each layer looking at a larger section of the input hypervector to make a decision. The expansion factor $C$ spawns more sub-problems in parallel. For instance, in the above example of breaking down the learning problem into two parts, if we set $C = 2$, then each layer estimates two sets of sub-problems. The first layer solves two sub-problems similar to Equation 8 in parallel, and the second layer likewise solves two sub-problems similar to Equation 9. The results of the two sub-problems are then combined through the bundling operation.
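The effect of filter size and dilation on the receptive field can be made concrete with the standard formula for stacked stride-1 convolutions, where each layer adds (filter size − 1) × dilation coordinates. The sketch below is illustrative only: the function name is hypothetical, stride 1 is assumed, and reusing the same filter size and dilation (the values 255 and 7 reported in the architecture specifics) in all three layers of a block is an assumption.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 1-D convolutions.

    `layers` is a list of (filter_size, dilation) pairs, input side first.
    Each layer widens the field by (filter_size - 1) * dilation coordinates.
    """
    field = 1
    for filter_size, dilation in layers:
        field += (filter_size - 1) * dilation
    return field

# Small illustrative stack: the field grows with every layer.
growth = [receptive_field([(5, 2)] * depth) for depth in (1, 2, 3)]

# Assumed reading of the reported setting: F = 255, dilation 7, three layers.
reported = receptive_field([(255, 7)] * 3)
D = 1024
print(growth, reported, min(reported, D))
```

Comparing the result with $D$ indicates how much of the input hypervector can influence a single output coordinate under these assumptions.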
In order to learn the mapping to solve these sub-problems, we use the loss function detailed in [18], which aims to minimize the cosine distance between the predicted hypervector and the ground-truth label hypervector.

VI. EVALUATION OF OvA, POWERSET AND TinyXML HD
This section presents the results of our proposed multi-label classification approaches on various real-world datasets. Through a series of experiments, we demonstrate the trade-off between compute efficiency and accuracy of our approaches across a range of complexity levels, from small-scale (fewer than 20 labels) to extreme-scale (more than 5000 labels). Our findings indicate that, in low-complexity scenarios with datasets of low cardinality, the One-vs-All HDC approach achieves high accuracy and efficiency. Conversely, the PowerSet HDC approach provides poor trade-offs, yielding benefits only when the label cardinality is very low, with efficiency degrading exponentially as complexity increases.
We evaluate the effectiveness of our proposed approach, TinyXML HD, on extreme-scale datasets. Our experiments demonstrate that TinyXML HD produces models that are 231x-836x smaller than state-of-the-art models while still achieving reasonable accuracy. Furthermore, our approach can efficiently train on large text datasets in just a few hours, providing a speedup of up to 16x. These results highlight the potential of our proposed approach for solving extreme-scale multi-label classification problems while greatly reducing the computational resources required.
We evaluate the small scale problems on an Intel Xeon 24-core CPU while for the larger datasets we use a single Nvidia V100 GPU.

A. EVALUATION OF OvA AND POWERSET HD
1) OvA AND POWERSET HD EXPERIMENTAL SETUP
We tested our OvA HD and PowerSet HD multi-label methods on smaller datasets using an optimized C++ implementation running on an Intel Xeon 24-core CPU. We compare our HDC-based methods with multi-label versions of k-nearest neighbors (kNN) [47], Sequential Minimal Optimization (SMO) [81], C4.5 [82], and Naive Bayes (NB) [47], all of which are appropriate for smaller datasets. We used the Java-based open-source Mulan [83] multi-label package with three small datasets for comparison. Genbase [84] contains protein classes of the 27 most important protein families, with 662 samples, each with 1186 attributes. Scene [85] contains images with their characteristics and classes. One image can belong to up to 6 categories. It has 2407 samples, each with 294 attributes. Yeast [86] has information about a set of yeast cells. The task is to determine the localization site of each cell among 14 possible sites. It has 2417 samples, each with 103 attributes.
2) OvA AND POWERSET HD ACCURACY
Figure 2 shows that OvA and PowerSet HD achieve comparable accuracy to state-of-the-art multi-label classifiers. PowerSet HD consistently outperforms state-of-the-art methods on all three datasets. OvA HD is slightly less accurate on the Genbase dataset but performs better on the Scene and Yeast datasets, likely due to their better separability in HD space compared to low-dimensional space.

3) OvA AND POWERSET HD PERFORMANCE AND EFFICIENCY
While PowerSet HD achieves higher accuracy, Figure 2 demonstrates that this comes at a significant cost in terms of execution time. This is due to the exponential increase in class hypervectors, as discussed earlier. Power Set HD is 24 times faster than state-of-the-art multi-label classifiers on average and approximately two times slower than OvA HD, but offers 13% higher accuracy. For small datasets, where only a small subset of possible label combinations appears in the dataset, PowerSet HD can potentially be more efficient and accurate. However, for datasets with a larger number of possible label combinations, OvA HD is the clear choice, as it offers a better trade-off between compute efficiency and accuracy than PowerSet HD. These results indicate that the OvA HD approach is an ideal candidate for small-scale multi-label classification tasks.

B. EXPERIMENTAL SETUP FOR TinyXML HD
We evaluated TinyXML HD on real-world, large-scale datasets from Extreme Multi-Label Text Classification (XMTC). Our objective is to maximize the compute efficiency of learning while achieving precision comparable to the state of the art. For the XMTC datasets, we evaluate our proposed TinyXML HD on an Nvidia V100 GPU.

1) EVALUATION METRICS
We consider Precision@k with k = 1, 3, 5 as our metric for evaluating the performance of TinyXML HD on multi-label classification, where k represents the top k predictions. This is a widely accepted evaluation metric in the literature [59], [60]. In addition, we evaluate the computational efficiency of TinyXML HD against the following state-of-the-art models: XT [87], Bonsai [88], SLEEC [49] and Parabel [89] for BoW datasets. For raw text datasets, we consider these SoA models: AttentionXML [61], LightXML [60] and X-Transformer [59]. Given that previous research has not given a comprehensive account of the compute cost associated with these models, it is difficult to establish a standardized metric for comparison. To address this issue, we consider two distinct metrics: the count of trainable parameters and the training time. The former serves as an indicator of the cost of training, since a model with a higher parameter count requires more gradients to be calculated and optimized, and is also indicative of greater model size. The latter is a direct measure of the time required to train the model. These two metrics offer a meaningful evaluation of compute cost in the context of real-world applications.
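For concreteness, Precision@k for a single instance can be computed as in the sketch below. The function name, scores, and relevant label set are made up for the example; in practice the value is averaged over all test instances.

```python
def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring labels that are relevant."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in ranked[:k] if i in relevant) / k

# Hypothetical per-label scores for one instance with five candidate labels.
scores = [0.10, 0.90, 0.30, 0.80, 0.05]
relevant = {1, 2}                     # ground-truth label indices

p_at = {k: precision_at_k(scores, relevant, k) for k in (1, 3, 5)}
print(p_at)
```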

2) DATASETS
In order to evaluate the expressiveness of our high-dimensional representations of text data, we select six datasets from the Extreme Multi-Label Text Classification (XMTC) benchmark, which is widely accepted in the literature [18], [52], [71], [96]. The datasets are described in Table 1. In addition to the scalability and computational challenges, the XMTC datasets pose the further challenge of label sparsity. Bhatia et al. [71] divided the datasets according to the number of labels per sample into small scale and large scale. Small-scale datasets contain at most 5000 labels. Although pre-processed BoW features are available for all datasets, the original text is not. Consequently, we use the original text when available and BoW for all others.

3) TinyXML HD ARCHITECTURE SPECIFICS
We use Random Projection Encoding as discussed in Sec. III for BoW feature representations, while for raw text datasets, we use the combination of Random Projection Encoding with Word2Vec as described in Sec. V. We employ ConvHD blocks with expansion factor $C = 128$, filter size $F = 255$ with dilation set to 7, and a hypervector dimensionality of 1024. To optimize the model, we use the loss function proposed by Ganesan et al. [18], but we remove the negative loss component, which was intended to ensure that the output hypervector from the model $f(\cdot)$ is orthogonal to the labels that are not present for that instance. Since all labels are initialized with random hypervectors that are orthogonal to each other, enforcing the similarity to the present labels alone will automatically satisfy the orthogonality condition with the labels not present. Therefore, we only retain the positive component in the loss function, and we discard the additional positive $p$ vector used in [18], as it does not improve results. The final loss function is as follows: $\mathcal{L} = \sum_{c_p \in Y_p} \left(1 - \cos(y_i, c_p)\right)$, where $y_i$ is the final hypervector output by our model $f(x_i)$ for the $i$-th instance, and $c_p$ represents a present label. The loss function aims to minimize the cosine distance between $y_i$ and all present labels, thereby encouraging the model to produce a hypervector that is more similar to the labels that are present in the instance.
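The positive-only loss can be evaluated directly from its definition. The sketch below computes only the forward value for one instance, with hand-picked four-dimensional hypervectors as an illustration; gradient computation and the model itself are outside its scope, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return dot / norm if norm else 0.0

def positive_cosine_loss(output_hv, present_label_hvs):
    """Sum over present labels of the cosine distance to the model output."""
    return sum(1.0 - cosine(output_hv, c) for c in present_label_hvs)

# Two mutually orthogonal label hypervectors (toy example).
c1 = [1, 1, 1, 1]
c2 = [1, -1, 1, -1]

exact_single = positive_cosine_loss(c1, [c1])           # output equals the only label
bundled = positive_cosine_loss([2, 0, 2, 0], [c1, c2])  # output equals c1 + c2
print(exact_single, bundled)
```

When the output is the bundle of two orthogonal labels, each cosine equals $1/\sqrt{2}$, so the loss cannot reach zero; it measures similarity to every present label rather than an exact match to any one of them.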

4) COMPARISON BASELINES
Our evaluation comprises two parts, with datasets divided by the type of features used. We consider different baselines for each part. For BoW datasets, we benchmark against other state-of-the-art models that use the same features, such as Bonsai [88], Parabel [89], and PFastreXML [97]. For raw-text datasets, we compare TinyXML HD's performance against state-of-the-art deep learning approaches, including AttentionXML [61], X-Transformer [59], and LightXML [60]. These deep learning models employ powerful architectures like transformers, with hundreds of millions of parameters, enabling them to extract highly expressive embeddings from text data. As a result, TinyXML HD is inherently disadvantaged due to the significant disparity in parameter count. The primary objective of this research is to optimize the size, speed and accuracy trade-offs of such constrained HDC models to evaluate their viability as a lightweight paradigm. Hence, we aim to achieve reasonable accuracy with respect to the state of the art, within a 10% margin.

C. EVALUATION OF TinyXML HD
1) TinyXML HD, POWERSET & OvA HD COMPARISON
To gain insight into the trade-off between performance and accuracy, we evaluated TinyXML HD on small-scale multi-label classification tasks. In this study, we compare the performance of TinyXML HD to that of PowerSet and OvA HD on three small datasets, as described in Section VI-A. Given the lower complexity of the task at hand, we scaled down TinyXML HD by using a depth of 1, a block size of 8, and a filter size of 255. As the datasets used have low cardinality, we evaluated our approaches using overall accuracy, since the precision@K metrics are inapplicable for K = 3, 5 due to the limited number of labels per instance. In addition, considering the low complexity of the task, we evaluate performance only on CPU and do not use any specialized hardware for acceleration.
Our results in Table 2 show that TinyXML HD achieves 100% accuracy on Genbase [84], whereas the performance drops by 8% on Scene [85] and 3% on Yeast [86]. The most likely reason for the lower accuracy on the two datasets is the scarcity of training data. Problem transformation techniques were trained faster than TinyXML HD, despite the latter's smaller size. The only exception to this was the Yeast [86] dataset, on which TinyXML HD was significantly faster (1.2x over OvA HD and 7.4x over Power Set HD). This is due to the disparity in label cardinality across the datasets. Genbase [84] and Scene [85] have label cardinalities of 1.25 and 1.07, respectively, meaning that only a single centroid vector needs to be updated for Power Set HD and OvA HD. However, the Yeast [86] dataset has a label cardinality of 4.2, requiring OvA HD to update 4 centroid hypervectors while Power Set HD needs to update $2^4 = 16$ centroid hypervectors for each instance, resulting in increased training time.
The higher training time of TinyXML HD can be attributed to the convolution operations, which are computationally intensive compared to the HDC operations of bundling, binding, and similarity check used by the transformation methods. These operations reduce to simple element-wise additions and multiplications, making them easier to compute. Consequently, Power Set HD and OvA HD can be easily parallelized and accelerated in hardware, while the convolution operation of TinyXML HD would be harder to accelerate.
The dissimilarity in parameter count between the models is attributed to their respective architectures. The parameter count for PowerSet HD increases exponentially with the size of the label set. Conversely, the parameter count for OvA HD scales linearly with the label set size, resulting in a parameter count that is twice the size of the label set for our implementation of the one-vs-all classifier. In contrast, TinyXML HD uses only one hypervector to represent each label, with the additional parameters solely corresponding to convolution filters. These parameters are minimal in comparison to the label set size, further underscoring the efficiency of the TinyXML HD architecture.
This study has revealed that the HDC-based problem transformation approaches offer a significantly better trade-off between training time and accuracy for small-scale multi-label classification tasks than TinyXML HD. Specifically, for datasets where only a limited subset of possible label combinations appears in the dataset, PowerSet HD has the potential to be both more efficient and more accurate. In contrast, for datasets with a larger number of possible label combinations, albeit fewer than at the extreme scale, OvA HD proves to be a more promising candidate. While TinyXML HD has a smaller parameter count, the parameter count of OvA HD remains comparable and is sufficiently small for the complexity scale under investigation.
Due to the exponential scaling of PowerSet HD and the linear scaling of OvA HD, these methods are unsuitable for extreme-scale multi-label classification tasks. PowerSet HD is too large to implement, while OvA HD takes too long to train. We next evaluate TinyXML HD on extreme-scale multi-label classification.

2) TINYXML HD MULTI-LABEL ACCURACY
We next investigate the performance of TinyXML HD on extreme-scale datasets. Table 3 presents the performance of TinyXML HD along with its respective baselines on the BoW datasets. Our findings reveal that for Mediamill and Wiki10-31K BoW, TinyXML HD's precision at top one (p@1) is within 5% of the state of the art (SoA). However, we note that, barring Mediamill, precision at top three (p@3) and precision at top five (p@5) are lower across all datasets when compared to the SoA.
Table 4 shows the performance of TinyXML HD and its respective baselines on the raw text datasets. Our findings reveal that TinyXML HD's precision is relatively lower for these datasets. While we observe that TinyXML HD achieves comparable performance on the Wiki10-31K dataset, the precision drops for Amazon-13k by 9%. As we attempt to retrieve more labels from the hypervector, the retrieval becomes less robust. Considering that Wiki10-31K has 31,000 labels with only 8 samples per label available, the performance of TinyXML HD (83%) is remarkable. While the performance of TinyXML HD is not extraordinary compared to the state of the art, it is remarkable considering that TinyXML HD relies on a simple encoding scheme. In contrast, the state-of-the-art models employ highly complex architectures with millions of parameters. This observation validates the potential of HDC to provide expressive representations of data at extraordinarily low compute costs. Moreover, unlike other models where the model size scales almost linearly with label size, TinyXML HD ensures that the output size of the model is fixed to the dimensionality of the hypervector $D$, independent of label set size.
3) TINYXML HD CONVERGENCE
Figure 3 shows the convergence plots of TinyXML HD on three distinct datasets: Wiki10-31K (BoW) [94], Delicious [92], and Wiki10-31K (Text) [94]. When examining the BoW datasets, namely Wiki10-31K [94] and Delicious [92], a notable trend emerges. The loss function exhibits a smooth decrease, punctuated by a slight, yet discernible, initial drop for Wiki10-31K. In stark contrast, the Wiki10-31K (Text) variant converges in fewer than 30 epochs and seemingly starts to overfit; however, the precision plots provide further insight into this phenomenon. While P@1 and P@3 seem to have converged, a closer analysis reveals that P@5 continues to improve. This observation suggests that the model is still assimilating new information from the data. Although unable to enhance P@1 and P@3 further, the model's focus shifts to effectively ranking two additional labels.
TinyXML HD ROC and AUC: For the readers' benefit, we also provide additional insights in the form of Receiver Operating Characteristic (ROC) curves in Fig. 4a and the corresponding Area Under the Curve (AUC) values in Fig. 4b. Although these metrics are typically employed for evaluating multi-class classifiers, considering mutually independent labels, we adapt them to our multi-label setting by treating each label as an independent binary classification problem. Due to computational constraints, we focus on the Delicious dataset for these metrics, as it contains a large number of labels. To keep the plot readable, we present the ROC curves for 10 randomly selected labels. Furthermore, to put the AUC in context, we visualize the distribution of AUC values obtained for each of the dataset's 983 labels. These results show the expected accuracy when treating the multi-label problem as a set of independent binary classification problems.
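Treating each label as its own binary problem, a per-label AUC can be computed from the scores assigned to positive and negative instances. The sketch below uses the pairwise (Mann-Whitney) formulation of AUC on made-up scores; it illustrates the metric and is not the evaluation code used for Fig. 4, and the function name is hypothetical.

```python
def roc_auc(scores, targets):
    """AUC for one label: probability that a positive outranks a negative.

    Ties count as half a win. Returns None when a label has no positive
    or no negative instances, since the AUC is undefined in that case.
    """
    positives = [s for s, t in zip(scores, targets) if t == 1]
    negatives = [s for s, t in zip(scores, targets) if t == 0]
    if not positives or not negatives:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical similarity scores of four instances for a single label.
scores = [0.9, 0.8, 0.3, 0.1]
targets = [1, 0, 1, 0]
auc = roc_auc(scores, targets)
print(auc)
```

Repeating this for every label yields the distribution of AUC values described above.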
TinyXML HD Overfitting: To address the issue of overfitting, we explored the use of BatchNorm-2D [102], L1/L2 regularization, and Dropout [103]. However, none of these approaches proved to be effective. Additionally, we conducted experiments to further reduce the parameter count, but since the model already consisted of only 2.5M parameters, any additional reduction led to a decrease in accuracy.
TinyXML HD Robustness: To briefly examine the ability of the method to perform when a few labels are missing, for every sample we dropped one label with probability $p$. Our experiments show that up to $p = 0.2$ there is no accuracy degradation, showing the robustness of TinyXML HD.
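The label-dropping procedure can be sketched as below. The details are assumptions: the dropped label is chosen uniformly at random, and an instance with a single label may end up with none; the text does not specify either point, and the function name is hypothetical.

```python
import random

def drop_one_label(labels, p, rng):
    """With probability p, remove one uniformly chosen label from the set."""
    kept = set(labels)
    if kept and rng.random() < p:
        kept.remove(rng.choice(sorted(kept)))
    return kept

rng = random.Random(0)
original = {4, 8, 15, 16}             # hypothetical label set of one sample
noisy_sets = [drop_one_label(original, 0.2, rng) for _ in range(1000)]
dropped_fraction = sum(len(s) < len(original) for s in noisy_sets) / len(noisy_sets)
print(f"fraction of samples with a dropped label: {dropped_fraction:.2f}")
```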

4) TinyXML HD COMPUTING EFFICIENCY
We compare the efficiency of TinyXML HD to the following state-of-the-art (SoA) models: XT [87], Bonsai [88], SLEEC [49], and Parabel [89]. Table 6 compares the training time and model size of TinyXML HD against these models on the Wiki10-31K BoW dataset. TinyXML HD trains in 10 minutes with a model size of only 19.8 MB while achieving comparable precision on the dataset. It is 6.5x smaller than the smallest SoA model (Bonsai [88]) and 56x smaller than the largest (SLEEC [49]). Methods such as Parabel [89] and Bonsai [88] build multiple probabilistic label trees and perform classification at each node, which quickly becomes computationally expensive. Consequently, TinyXML HD is 1.25x faster than the fastest SoA model (Parabel [89]) and 4x faster than the slowest (Bonsai [88]).
Training time on raw text and the number of parameters needed for the Amazon-670K dataset are shown in Table 5. All the deep learning models require several days to train on this large dataset. In contrast, TinyXML HD trains in only six hours, even though the dataset has over 670K labels and 130K training samples. TinyXML HD provides a speedup of 4x over the fastest SoA model (AttentionXML [61]) and 16x over the slowest (X-Transformer [59]). This speedup can be attributed to two factors. First, the deep learning models rely on complex transformer models such as BERT and RoBERTa to extract highly expressive feature embeddings from the data. In contrast, TinyXML HD employs a simple encoding scheme that decomposes into highly parallelizable operations. Replacing the bulky feature extractor with our lightweight HDC-based encoding demonstrates the expressiveness of these representations when used to encode relevant features. Second, the output dimensionality of deep learning models typically scales with the label set size (L). TinyXML HD instead fixes the output size of the model to the dimensionality of the hypervector (D), where D ≪ L, irrespective of the label size. This reduces the number of trainable parameters, thereby improving training efficiency and reducing the computational load.
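To make the second factor concrete, the following minimal sketch counts the trainable parameters of a single dense output layer when its width is the label set size L versus the hypervector dimensionality D. The hidden width and the value of D are illustrative assumptions and are not settings reported in this paper; L is rounded from the "over 670K labels" figure quoted above, and the helper `dense_layer_params` is a hypothetical name introduced for this example.

```python
def dense_layer_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Count the trainable parameters of one fully connected layer."""
    weights = in_features * out_features
    return weights + (out_features if bias else 0)


# The values below are illustrative assumptions, not settings from the paper.
HIDDEN_WIDTH = 512     # hypothetical width of the last hidden layer
NUM_LABELS = 670_000   # L: rounded from "over 670K labels"
HV_DIM = 4_096         # D: hypothetical hypervector dimensionality, D << L

# Output layer whose width grows with the label set size L.
label_sized_head = dense_layer_params(HIDDEN_WIDTH, NUM_LABELS)
# Output layer whose width is fixed to the hypervector dimensionality D.
hv_sized_head = dense_layer_params(HIDDEN_WIDTH, HV_DIM)

print(f"L-sized output layer: {label_sized_head:,} parameters")
print(f"D-sized output layer: {hv_sized_head:,} parameters")
print(f"reduction factor:     {label_sized_head / hv_sized_head:.1f}x")
```

Under these assumptions the reduction factor of the output layer is exactly L/D, independent of the hidden width, which is why fixing the output size to D keeps the layer from growing with the label space.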
These results demonstrate the strength of HDC with respect to the computational cost of learning. HDC has enormous potential to make learning computations tractable and to dramatically cut training time while retaining good accuracy. TinyXML HD is 836x smaller than the largest SoA model (AttentionXML [61]) and 231x smaller than the smallest (LightXML [60]). Because X-Transformer [59] uses an ensemble of transformers, we suspect that it would be larger than 5 GB and would require 100 hours to train [60], making a direct comparison infeasible.

VII. CONCLUSION
In this work, we have presented novel approaches to multi-label classification using hyperdimensional computing (HDC) that address the entire spectrum of problem complexity. For small-scale multi-label classification, we proposed using HDC to implement two problem transformation methods: the PowerSet transform and the One-vs-All (OvA) transform. Our evaluation showed that OvA HD can provide up to 60x speedup on low-cardinality datasets, while PowerSet HD can be up to 24x faster than SoA with comparable accuracy on datasets where few labels occur together. For the extreme multi-label classification problem, where the label set is very large, we proposed a neuro-symbolic approach, TinyXML HD, that breaks down the learning problem into multiple sub-problems using hyperdimensional arithmetic and then solves these sub-problems with gradient-based optimization. Our results demonstrated that TinyXML HD can dramatically reduce the computational complexity of multi-label learning on large-scale real-world datasets while achieving good accuracy. On text datasets, TinyXML HD is 836x smaller than the largest SoA model, 231x smaller than the smallest, and up to 16x faster to train. Similarly, on BoW datasets, TinyXML HD is 6.5x to 56x smaller than SoA models while training up to 4x faster.
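For readers unfamiliar with the two problem transformation methods named above, the sketch below shows the generic label power set and one-vs-all transforms on a toy label assignment. It illustrates only the textbook transforms, not the HDC classifiers trained on top of them in this paper; the function names and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def powerset_transform(
    label_sets: Sequence[Set[int]],
) -> Tuple[List[int], Dict[FrozenSet[int], int]]:
    """Map every distinct label set to one class id (a single multi-class problem)."""
    class_of: Dict[FrozenSet[int], int] = {}
    targets: List[int] = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_of:
            class_of[key] = len(class_of)  # assign the next unused class id
        targets.append(class_of[key])
    return targets, class_of


def one_vs_all_transform(
    label_sets: Sequence[Set[int]], num_labels: int
) -> List[List[int]]:
    """Build one binary target vector per label (num_labels binary problems)."""
    return [
        [1 if label in labels else 0 for labels in label_sets]
        for label in range(num_labels)
    ]


# Toy data: four samples over three labels (hypothetical).
samples = [{0, 2}, {1}, {0, 2}, {0, 1, 2}]

ps_targets, ps_classes = powerset_transform(samples)
ova_targets = one_vs_all_transform(samples, num_labels=3)

print("power set targets:", ps_targets)
print("distinct label sets:", len(ps_classes))
print("one-vs-all targets:", ova_targets)
```

The power set transform needs one class per distinct label combination, so it stays cheap only when few labels co-occur, whereas the one-vs-all transform always needs one binary problem per label and therefore scales linearly with the label set size.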
Figure 2 also shows that both OvA and PowerSet HD training are significantly faster than most other multi-label classifiers, with OvA HD being 60.8 times faster on average. PowerSet HD is only 3.5 times slower than OvA HD on datasets with a large

FIGURE 3. Loss and precision vs. iteration for three datasets.

TABLE 4. Multi-label classification performance on real text datasets: Comparison with state-of-the-art.

FIGURE 4. (Left) ROC plots for 10 different labels. (Right) Density of AUC values across all labels from the Delicious dataset, considering each label as a binary classification problem.

TABLE 5.

TABLE 2. TinyXML HD on small-scale multi-label classification (normalized to TinyXML HD).

TABLE 3. Multi-label classification performance on BoW datasets: Comparison with state-of-the-art.