DivNet: Efficient Convolutional Neural Network via Multilevel Hierarchical Architecture Design

Designing small and efficient mobile neural networks is difficult because the challenge is to determine the architecture that achieves the best performance under a given limited computational budget. Previous lightweight neural networks rely on a cell module that is repeated in all stacked layers across the network. These approaches do not permit layer diversity, which is critical for achieving strong performance. This paper presents an experimental study to develop an efficient mobile network using a hierarchical architecture. Our proposed mobile network, called Diversity Network (DivNet), performs better, in terms of complexity cost and performance, than the basic architecture with simply stacked layers generally employed by the best high-efficiency models. A set of architectural design decisions is described that reduces the proposed model size while yielding a significant performance improvement. Our experiments on image classification show that, compared to MobileNetV2, SqueezeNet, and ShuffleNetV2, respectively, DivNet improves accuracy by 2.09%, 0.76%, and 0.66% on the CIFAR100 dataset, and by 0.05%, 4.96%, and 1.13% on the CIFAR10 dataset. On more complex datasets, e.g., ImageNet, DivNet achieves 70.65% Top-1 accuracy and 90.23% Top-5 accuracy, still better than other small models such as MobileNet, SqueezeNet, and ShuffleNet.


I. INTRODUCTION
Convolutional neural networks have been widely used in many computer vision applications in recent years, achieving state-of-the-art results in numerous tasks such as image classification [1], object detection [2], video analysis [3], and others [4]. The performance on these different computer vision tasks is dramatically boosted by sophisticated architectures such as AlexNet [5], Network In Network (NIN) [6], VGG-Net [7], Inception Network [8], ResNet [9], and others [10].
Since 2012, the overall trend has been to increase the depth and width of the networks in order to improve performance [11]. However, such improvements often result in a very large number of parameters [9]. Over-parameterized models have high computational complexity, which may be prohibitive, especially with embedded sensors and mobile devices, where computational and power resources are usually limited [12]. For many recent applications, such as Augmented Reality (AR), self-driving cars, and autonomous devices, an important challenge is to handle recognition tasks under computationally restricted resources. On the other hand, real-time classification and search require fast inference models enabling quick and easy data processing. For these applications, in addition to accuracy, computational efficiency and small network sizes are crucial enabling factors. This is leading researchers to re-examine the issue of network complexity and propose more efficient architectures with fewer parameters and convolution operations.
MobileNet is an example of an efficient model with fewer parameters [13]. Specifically, MobileNet uses depthwise separable convolutions [15], [16] to build lightweight neural networks that trade off a reasonable amount of accuracy for reductions in model size and latency. These convolutional filters are the main component of subsequent works focusing on fast and lightweight networks. However, one drawback of these filters is that the expensive pointwise convolutions result in a limited number of feature channels, i.e., feature maps, to meet feature-diversity requirements, which may significantly hinder performance. Additionally, these filters become less efficient in extremely small networks because of the costly dense 1 × 1 convolutions. In [14], a simple design called group-wise convolution is proposed to effectively reduce the number of parameters while allowing sufficient feature diversity. The authors introduce a novel channel shuffle operation to help information flow across feature channels and overcome the side effects brought by group convolutions. MobileNetV2 [17] achieves the best results by using resource-efficient inverted residuals and linear bottlenecks. Recently, [18] proposed a series of practical guidelines for efficient CNN design and achieved state-of-the-art results among mobile networks. Despite all these efforts, the poor performance of these models compared to that of state-of-the-art deep CNNs remains an unresolved issue [19].
In this paper, we propose to overcome some drawbacks of the previously discussed approaches. We introduce a novel CNN model based on a hierarchical architecture, called DivNet. The goal is to train the network to extract relevant features of high diversity that best characterise the complexity of any input image. To achieve this, we consider a hierarchical structure that has been shown to perform better than the standard architecture with simply stacked layers in terms of complexity and performance [7], [8], [41]. Our DivNet model is built on top of a novel cell module, called Diversity Module (DivMod), designed to extract more discriminative and diverse features. This module takes as an input c feature channels that are first filtered with a standard convolutional filter. Features are subsequently split into k groups, each with the same number of channels. One group remains the identity, i.e., unmodified, while the remaining groups go through the same set of convolution operations. This efficient design can reduce the number of learnable parameters by ensuring that each set of convolutions is applied only to the corresponding input channel group. The resulting features are subsequently concatenated, followed by a 1 × 1 convolution. The purpose of this second convolution is to recover the channel group dimensions so they match those of the identity group after a shortcut connection. After that, the k groups are concatenated and the number of channels remains the same. A channel shuffle operation [14] is then used to enable information flow among different groups. The batch normalization (BN) [16] and ReLU nonlinearity [17] are used before the split operator to prevent information collapse.
In summary, the contributions of our research are threefold. First, we propose DivMod, a novel cell module designed to extract more discriminative and diverse features. DivMod has an efficient lightweight design with high discriminative capacity and low computational cost. Second, based on DivMod, we propose DivNet, a hierarchical architecture that allows to construct efficient lightweight CNNs. This CNN model is particularly suitable for mobile designs. Third, we thoroughly evaluate DivNet to highlight its advantages, in terms of performance and computational cost, and compare its classification performance against several state-of-the-art lightweight neural networks. As evidenced in Section IV, the lightweight neural network we propose can offer promising results in terms of performance as well as efficiency.
The remainder of this paper is organized as follows. Section II briefly reviews the most successful works on mobile architectures. The proposed DivNet network is described in detail in Section III. Experimental results on several image classification datasets are given in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK
Most successful works on developing efficient lightweight architectures fall into two main categories: one focuses on obtaining small networks from existing complex architectures, while the other focuses on developing novel lightweight architectures from scratch.

A. SIMPLIFYING COMPLEX ARCHITECTURES
Recent works on CNNs have yielded powerful deep and wide architectures compared to the early proposed deep neural networks, which use the highly parameterized fully connected layers. In this context, [6], [9] propose to use average pooling layers to significantly reduce the number of parameters. Another technique, used by GoogLeNet's inception module [8], consists of adding 1×1 convolutions for dimensionality reduction. [20] proposes to downsample the image at an early stage to reduce the size of the feature channels. Other approaches consist in taking an existing complex CNN model and compressing it in a lossy fashion. Several compression techniques have been introduced in the literature, including those based on parameter quantization [21], binarization [22]–[24], and pruning [25]. For example, [26] reports highly competitive compression ratios on AlexNet [5] and VGG-Net [7], without significantly hindering performance, by pruning those weights with small magnitudes and then retraining the network.
Other approaches to speeding up CNNs and obtaining smaller models include factorization [27], [28] and distillation [29]. The latter involves using a large network to teach a small network. Low-bit networks [30], [31] are emerging architectures that include a quantization process to map continuous real values to discrete integers, thus simplifying computation. Flattened networks [32] rely on fully factorized convolutions and exploit the potential of extremely factorized networks [33]. Subsequently, the Xception network [34] demonstrated how to scale up depthwise separable filters to outperform Inception V3 networks. The approaches discussed above focus primarily on optimizing large models to achieve state-of-the-art performance. However, such approaches have shown limits with regard to the difficulty of training and the rate at which performance drops when compression is increased [26]. The point at which accuracy starts to rapidly drop occurs even when the model is still very large [49]. Thus, the drawback is that, when deployed to resource-constrained environments, these models often struggle to scale down sufficiently. To achieve this objective, expert knowledge and trial-and-error are required. Additionally, as CNN architectures continue to become deeper, the computational costs of the convolutional layers remain high. Thus, models such as those above [8], [21]–[25] still contain millions of parameters and billions of FLOPs even after compression, require significant computational resources to execute, and exhibit diminishing returns when scaling. Lightweight architectures based on simplifying complex models are, therefore, out of the scope of this paper. Instead, we focus on efficient models developed from the ground up [51].

B. LIGHTWEIGHT ARCHITECTURES
Recently, a few works attempting to design small and efficient lightweight architectures for mobile applications have emerged. For instance, [13] presents an efficient model called MobileNet (see Fig. 1). MobileNet uses depthwise separable convolutions [15], [16] to build lightweight neural networks by trading off a reasonable amount of accuracy for a model size reduction. In mathematical notation, let X^(T) ∈ R^(c_in × D_h × D_w) be the set of c_in input feature channels at layer T, each with a spatial size of D_h × D_w, let Y^(T) ∈ R^(c_out × D_z × D_q) be the corresponding set of c_out output feature channels, each with a spatial size of D_z × D_q, let W^(T) ∈ R^(D_c × D_d × D_j × D_j) be a set of filter weights, and let * denote the convolution operation. T is a layer index ∈ {0, . . . , l − 1} such that T = 0 indicates the input layer, in which X^(0) is the input image, l indicates the total number of layers, and D_j, D_d, and D_c are the spatial dimension of each kernel, the total number of kernels, and the number of feature channels, respectively. The computational cost (i.e., the computational complexity), C_s, of a standard convolutional layer with stride one can then be defined as:

C_s = c_in · c_out · D_j · D_j · D_h · D_w.    (1)

As one can observe, the cost depends multiplicatively on the number of input channels, c_in, the number of output channels, c_out, the kernel size, D_j × D_j, and the input channel size, D_h × D_w. The cost C_dp of the depthwise separable convolution, which is the combination of a depthwise convolution and a 1 × 1 pointwise convolution, is [13]:

C_dp = c_in · D_j · D_j · D_h · D_w + c_in · c_out · D_h · D_w.    (2)

One drawback of the depthwise separable convolution is that it becomes less efficient in extremely small networks because of the costly dense 1 × 1 convolution [34], [35]. Additionally, the expensive pointwise convolutions result in a limited number of feature channels to meet feature-diversity requirements, which may significantly hinder performance. To address this issue, [14] proposes a new architecture based on a low-cost pointwise group convolution and channel shuffle operation (see Fig. 1b).
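To make the comparison concrete, the costs C_s and C_dp above can be checked with a short pure-Python sketch (the channel counts and spatial sizes below are illustrative assumptions, not values from the paper):

```python
def standard_conv_cost(c_in, c_out, d_j, d_h, d_w):
    # C_s = c_in * c_out * D_j * D_j * D_h * D_w  (stride one, same padding)
    return c_in * c_out * d_j * d_j * d_h * d_w

def depthwise_separable_cost(c_in, c_out, d_j, d_h, d_w):
    # depthwise part:  c_in * D_j * D_j * D_h * D_w
    # pointwise part:  c_in * c_out * D_h * D_w
    return c_in * d_j * d_j * d_h * d_w + c_in * c_out * d_h * d_w

# Illustrative layer: 64 -> 128 channels, 3x3 kernels, 32x32 feature maps.
c_s = standard_conv_cost(64, 128, 3, 32, 32)
c_dp = depthwise_separable_cost(64, 128, 3, 32, 32)
print(c_s, c_dp, round(c_s / c_dp, 2))  # 75497472 8978432 8.41
```

For a 3 × 3 kernel the separable version is roughly 8–9 times cheaper, which matches the well-known 1/c_out + 1/D_j² reduction factor reported for MobileNet [13].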
The group convolution significantly reduces the computational cost by ensuring that each convolution operates only on the corresponding input channel group. The size S_s of a standard convolutional filter with n input feature channels and m output feature channels is:

S_s = n · m · D_j · D_j.    (3)

Now, let us consider the group convolution introduced in [14] with k groups. The size S_g of each group filter is:

S_g = (n/k) · (m/k) · D_j · D_j.    (4)

As a result, the cost of each group convolution, C_g, is:

C_g = k · S_g · D_h · D_w = (n · m / k) · D_j · D_j · D_h · D_w,    (5)

i.e., k times smaller than the cost of the corresponding standard convolution. One limitation of the group convolution is that the output feature channels are only derived from a small fraction of the input feature channels. To address this issue, [14] proposes a channel shuffle operation that allows information flow among channel groups and improves representation.
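Both ideas can be sketched in a few lines of Python. The snippet below is a toy illustration only: the cost function counts multiplications as above, and the shuffle operates on channel indices rather than actual tensors (the group counts and sizes are arbitrary):

```python
def group_conv_cost(n, m, d_j, d_h, d_w, k):
    # k groups, each mapping n/k input channels to m/k output channels:
    # the total is 1/k of the standard cost n * m * D_j * D_j * D_h * D_w
    return k * (n // k) * (m // k) * d_j * d_j * d_h * d_w

def channel_shuffle(channels, k):
    # channels laid out group-by-group: reshape to (k, n), transpose, and
    # flatten, so that every output group mixes channels from all k groups
    n = len(channels) // k
    return [channels[g * n + i] for i in range(n) for g in range(k)]

print(group_conv_cost(64, 128, 3, 32, 32, 4))  # 18874368, i.e., 1/4 of C_s
print(channel_shuffle(list(range(8)), 2))      # [0, 4, 1, 5, 2, 6, 3, 7]
```

Note that the shuffle is a pure permutation: it adds no parameters and no multiply operations, which is why it is essentially free at inference time.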
Another small network is MobileNetV2 [17], which has been shown to improve upon the state of the art by using resource-efficient inverted residuals and linear bottlenecks (see Fig. 2). In turn, SqueezeNet [19] uses a bottleneck approach to design a very small network (see Fig. 2b). The authors propose a Fire module that consists of a low-cost 1 × 1 convolutional layer to reduce the size of the feature channels and an expand layer that has a mix of 1 × 1 and 3 × 3 convolutional filters. This allows achieving AlexNet-level accuracy with 50 times fewer parameters. Recently, [18] proposed a series of practical guidelines for efficient CNN design and has been shown to achieve state-of-the-art results among mobile networks.
In contrast to MobileNet, MobileNetV2, and ShuffleNet, which depend on depth-wise separable convolutions, Wang et al. [48] propose a variant of DenseNet named PeleeNet, which instead uses conventional convolutions. PeleeNet consists of a stem block and four stages of feature extraction. The stem block can effectively improve the feature extraction capability without adding too much computational cost. The authors use two stacked convolutional layers with a 3 × 3 kernel to obtain different scales of the receptive field. Another important contribution is that the number of channels in the bottleneck layer varies according to the input shape. Another recently proposed model is IoTNet [49], which improves the trade-off between accuracy and computational cost by factorizing the 3 × 3 standard convolution commonly found in large models into pairs of 1 × 3 and 3 × 1 convolutions, rather than using depth-wise separable convolutions. In a similar way, the authors of EffNet [51] address the computational complexity associated with bottleneck layers by replacing 3 × 3 convolutions with pairs of 1 × 3 and 3 × 1 convolutions performed as a depth-wise separable operation. A weakness of such an approach is that the computational complexity savings of performing a 1 × 3 convolution as a depth-wise operation are less than the cost of a 3 × 3 convolution. On the other hand, LiteNet [50] takes an inception block, which contains 1 × 1, 1 × 2, and 1 × 3 standard convolutions arranged side by side, and modifies it by replacing half of the 1 × 2 and 1 × 3 standard convolutions with their depth-wise equivalents. Their proposed block, therefore, contains a mix of both standard and depth-wise separable convolutions. Their work also makes use of a SqueezeNet Fire block [19] to further reduce the total number of learnable parameters. The drawback of their proposed model is the side-by-side structure employed, which increases the total number of parameters with the depth.
More recently, smaller and faster neural network architectures have been developed in a fully automated manner [36], [37]. These methods focus on cell-level architecture search. Once a cell structure is determined, it is used in all layers across the network. One drawback of these approaches is that only a small dataset is used in the optimization process, which often results in a highly complex model [38].

III. FRAMEWORK
This section first details the different components of the proposed cell module, DivMod, followed by a description of our DivNet architecture.

A. PROPOSED CELL MODULE
State-of-the-art networks, such as SqueezeNet [19], MobileNet [13], MobileNetV2 [17], and ShuffleNet [14], introduce efficient cell modules into the building blocks to provide an excellent trade-off between feature representation capability and computational cost. Here, the challenge is to design a cell module that achieves the best performance under limited computational resources [39]. For this purpose, there exist two main techniques used by many efficient neural network architectures: depth-wise separable convolutions and group-wise convolutions [13], [14], [17], [34]. Depth-wise separable convolutions decompose the standard convolution into two separate operations, a depthwise convolution and a 1 × 1 pointwise convolution, to reduce the computational complexity (see Fig. 3). Group-wise convolution [14] is used to further reduce the model size.
Following [18], we propose a novel cell unit, called DivMod, based on an efficient channel split technique. DivMod aims to increase feature diversity, which is one of the main factors influencing the discriminative ability of CNNs. DivMod should be seen as complementary to the cell modules used by MobileNetV2 [17], ShuffleNet [14], and ShuffleNetV2 [18], as it allows performance to be further improved. As illustrated in Fig. 4, DivMod contains two 1 × 1 convolutions, one depthwise 3 × 3 convolution, and one ReLU activation function. A 1 × 1 convolution is first used to reduce the dimensionality of the data before the expensive depthwise 3 × 3 convolutions. The ReLU activation function follows the first 1 × 1 convolution and is applied before the split operation to avoid information collapse [14]. Experiments illustrate that using ReLU outside the convolution operations is crucial to prevent non-linearities from destroying information [17]. The c feature channels resulting from the 1 × 1 convolution operation are split into k groups, each with the same number of channels. In our experiments, we fix k = 4 as a trade-off between model size and performance, thus reducing the number of parameters by 4 times compared to the case of not using such a group convolution. One group is used as the identity, and each of the remaining k − 1 groups goes through a 3 × 3 depthwise convolution. Then, the k − 1 groups are concatenated and another 1 × 1 convolution is applied. Two sets of features are then concatenated into one: the first set is the identity group (see the skip connection in Fig. 4), and the second is the one resulting from concatenating the remaining k − 1 groups. The two 1 × 1 convolutions have the same number of input and output feature channels, and hence the number of input and output channels over the whole cell module is the same [14]. The channel shuffle operation is then used to enable information flow among the k groups (see Fig. 5).
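As a rough check of the cell's lightweight design, its parameter count can be sketched as follows. This is a hypothetical accounting based only on the description above (biases and BN parameters are ignored, and the function name and channel width are illustrative, not taken from the paper):

```python
def divmod_param_count(c, k=4, kernel=3):
    # c input channels, split into k groups after the first 1x1 convolution;
    # one group is kept as the identity, while the remaining k-1 groups pass
    # through a depthwise 3x3 stage followed by a 1x1 convolution whose
    # input and output widths both equal the (k-1) concatenated groups.
    g = c // k
    first_pointwise = c * c                 # 1x1 conv, c -> c channels
    depthwise = (k - 1) * g * kernel ** 2   # one 3x3 filter per channel
    second_pointwise = ((k - 1) * g) ** 2   # 1x1 conv over the k-1 groups
    return first_pointwise + depthwise + second_pointwise

standard = 32 * 32 * 9  # a plain 3x3 convolution with 32 in/out channels
print(divmod_param_count(32), standard)  # 1816 9216
```

Under these assumptions, the cell uses roughly a fifth of the parameters of a plain 3 × 3 convolution of the same width, which is consistent with the parameter savings the split design is intended to provide.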
Specifically, first we divide the channels in each group into several subgroups. Then we feed each group with different subgroups. It is important to note that the shuffle operation does not take any additional parameters and is computationally efficient.

B. PROPOSED ARCHITECTURE
Recent lightweight mobile models [13], [14], [17], [19] focus on cell level architecture search. Once the cell structure is determined, it is used in all layers across the network. These approaches do not permit layer diversity, which has been shown to be critical for achieving both consistent performance and lower latency of the overall network [39]- [41]. As suggested in [42], there is a need to produce feature channels with high diversity across layers to obtain better performance-latency trade-offs. However, designing deeper and more complicated networks does not necessarily lead to more efficient models with respect to size and speed [39]. To overcome this drawback, we propose DivNet, an efficient architecture based on a multi-level hierarchical design. Fig. 6b shows the baseline structure of DivNet for classification. One can observe that our architecture, in contrast to standard CNNs (Fig. 6a), principally consists of multi-structural paths to enrich the information flow among sub-networks, which is crucial to enhance discriminative abilities. In addition, our hierarchical architecture allows for different types of blocks over the network levels, which enhances the diversity of layers.
The architecture of DivNet is summarized in TABLE 1. The basic building block is our DivMod cell unit, discussed in the previous section. DivNet contains an initial standard convolutional layer with 32 filters, followed by 4 DivMod cell layers. Each DivMod cell layer consists of a number of DivMod blocks. Each DivMod block comprises a number of DivMod cell units, each with the same architecture as explained before. Unlike [18], the number of DivMod blocks in each layer is fixed to generate networks with a progressive complexity. Earlier layers in CNNs usually process large amounts of data and thus have a much higher impact on inference latency than later layers. On the other hand, mid-level and last-level features have been shown to considerably impact performance [43], [44]. Accordingly, given an initial set of features extracted by a standard CNN, DivNet increases the diversity of such features progressively from the initial to the last layer. The architecture progressively grows to construct the whole network. The high efficiency in each building layer enables using more feature channels and thus increasing network capacity. The channel shuffle operation is also used to enable information flow among the DivMod blocks. Grouped feature channels generated from the previous DivMod cell layer are stacked. Then, the channels in each group are divided into several subgroups, and each group in the next layer is fed with different subgroups. Such an architecture allows increasing the diversity of the extracted features compared to traditional CNNs. It is important to note that the shuffle operation does not take any additional parameters and is computationally efficient.
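The progressive structure can be summarized as a simple layer plan. The sketch below is purely illustrative: the stem width of 32 filters comes from the description above, but the channel doubling per stage and the split rates are assumptions, not the exact values of TABLE 1:

```python
def divnet_plan(split_rates=(4, 4, 4, 8), stem_filters=32):
    # An initial standard convolution followed by 4 DivMod cell layers whose
    # split rates grow from the input toward the last layer; channel widths
    # simply double per stage here, for illustration only.
    plan = [("conv3x3_stem", stem_filters, None)]
    channels = stem_filters
    for i, s in enumerate(split_rates, start=1):
        channels *= 2
        plan.append((f"divmod_layer_{i}", channels, s))
    return plan

for name, channels, split in divnet_plan():
    print(name, channels, split)
```

The key point the plan captures is that the split rate (and hence feature diversity) increases toward the last layer, while the cheap early layers keep the latency impact of large feature maps low.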

IV. EXPERIMENTS
In this section we discuss experiments conducted to evaluate the classification performance of the proposed DivNet model on the CIFAR10, CIFAR100, and ImageNet 2012 datasets [5], [45], often employed in the literature. First, we evaluate the effect of various DivNet architecture design decisions on performance. We analyze the trade-offs between accuracy and computational cost, in terms of model size and floating-point operations (FLOPs), by tuning several hierarchical structure designs. Next, we compare the effectiveness of DivNet against the popular network models MobileNet [13], MobileNetV2 [17], SqueezeNet [19], ShuffleNet [14], and ShuffleNetV2 [18].

A. TRAINING SETUP
Our experiments are performed on the CIFAR10, CIFAR100 [45] and ImageNet 2012 datasets [5]. CIFAR10 consists of 60,000 32 × 32 color images divided into 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. CIFAR100 consists of 60,000 32 × 32 color images divided into 100 classes, with 600 images per class. ImageNet is organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by several images; the ImageNet 2012 classification set contains approximately 1.28 million training images and 50,000 validation images spanning 1,000 classes. We train our models using TensorFlow [46]. We use the standard NADAM optimizer [47] with default parameter values.

B. PERFORMANCE EVALUATION METRICS
To evaluate the performance of DivNet, we employ the Top-1 accuracy, the Top-5 accuracy, the number of FLOPs, and the number of multiply-accumulate operations (MAC).
The Top-5 accuracy is computed by determining if the correct label is among the top-5 predicted labels.
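The Top-k computation can be written as a small framework-agnostic helper (a generic sketch; the function name and toy scores are illustrative):

```python
def top_k_accuracy(scores, labels, k=5):
    # scores: one list of per-class scores per sample; labels: true class ids
    hits = 0
    for sample_scores, true_label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        hits += true_label in ranked[:k]
    return hits / len(labels)

# Two toy samples over three classes: the true class is ranked second in
# the first sample and second in the second sample as well.
scores = [[0.1, 0.6, 0.3], [0.5, 0.3, 0.2]]
print(top_k_accuracy(scores, [2, 1], k=1))  # 0.0
print(top_k_accuracy(scores, [2, 1], k=2))  # 1.0
```

Top-1 accuracy is simply the k = 1 special case, so a single routine covers both metrics reported in the tables.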

C. DivNet ARCHITECTURE DESIGN
In this section, we design and execute experiments with the goal of providing intuition about the performance of DivNet with respect to the design strategies discussed in Section III. Specifically, our goal is to investigate the impact of design choices on accuracy, model size, and computational cost. To this end, we explore the architecture designs related to the high-level connections among DivMod cell units. To build an efficient mobile model, first we evaluate several split operator strategies, as shown in Fig. 6b, for boosting performance. Second, we evaluate the performance of the model with/without shuffle operations.

1) SPLIT OPERATOR STRATEGIES
Fig. 6b has split rates of S1 = S2 = S3 = 4 and S4 = 8. As one can observe, large split rates (e.g., S = 8) consistently boost classification scores. Performance improves with split rates progressively increasing from the input to the last layer. Specifically, these results show that using a high split rate for the mid and last layers gives the best performance with low computational cost. For instance, using a split operator S = 8 for the mid-layers achieves 60.51% accuracy, higher than using S = 8 for early layers (+4.43%) on the CIFAR100 dataset, and 79.85% accuracy on the CIFAR10 dataset, better than using S = 8 for early layers (+9.27%). The architecture with a high split rate in the last layer, i.e., S = 8, gives the best performance, i.e., 70.17% and 94.35% on the CIFAR100 and CIFAR10 datasets, respectively, and outperforms the first and the second architectures by 23.77% and 14.5% (CIFAR100), and by 14.09% and 9.66% (CIFAR10), respectively. In this case, DivNet only requires 489,897,672 FLOPs, fewer than the first and the second architectures by 193,378,304 FLOPs and 240,694,272 FLOPs, respectively. It is then clear that DivNet benefits from the hierarchical network design.

2) CHANNEL SHUFFLE OPERATION
The purpose of the shuffle operation is to encourage information flow among channels, and it is important for improving the performance of DivNet. TABLE 3 compares the performance of the third DivNet architecture tabulated in TABLE 2 when using/removing the channel shuffle operation. The evaluations are performed under four different setups: Shuffle 1 is applied exclusively, Shuffle 2 is applied exclusively, Shuffle 3 is applied exclusively, and no shuffle operation is applied. As one can observe, using the shuffle operations significantly improves the classification performance on both datasets, i.e., CIFAR10 (94.35%) and CIFAR100 (70.17%), compared to the case of not using this operation, i.e., CIFAR10 (+2.00%) and CIFAR100 (+3.2%). In particular, on the CIFAR10 dataset, DivNet with Shuffle 3 attains 93.65% accuracy, better than that attained with Shuffle 1 (+0.65%) and Shuffle 2 (+0.35%). On the CIFAR100 dataset, DivNet with Shuffle 3 attains 69.07% accuracy, better than that attained with Shuffle 1 (+1.70%) and Shuffle 2 (+1.4%). From these results, it is clear that the proposed DivNet model performs better with the shuffle operation. Additionally, using feature channels with high diversity across mid-level and last-level layers works especially well for more challenging datasets, i.e., CIFAR100, resulting in important performance improvements.
TABLE 5 tabulates the performance of DivNet on the more challenging ImageNet 2012 dataset. In order to handle the complexity of this dataset, we evaluate more complex versions of the state-of-the-art lightweight neural networks tabulated in TABLE 5. For a fair comparison, in addition to the accuracy, we consider two other metrics: FLOPs and the inference speed (CPU speed). The latter is measured in milliseconds (ms) and depends on the platform characteristics.
As shown in TABLE 5, DivNet's inference time is 70.55 ms, 42.45 ms shorter than MobileNet, 72.45 ms shorter than MobileNetV2, and 5.05 ms shorter than ShuffleNet. PeleeNet outperforms our DivNet model, i.e., +1.95% (Top-1) and +0.37% (Top-5), but with higher FLOPs (+19M). Compared with ShuffleNetV2, DivNet performs better in terms of both Top-1 (+1.29%) and Top-5 (+1.91%) accuracies, requiring 0.5M fewer parameters but with a higher computational cost, i.e., DivNet's FLOPs are +190M greater and its inference time is higher (+32.75 ms). However, it should be noted that FLOPs is a metric that indirectly approximates the computational complexity, and it may not always be equivalent to the direct metric based on CPU speed (as one can confirm by comparing ShuffleNetV2 and DivNet). This discrepancy between the indirect (FLOPs) and direct (speed) metrics can be attributed to the degree of parallelism [18]. A model with a high degree of parallelism can be much faster than another one with a low degree of parallelism under the same FLOPs. Therefore, we anticipate that the inference time of our hierarchical-structure-based model could be improved if it were implemented with effectively designed parallel algorithms. In particular, the inherent multi-branch nature of DivNet enables efficient inference and supports parallel execution on limited computing resources. In this sense, we see in TABLE 2 that not only the number of parameters but also the architecture is critical to achieving good performance: the third architecture, with a high split rate in the last layer, i.e., S = 8, gives the best performance, i.e., 70.17% and 94.35% accuracy on the CIFAR100 and CIFAR10 datasets, respectively, and outperforms the second architecture by 14.5% (CIFAR100) and by 9.66% (CIFAR10), with a smaller number of parameters (−59,392). DivNet benefits from the hierarchical network design, which is the main contribution of our work.
On the other hand, existing automatic neural architecture search methods for mobile devices are computationally expensive because the design space is combinatorially large. A network with a hierarchical design, in conjunction with an exhaustive search for the most appropriate type, kernel size, and number of filters in each layer, is expected to lead to better accuracy and efficiency.

V. CONCLUSION
This paper presented an efficient lightweight CNN based on a hierarchical design, called DivNet. Our goal is to produce feature channels with high diversity across layers, based on an efficient channel split operation, to obtain better performance-complexity trade-offs. DivNet employs a novel cell unit, called DivMod. The split operation strategy in combination with the DivMod cell units consistently allows extracting relevant features that best characterize the complexity of the input image. Such feature extraction can greatly increase diversity, which is one of the main factors influencing the discriminative ability of neural networks. Experimental results illustrated the potential of DivNet to improve classification performance with low computational cost. On the CIFAR100 dataset, DivNet achieved 70.17% accuracy, whereas only 68.64% and 68.08% accuracies were achieved by VGG11 bn and MobileNetV2, respectively. Similarly, on the CIFAR10 dataset, DivNet outperformed most of the other evaluated state-of-the-art models and attained 94.35% Top-1 accuracy. On more complex datasets, i.e., ImageNet 2012, DivNet achieved a Top-1 accuracy of 70.65% and a Top-5 accuracy of 90.23%, and outperformed other small models such as MobileNet, SqueezeNet, and ShuffleNet.
HADRIA FIZAZI received the Ph.D. degree in industrial computer science and automation engineering from the University of Lille 1, France, in 1987. She is currently a Professor with the Faculty of Computer Science, University of Science and Technology of Oran Mohamed Boudiaf. Her research interests include image processing and the application of pattern recognition to remotely sensed data.

Professor with the Department of Computer Science, University of Warwick, U.K. His research has been funded by the Consejo Nacional de Ciencia y Tecnologia, Mexico, the Natural Sciences and Engineering Research Council, Canada, the Canadian Institutes of Health Research, the FP7 and the H2020 Programs of the European Union, the Engineering and Physical Sciences Research Council, U.K., and the Defence and Security Accelerator, U.K. He has authored several technical articles, book chapters, and a book in these areas. His main research interests include signal and information processing and machine learning with applications to multimedia analysis, computer vision, image and video coding, security, and communications.