Image Aesthetic Assessment: A Comparative Study of Hand-Crafted & Deep Learning Models

Automatic image aesthetics assessment is a computer vision problem that deals with categorizing images into different aesthetic levels. The categorization is usually done by analyzing an input image and computing some measure of the degree to which it adheres to fundamental principles of photography such as balance, rhythm, harmony, contrast, unity, look, feel, tone, and texture. Owing to its diverse applications in many areas, automatic image aesthetic assessment has gained significant research attention in recent years. This article presents a comparative study of automatic image aesthetics assessment techniques published between 2005 and 2021. A number of conventional hand-crafted as well as modern deep learning-based approaches are reviewed and analyzed for their performance on various publicly available datasets. Additionally, critical aspects of the different features and models are discussed to analyze their performance and limitations in different situations. The comparative analysis reveals that deep learning-based approaches outperform hand-crafted techniques in image aesthetic assessment.

generation and processing, including medical and healthcare, information and communication technologies, infotainment, edutainment, and safety and security. For example, it can be employed to benchmark algorithms for image noise removal and image restoration, as well as to monitor the quality of service (QoS) in systems where images are digitally compressed, communicated, and decompressed.

Underwater image enhancement and restoration systems can also benefit from aesthetic assessment techniques [1]. For instance, in underwater systems [2], image quality assessment can be used to decide when image enhancement approaches should be applied to improve quality and accuracy for a low-quality input image [3], [4], [5]. Moreover, image quality assessment can be utilized in robotics, where a robot automatically assesses image quality and changes its focus and position to recapture the image if the quality metric falls below some recommended level.

Due to its significant application potential in the rapidly growing digital camera and photography industry, automated image aesthetic assessment has recently gained considerable research attention.

These methods are pioneering image aesthetic approaches and provide a naive baseline methodology.

1) An intelligent photographic interface with on-device aesthetic quality assessment for bi-level image quality on general portable devices is proposed by Lo et al. [13] (Figure 3(a)). In this framework, photographic rules were followed and a three-layered structure was designed. Using hand-tuned techniques, the first layer extracts composition, saturation, colour combination, contrast, and richness features. In the second layer, an independent SVM classifier [37] is trained for each feature perspective to obtain a feature index, and in the last layer an SVM classifier is trained to produce the aesthetic score. The framework is tested on the CUHK dataset [38], comprising 2078 high-quality and 7573 low-quality images, achieving an accuracy of 89%.
2) A computational algorithm using region-based features and k-means clustering is presented by Datta et al. [14]. Colour segments are extracted from the image using region-based features and texture information, and image quality is assessed with a connected-component technique. An SVM classifier is then trained on the extracted features to categorize images into high and low aesthetic categories, and a regression model [39] is additionally trained to obtain a continuous aesthetic score. The dataset, collected from a photo-sharing website, consists of 3581 images.
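To make the shared recipe of these early systems concrete, the following minimal Python sketch trains an SVM on hand-crafted feature vectors for bi-level classification. The extract_features descriptor and the images and labels arrays are hypothetical stand-ins, not the exact features of [13] or [14].

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def extract_features(image):
        # Hypothetical hand-crafted descriptor: global luminance,
        # global contrast, and a 3x3 rule-of-thirds luminance grid.
        lum = image.mean(axis=2) / 255.0
        h, w = lum.shape
        grid = [lum[i*h//3:(i+1)*h//3, j*w//3:(j+1)*w//3].mean()
                for i in range(3) for j in range(3)]
        return np.array([lum.mean(), lum.std(), *grid])

    # images: list of HxWx3 uint8 arrays; labels: 1 = high, 0 = low
    X = np.stack([extract_features(img) for img in images])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)
    print("bi-level accuracy:", clf.score(X_te, y_te))

As in the surveyed methods, all of the modelling effort sits in the feature design; the classifier itself needs only a few thousand labelled images to train.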

3) To assess the quality of digital portraits, Redi et al. [15] introduced a technique based on five essential feature groups extracted from images: composition, scene semantics, portrait-specific features, correct perception of the signal, and fuzzy properties. Note that the composition rules are the essential, basic photography rules, covering sharpness, spatial arrangement, lighting, texture, and colour. The semantic contents represent the overall photographic depiction, including high-level features [40]. Correct perception of the signal covers noise, contrast quality, exposure quality, and JPEG quality, while portrait-specific features include face position, face orientation, age, gender, eye, nose, and mouth position, and foreground-background contrast. The fuzzy properties are originality, memorability, uniqueness, and emotion depiction. LASSO regression [41] is applied to the extracted features, learning regression parameters for every feature group, and the correlation between the predicted score and the original aesthetic value is computed. Using regression on all features, a final aesthetic score is obtained (a short sketch of this regression step is given below).

FIGURE 1. Sample images taken from the Aesthetics and Attributes Database (AADB) [12], consisting of various photographic imagery of real scenes collected from Flickr. For each image, the ground-truth rating is obtained by averaging five raters' scores. Eleven aesthetic attributes were considered while curating the dataset: interesting content, object emphasis, good lighting, colour harmony, vivid colour, shallow depth of field, motion blur, rule of thirds, balancing element, repetition, and symmetry. Photos a) and b) show high aesthetic scores of 1.0 and 0.7, while c) and d) show low aesthetic scores of 0.4 and 0.1, respectively.

6) An online photo-quality assessment and photo selection system is presented in [18], as shown in Figure 3(d), where users post their images and the algorithm provides an aesthetic evaluation and editing recommendations. The cropping-based editing algorithm uses composition features and composition optimization for the proposed system, taking a single image or background as input.

1) The landscape photo assessment algorithm by Yang et al. [20] is shown in Figure 4(a). In [21], the architecture of the proposed framework is shown in Figure 4(b); the image is edited if the appeal factor is lower than the computed aesthetic score (i.e., between 1 and 5).
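The per-group regression step of [15] can be illustrated with a short sketch: LASSO is fitted on each feature group, the rank correlation between the predicted and ground-truth scores is reported, and a final regression is run over all groups together. The feature_groups dictionary and the score vector y are assumed given, and the alpha value is an illustrative choice rather than the published setting.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.linear_model import Lasso

    # feature_groups: dict mapping group name -> (n_images, d) array
    # y: ground-truth aesthetic scores, shape (n_images,)
    for name, X in feature_groups.items():
        model = Lasso(alpha=0.01).fit(X, y)      # per-group regression
        rho, _ = spearmanr(model.predict(X), y)  # correlation with truth
        print(f"{name}: Spearman rho = {rho:.3f}")

    # final aesthetic score: regression over all feature groups at once
    X_all = np.hstack(list(feature_groups.values()))
    final_model = Lasso(alpha=0.01).fit(X_all, y)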

In this section, we summarize the algorithms that consider both local and global features learned from images for aesthetic assessment.

1) The multi-label task for assessing the aesthetic quality ...

2) The BLIINDS-II algorithm [26], which employs the discrete cosine transform (DCT) [68], [69], is shown in Figure 5.

1) According to Nishiyama et al. [30], aesthetic quality depends strongly on the sum of colour harmony scores over local regions; they implement bag-of-colour patterns for photograph quality classification (see Figure 6(a)).

3) The content-based photo quality assessment (CPQA) method [32] deals with both regional and global features across three different areas: clarity-based detection, layout-based detection, and human-based detection.

Rongju et al. [35] ... (Figure 7).

The aesthetic quality assessment of photographs can be formulated as a classification problem, a regression problem, or a combination of the two. There is a lack of consensus on the definition of aesthetic quality, as it is a subjective matter; however, photo-sharing communities rate photos, and the average score is usually taken as the quality of an image and used as ground truth by different algorithms. The assessment task is therefore usually treated as a classification problem, although it can also be formulated as a regression that maps a photograph to an aesthetic score. The aesthetic quality features can either be extracted as hand-crafted descriptors or learned using deep learning architectures in multi-task settings; the multi-task approaches tend to learn better representations and improve the aesthetic score prediction significantly.

2) MULTI-TASK CONVOLUTIONAL NETWORKS
A multi-task learning approach is employed to explore the correlation between automatic aesthetic assessment and semantic information. The idea is to utilize semantic information in a joint objective function to improve the quality assessment task [102], so the approach outputs both aesthetic and semantic labels. A Multi-Task Convolutional Neural Network (MTCNN) [83] is designed to perform both semantic recognition and quality assessment, taking an input image of size 227 × 227. The network automatically learns the relation between semantics and aesthetics; it consists of five convolutional layers, three pooling layers, and three fully connected layers. The architecture is shown in Figure 9(a). Furthermore, three representations ...
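A minimal PyTorch sketch of this shared-trunk, two-head design is given below: a shared convolutional feature extractor feeds one head for the binary aesthetic label and one for multi-label semantic tags, trained under a joint objective. The layer widths and the number of semantic tags are illustrative, not the exact MTCNN configuration of [83].

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskAestheticNet(nn.Module):
        def __init__(self, num_semantic_tags=29):
            super().__init__()
            self.trunk = nn.Sequential(   # shared feature extractor
                nn.Conv2d(3, 64, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
                nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
                nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
                nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
                nn.Flatten(),
            )
            with torch.no_grad():  # infer flattened size for 227x227 input
                feat = self.trunk(torch.zeros(1, 3, 227, 227)).shape[1]
            self.aesthetic_head = nn.Linear(feat, 2)  # high / low logits
            self.semantic_head = nn.Linear(feat, num_semantic_tags)

        def forward(self, x):
            f = self.trunk(x)
            return self.aesthetic_head(f), self.semantic_head(f)

    # joint objective: cross-entropy for the aesthetic label plus
    # binary cross-entropy for the multi-label semantic tags
    net = MultiTaskAestheticNet()
    aes, sem = net(torch.randn(4, 3, 227, 227))
    loss = (F.cross_entropy(aes, torch.randint(0, 2, (4,)))
            + F.binary_cross_entropy_with_logits(sem, torch.rand(4, 29).round()))
    loss.backward()

Because both heads share the trunk, gradients from the semantic task regularize the aesthetic features, which is the mechanism these multi-task methods rely on.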

An end-to-end personality-driven multi-task deep learning model has been introduced to assess image aesthetics [85], as shown in Figure 9(c). First, image aesthetics and personality traits are learned by the multi-task model; the personality features are then used to modulate the aesthetic features, producing the final generic image aesthetics scores.

Bianco et al. [86] used deep learning to predict image aesthetics on the Aesthetic Visual Analysis (AVA) dataset [49]. In this model, a canonical convolutional neural network architecture is fine-tuned to obtain aesthetic scores, and aesthetic quality assessment is treated as a regression problem. CaffeNet [106] is selected for fine-tuning, and its last fully connected layer is replaced by a single neuron providing an aesthetic score between 1 and 10.
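A hedged sketch of this fine-tuning scheme follows; torchvision's ResNet-18 stands in for CaffeNet, which modern frameworks do not ship, and the data batch is a placeholder. The essential move is the same: replace the final fully connected layer with a single neuron and train with a regression loss.

    import torch
    import torch.nn as nn
    from torchvision import models

    # pre-trained backbone; last FC layer replaced by one score neuron
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 1)

    criterion = nn.MSELoss()  # aesthetic score regression
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

    images = torch.randn(8, 3, 224, 224)        # placeholder batch
    scores = torch.empty(8, 1).uniform_(1, 10)  # mean scores in [1, 10]
    loss = criterion(model(images), scores)
    loss.backward()
    optimizer.step()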

A two-column content-adaptive aesthetic rating neural network that takes into account both style content and semantic information is proposed in [88]. Each column is trained on a different crop of the same image and consists of three convolutional layers and three pooling layers followed by a fully connected layer. Finally, the style and semantic features extracted by the two columns are fused by two fully connected layers, as shown in Figure 10(b). The network is trained end-to-end using stochastic gradient descent. A network adaptation strategy is proposed to facilitate content-based image aesthetics; this improves adaptation to an image's semantic content, so fewer images per category are required for training. A Regularized Double-Column Convolutional Neural Network (RDCNN) is proposed, which includes a single-column Style Convolutional Neural Network (Style-SCNN) for style information and a Double-Column Convolutional Neural Network (DCNN) for semantic information. The final structure of the framework is shown in Figure 10(c). The network is tested on the AVA and IAD [112] datasets to categorize images into high and low quality, achieving 71.2% accuracy.
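The double-column pattern can be sketched as follows: two columns process two different crops of the same image, and their features are fused by two fully connected layers. Layer sizes are simplified relative to the published architecture, and the crop resolution is arbitrary.

    import torch
    import torch.nn as nn

    def make_column():
        # three conv + pool stages followed by a fully connected layer
        return nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.LazyLinear(256), nn.ReLU(),
        )

    class TwoColumnNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.col_a, self.col_b = make_column(), make_column()
            self.fusion = nn.Sequential(  # two fully connected fusion layers
                nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))

        def forward(self, crop_a, crop_b):
            f = torch.cat([self.col_a(crop_a), self.col_b(crop_b)], dim=1)
            return self.fusion(f)         # high / low logits

    net = TwoColumnNet()
    logits = net(torch.randn(2, 3, 96, 96), torch.randn(2, 3, 96, 96))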

A composition-preserving convolutional neural network has been proposed for photo aesthetic assessment [89]. The network addresses the image quality degradation caused by resizing and cropping. A Multi-Net Adaptive spatial pooling Convolutional Neural Network (MNA-CNN) is designed to rate variable-size images. For this purpose, an adaptive spatial pooling layer is introduced that adjusts its receptive field according to the output size rather than the input size (a generic illustration of this mechanism is sketched at the end of this subsection). There are multiple network streams [113] in which an adaptive spatial pooling layer replaces the last pooling layer. A pre-trained VGG [114] is fine-tuned using the Torch deep learning package [115], and each sub-network is trained separately. An additional scene categorization CNN is trained on Places205-GoogLeNet, which comprises 2.5 million images. The framework is shown in Figure 10.

Hue, saturation, and value are directly computed from the image, whereas the other attributes are learned using parallel deep neural networks, as shown in Figure 12(b). This network predicts a binary label (0 or 1) and is trained on the AVA dataset. The high-level synthesis network is a four-layer convolutional neural network that predicts the overall aesthetic level of the image; at this stage, the entire network is trained end-to-end on the AVA dataset. Experiments were performed on 12 CPUs (Intel Xeon 2.7 GHz) and a GPU (Nvidia GTX 680); training and fine-tuning take around one day, with an accuracy of 76.80%.

For image aesthetic quality assessment, Liu et al. [92] proposed a semi-supervised deep active learning (SDAL) algorithm, which discovers how humans perceive semantically significant regions in a large set of images partially assigned with contaminated tags.

An adaptive fractional dilated convolution has also been developed [118], which is aspect-ratio-embedded, composition-preserving, and parameter-free. The fractional dilated kernel is adaptively constructed according to the image aspect ratio, and interpolation between the two nearest integer dilated kernels is used to cope with the misalignment of fractional sampling.
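The variable-size property that MNA-CNN's adaptive spatial pooling provides can be demonstrated generically with PyTorch's built-in AdaptiveAvgPool2d, which fixes the output size regardless of the input resolution. This is an illustration of the mechanism, not the MNA-CNN architecture itself.

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d((7, 7)),  # output is 7x7 whatever the input size
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 1),      # aesthetic score
    )

    # images of different resolutions pass through the same network
    for h, w in [(224, 224), (300, 480), (513, 257)]:
        out = net(torch.randn(1, 3, h, w))
        print(h, w, "->", out.shape)   # always torch.Size([1, 1])

Because the pooling window adapts to the input, no destructive resizing or cropping is needed, which is exactly the composition-preserving property the method targets.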

A convolutional neural network has been used to investigate the relationship between image measures, such as complexity, and human aesthetic evaluation, using dimension-reduction methods to visualize both the genotype and phenotype spaces and thereby support the exploration of new territory in a generative system [93]. Convolutional neural networks trained on the artist's prior aesthetic evaluations are used to suggest new possibilities that are similar to, or lie between, known high-quality genotype-phenotype mappings. ... images to encode the spatial interaction of the visual elements.

The AVA dataset contains 255,000 images associated with 963 challenges. When aesthetic quality is treated as a binary classification problem, images whose average aesthetic score exceeds 5 + σ are labelled positive, while those whose average score falls below 5 − σ are labelled negative (a small labeling sketch is given at the end of this section). With the hard threshold σ = 0, the training and testing sets contain 230,000 and 20,000 images, respectively. Another split takes the top 10% and bottom 10% of the images, yielding 25,000 images in the training set and 25,000 in the testing set.

...tion, over-exposed, under-exposed, motion blur, no blur, out of focus, and partially blurred.

• Content-Based Methods. Table 4 compares the content-based hand-crafted methods. All the methods are evaluated on bi-level image aesthetic assessment tasks. The number of images employed by content-based methods is relatively higher than for the other previously mentioned methods. Most methods use an SVM as the classifier, while there is no established choice of features or datasets. It can also be observed that the larger the dataset, the lower the accuracy, and vice versa.

In summary, a large dataset is not required for hand-crafted methods; these techniques use a few hundred or a few thousand images to train classifiers. Almost 75% of the articles discussed in this survey utilized an SVM classifier to classify images into high and low aesthetic levels, and around 15% used support vector regression, where the regression provides a continuous score to which a threshold is applied for classification into different aesthetic levels. Hand-tuned ...
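The AVA labeling rule described at the start of this section can be written in a few lines; sigma is the rejection margin, and images falling inside the ambiguous band around the score of 5 are discarded.

    import numpy as np

    def ava_binary_labels(mean_scores, sigma=0.0):
        # images above 5 + sigma are positive, below 5 - sigma negative;
        # scores inside the band are dropped from training and testing
        scores = np.asarray(mean_scores, dtype=float)
        pos = scores > 5.0 + sigma
        neg = scores < 5.0 - sigma
        keep = pos | neg
        return np.where(keep)[0], pos[keep].astype(int)

    idx, y = ava_binary_labels([6.2, 4.8, 5.0, 3.9], sigma=0.5)
    # idx -> array([0, 3]); y -> array([1, 0])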