Automatic Monitoring Cheese Ripeness Using Computer Vision and Artificial Intelligence

Ripening is a very important process that contributes to cheese quality, as its characteristics are determined by the biochemical changes that occur during this period. Therefore, monitoring ripening time is a fundamental task to market a quality product in a timely manner. However, it is difﬁcult to accurately determine the degree of cheese ripeness. Although some scientiﬁc methods have also been proposed in the literature, the conventional methods adopted in dairy industries are typically based on visual and weight control. This study proposes a novel approach aimed at automatically monitoring the cheese ripening based on the analysis of cheese images acquired by a photo camera. Both computer vision and machine learning techniques have been used to deal with this task. The study is based on a dataset of 195 images (speciﬁcally collected from an Italian dairy industry), which represent Pecorino cheese forms at four degrees of ripeness. All stages but the one labeled as ‘‘day 18’’, which has 45 images, consist of 50 images. These images have been handled with image processing techniques and then classiﬁed according to the degree of ripening, i.e., 18, 22, 24, and 30 days. A 5-fold cross-validation strategy was used to empirically evaluate the performance of the models. During this phase, each training fold was augmented online. This strategy allowed to use 624 images for training, leaving 39 original images per fold for testing. Experimental results have demonstrated the validity of the approach, showing good performance for most of the trained models.


I. INTRODUCTION
Dairy products have a high commercial value in the food industry, even considering that they are a source of proteins, calcium, and micro-nutrients with beneficial effects on bone and muscle health. In addition, thanks to their probiotic content, they increase the health of the digestive tract and positively influence the microbiome.
Among dairy products, cheese is a popular food produced and consumed in many parts of the world. The cheese quality depends on several characteristics, including chemical components, internal structure, physical properties, and other attributes such as oxygen and dielectric properties [1]. Other important aspects that determine the quality of the final product are sensory aspects stimulated by specific properties The associate editor coordinating the review of this manuscript and approving it for publication was Gianluigi Ciocca . and components of the cheese. Not incidentally, the transformations of the milk constituents that affect the cheese quality occur during the ripening phase. In particular, several biochemical changes (lipolysis and proteolysis) that occur during this process significantly affect the cheese's flavor, aroma, and texture [2]. Consequently, monitoring the degree of ripeness is a mandatory step in the product quality assurance process [3]. Unfortunately, accurately determining the degree of maturation of the cheese is not trivial, even when considering only a specific product. The point here is that this process is not entirely predictable or controllable, being influenced by many factors -including season, the origin of the milk, processing steps, and temperature of the storage places [4]. On the other hand, an error in determining the maturation level of cheese typically has many negative consequences. In particular, for soft cheeses, the decision to place them on the market must be made within hours, as the ripening process usually takes a few weeks. In either case (i.e., cheese not mature enough or cheese too mature), a product of lower quality than expected is sent to the market, with potential risks for the company in terms of income or prestige (also note that keeping aged cheeses in the dairy has warehouse costs). Visual inspection and weight checking are the conventional methods adopted in dairy industries to assess cheese ripeness. In both cases, despite the training of the personnel involved, there is room for occasional errors, which can result in costs for the companies. Moreover, checking an entire batch of cheese with these methods can be very timeconsuming, so most often, only some samples are checked out. For all these motivations, the dairy sector's interest is growing in adopting cutting-edge technologies capable of effectively monitoring the maturation process.
In recent years, non-invasive techniques for monitoring cheese ripening have been adopted, typically based on the analysis of physicochemical, chromatographic, and electrophoretic characteristics. However, these techniques are time-consuming and expensive [5]. Other non-invasive methods based on spectroscopic techniques have also been proposed in the literature. Without claim to be exhaustive, some works are recalled from now on. Dufour et al. [6] analyzed the mid-infrared and fluorescence of sixteen different kinds of cheese collected at four ripening times. The similarity maps obtained by applying principal component analysis (PCA) to the acquired spectra demonstrated the feasibility of discriminating the cheeses according to their ripening times. FT-NIR (Fourier Transform Near-Infrared) and FT-IR (Fourier Transform Infrared) spectroscopy have been applied in [7] to study the shelf-life of Crescenza cheese.
Spectral data were acquired from cheese samples collected at different times for twenty days. The PCA applied to this data detected the decrease in the ''freshness'' of Crescenza and defined the critical day during the shelf-life period. Del Campo et al. [8] used mid-infrared spectroscopy to study the characterization of the ripening stages of Emmental cheeses. Using PCA, the authors could discriminate among four categories of cheese ripening. Specifically, these groups include samples ripened during i) 21, 27, and 34 days, ii) 51 and 58 days, iii) 65 days, and iv) about 85 days (i.e., the samples as found at the end of the ripening process). Soto-Barajas et al. [9] proposed a study aimed at predicting cheese ripening and the types of milk mixtures used therein. Artificial neural networks (ANNs) were designed and trained with data reporting those cheeses' fatty acid composition and NIR spectra. The ANN model was able to predict the ripening of cheeses acquired in a period of six months.
Among the various food industry-related tasks in which we can frame our work, the spectrum of tasks addressed is extremely diverse. The problem of food quality and authenticity determination is a hot topic [10]. In this context, Jahanbakhshi et al.proposed methods based on computer vision and deep learning techniques to detect adulteration in turmeric powder [11], [12], while Al-Sarayreh [13] proposed a novel method based on the combination of spectral and textural information from red meat images to identify possible adulteration. Other important topics are certainly food recognition, retrieval, and classification [14], [15], [16] as a potential help in making recipes [17], for example, or food calorie estimation [18], [19].
This work is primarily motivated by the need to implement a non-invasive approach able to automatically and accurately determine the degree of ripeness of the cheese. This approach has the advantage of automating cheese control, eliminating common human errors that can lead to placing a non-quality product on the market. All the above mentioned aspects are addressed in this work. The goal is to implement an automated system to ensure the production of a quality product.
This paper proposes a new non-invasive approach for monitoring the cheese ripening process using advanced computer vision (CV) and machine learning (ML) techniques. Specifically, cheese images acquired with an ordinary camera are processed with CV techniques and then classified using relevant ML algorithms.
Our original contribution is fivefold: i) four different categories of visual handcrafted (HC) descriptors were extensively studied and compared; ii) the classification performance of different families of ML classifiers was analyzed and compared; iii) ten different convolutional neural networks in an end-to-end deep learning (DL) classification were evaluated; iv) a novel image processing pipeline, from the image acquisition to the ripeness classification, is proposed; v) an extensive comparative analysis in a novel domain and an effective pipeline to solve the task is proposed.
To the best of our knowledge, this is the first study aimed at evaluating cheese ripening by analyzing images acquired through a photo camera.
The remainder of this article is organized as follows. Materials and methods are described in Section II. Experimental settings and the corresponding results are reported in Section III. A discussion of the results follows in Section IV, together with the challenges raised by this task. Finally, conclusions are drawn in Section V.

II. MATERIALS AND METHODS
The automated monitoring of the cheese ripeness was approached as a supervised classification problem, characterized by the need to correctly map a cheese image to its actual maturation degree. Specific strategies based on DL and machine learning have been implemented and used to tackle this classification problem. In particular, DL algorithms were used with a twofold aim: i) to build classifier models and ii) to employ them for feature extraction. Both tasks were addressed using off-the-shelf convolutional neural network (CNN) architectures proposed in the literature. Beyond the adoption of DL techniques and architectures, several classifier models were constructed from handcrafted features obtained by image processing techniques and features extracted from CNNs. As for the dataset used in this study, all images were collected in a local dairy industry.   ., 30_Aged).
The dataset consists of cheese images representing four degrees of ripeness of a specific type of soft cheese.
This section is organized as follows: the image dataset is first described in Section II-A; then the pipeline aimed at preprocessing the dataset is described in Section II-B; and afterward the extracted features are described in Section II-C. Subsequently, the adopted DL and ML algorithms are briefly recalled in Section II-D and Section II-E. Finally, the performance measures used to assess the models are reported in Section II-F.

A. DATASET CHARACTERISTICS
The dataset used in this research was built by collecting images of a Pecorino cheese produced by a Sardinian (Italy) dairy company. The specific product is classified as soft cheese, as its maturing period reaches its completion in 20-25 days. The dataset was built with the support of the Sardinian agency for the implementation of regional agricultural and rural development programs (LAORE 1 ).
The dataset consists of 195 images representing the selected Pecorino cheese forms at four degrees of ripeness, i.e., 18, 22, 24, and 30 days. All stages but the one labeled as ''day 18'', which has 45 images, consist of 50 images (see Table 1).
The highly trained staff of the dairy ensured that no acceleration or slowdown was observed in the selected forms. As a consequence, those labeled as ''day k'' represent that day in an ideal ripening process. Note that for this specific product, the ripening process completes on ''day 24,'' which means that the forms labeled as ''day 18'' and ''day 22'' should be considered insufficiently ripe. Conversely, those labeled ''day 30'' should be regarded as overripe.
The camera used to acquire images was a Nikon D750, with a CMOS 35.9 × 24.0 mm sensor and a resolution of 24 Mpixel. All dataset images have a resolution of 4, 016× 6, 016 or 6, 016 × 4, 016. A sample image for each category is shown in Figure 1.

B. DATA PREPROCESSING
Data preprocessing consists of a pipeline of two steps, i.e., image segmentation, and cropping. As shown by the sample reported in Figure 1, a non-uniform background characterizes all the images. As not beneficial for the classification task, it has been removed by a proper segmentation process that allowed substituting the original background with a neutral and constant one.
Image segmentation -Initially represented in RGB color space, all images were first converted to HSV space, being more representative of the contrast between the cheese and background region. Then, the images were segmented by a threshold approach, taking the saturation channel as a reference. Specifically, the images were segmented (i.e., binarized) by choosing an automatic adaptive threshold based on the local average intensity (first-order statistics) in the neighborhood of each pixel [20]. Thanks to the binarization process, the object of interest, represented by the cheese region, was properly extracted by selecting the bounding box of the biggest region, i.e., the cheese region. This operation made it possible to preserve the information related to the cheese region without influences dictated by the uneven background.
Image cropping -Once the cheese region was segmented, an image was produced, whose dimensions were derived from the size of the region itself. This image is formed by a constant background (black color was chosen) and the segmented object of interest. The cropped images obtained from the ones depicted in Figure 1 are presented in Figure 2. The images thus generated are then provided to the next stage of analysis, which is the feature extraction, or directly used as input for a convolutional neural network -depending on the chosen classification approach.
It is worth noting that an additional step has been applied while implementing the DL-based approaches. In this case, after cropping, the images have also been scaled according to the input shape (i.e., 224 × 224 or 299 × 299) of the adopted CNN architectures.

C. FEATURE EXTRACTION
Prior to this stage, a thorough image preprocessing pipeline allowed us to obtain representative images of the cheese acquired during the ripening process. Then different feature sets in various combinations were extracted and used to train the selected ML models. According to Putzu et al. [21], these features can be grouped into four main categories: invariant moments, texture, color, and deep features. The first three categories are handcrafted, while the last one includes all features extracted by means of CNNs. The selected categories represent a broad spectrum of existing descriptors in computer vision and are used in multiple activities, e.g., industry [22], biomedicine [23], [24], [25], agriculture [26], [27], video-surveillance [28], [29]. To the best of our knowledge, no ''off-the-shelf'' solution is available in the literature for this purpose. In fact, the existing methods presented in Section I address different types of cheese and ripening days. This prompted us to develop an ad hoc image processing pipeline that makes use of the above descriptors.
Invariant Moments -An image moment is a weighted average (i.e., the moment) of the image pixel intensities used to extract some properties from an image. Moments are used in image analysis and pattern recognition to describe objects  after segmentation. In this work, four types of moments were used (i.e., Hu, Zernike, Legendre, and Chebyshev). Let us briefly summarize them.
• Hu moments (HM) have been proposed by Hu [30] to solve pattern recognition problems. They are invariant to translation, scale, and rotation changes.
• Legendre moments (LM) are obtained using Legendre polynomials as the kernel function. First introduced by Teague [31], they belong to the class of orthogonal moments and can be used to attain a near zero value of redundancy measure in a set of moment functions. These moments can be used to highlight independent characteristics of the image [32].
• Zernike moments (ZM) are based on the Zernike polynomials, an orthogonal sequence of polynomials on the unit disk. These orthogonal moments are used to represent image properties without redundancy [33].
• Chebyshev moments (CH) were first proposed by Mukandan et al. [34]. Unlike Zernike and Legendre moments, they belong to the class of discrete orthogonal moments. Hence, the implementation of these moments does not involve any numerical approximation. They are derived from Chebyshev polynomials and can extract global features in an image by varying moment order [35]. Here, first and second-order Chebyshev moments (denoted as CH_fi and CH_se) were extracted. All mentioned invariant moments have been calculated with an order ranging from 3 to 10, being aware that higher orders would only increase the computation time by adding features representative of irrelevant details or noise [35].
Texture features -The texture features evaluated in this proposal were focused on fine textures. In particular, the histogram of the Local Binary Pattern (LBP), converted to a rotation invariant form, viz. LBP_ri [36], was extracted and used as the feature vector. Also, thirteen Haralick features extracted from the Gray Level Co-occurrence Matrix (GLCM) [37] converted into rotation invariant features, viz. HAR_ri (see [37]). More specifically, four kinds of GLCM were computed, all with d = 1 and θ = [0 • , 45 • , 90 • , 135 • ]. As for LBP, a map in the neighborhood characterized by r and n equal to 1 and 8, respectively, was calculated.
Color features -The color histogram (Hist) and color autocorrelogram (AC) features were extracted as color features. The former describes the overall color distribution in the image, from which seven statistical descriptors were computed: mean, standard deviation, smoothness, skewness, kurtosis, uniformity, and entropy. For ease of analysis and computations, they were calculated from images converted to grayscale. The latter is used to find the spatial correlation of identical colors by encoding their spatial color distribution [38]. More specifically, the probability of finding two pixels of the same color at a distance d is saved. Four distance values were used for our experiments: d = 1, 2, 3, 4. Once extracted, the four probability vectors were concatenated to generate a single feature vector.
Deep features -The term refers to the features of an image extracted from the deep layers of a CNN. The deep features were extracted from off-the-shelf CNN architectures pretrained on the well-known natural image dataset Ima-geNet [39]. Specifically, these features represent the activation values obtained from one of the most significant layers at the end of the network [40], [41]. In this study, according to the architecture, deep features were extracted: i) from the penultimate layer, ii) from the last fully connected layer or iii) from the last pooling layer to obtain features that represent the learned global knowledge of the network (see Table 2).
Note that the fine-tuning strategy for the classification phase was not considered to preserve the generalization ability of the networks [42], [43].

D. DEEP LEARNING STRATEGIES
CNNs are powerful architectures that are increasingly and successfully used to address a variety of image classification problems in different domains [27], [44], [45]. In addition, they transformed the manual design of feature extraction into an automated process, being able to take advantage of the activations of their layers to obtain the learned features.
In this study, different CNN architectures have been used to build classifier models to address the problem at hand and to perform the feature extraction task. Features extracted by the selected CNNs were used to train different models according to the ML algorithm described in Section II-E. In particular, regarding the end-to-end deep learning strategy, pretrained models' weights on ImageNet were considered. They were optimized by fine-tuning the pretrained models with the common practice of freezing all layers except the last (three) fully-connected layers from training. With this solution, transfer learning was adopted to reduce the problem of insufficient training data and avoid the risk of overfitting [46]. The CNN architectures used in this study (see Table 2) are briefly recalled hereinafter.
AlexNet -Proposed by Krizhevsky et al. [47], AlexNet consists of a cascade of convolutional and max-pooling layers, ended by 3 fully-connected layers. It is the shallowest among the considered architectures, containing only 5 convolutions layers.
ResNet -This name is used to denote a set of deep architectures based on residual learning [49], composed of skip-connections or recurrent units between blocks of convolutional and pooling layers. Furthermore, the blocks are followed by a batch normalization [55]. In this work, three versions of ResNet have been used, i.e., ResNet-18, ResNet-50, and ResNet-101 (note that the specified number represents the network depth).
GoogleNet -In 2014, Szegedy et al. [48] proposed the GoogLeNet architecture, which is based on blocks of inception layers. Each block is a set of convolution layers, while the filters used can vary from 1 × 1 to 5 × 5, thus allowing multi-scale learning. GoogLeNet uses global average pooling instead of max-pooling.
Inception-v3 -The Inception-v3 [50] architecture is also based on the concept of inception layers but improves GoogLeNet through the use of factorized, smaller, and asymmetric convolutions. The Inception models are famous for their multi-branch architectures, having a set of filters (1 × 1, 3 × 3, 5 × 5, and so forth) that are merged with concatenation in each branch.
Inception-ResNet-v2 -Devised as a combination of ResNet and Inception architectures [51], multiple-sized convolutional filters are combined with residual connections in the Inception-Resnet block. Inception-ResNet-v2 consists of 4 max-pooling and 160 convolutional layers.
DarkNet -Mostly based on the existing concepts of inception and batch normalization, one of the versions of DarkNet embeds 53 convolutional layers. This architecture is the backbone network for the object detection method You Only Look Once (YOLO) [52].
DenseNet -Proposed with L(L + 1)/2 connections [53], this architecture was introduced to overcome the fact that traditional CNNs had some layers L equal to the number of connections. For each layer, the outputs of all previous layers are used as input to the next layer. The number of filters used in each convolutional layer varies according to the growth rate parameter (viz. k). In this study, DenseNet-201 with k = 32 has been adopted.
EfficientNet -Proposed by Tan et al., EfficientNet [54] uses compound scaling to uniformly and efficiently scale width, depth, and resolution of the network. Eight versions of EfficientNet exist, from B0 to B7. In this study, B0 has been adopted.
k-NN -To categorize an observation, a k-NN classifier uses a local strategy that involves the k nearest neighbor training examples, together with a voting policy that yields a prediction starting from the selected neighbors. In this proposal, an image is classified according to the nearest image found in the training set, meaning that k is set to 1. Euclidean distance has been used as a distance measure. Note that with k = 1, no voting strategy is actually required.
SVM -SVMs are binary classifiers that categorize observations according to their position with respect to a hyperplane drawn in accordance with the so-called ''support vectors''. By default, an SVM is linear; however, when no hyperplane can properly discriminate between negative and positive observations, the original problem can be projected into a multidimensional space employing suitable kernel functions. In this proposal, a Gaussian radial basis function (RBF) has been used as the kernel. Note that multiclass problems can be dealt with SVMs using the One-vs-Rest (OvR) approach.
DT -Downstream of training, a DT can predict the category of observation by simply looking at the set of embedded if-then rules that have been inferred from training data. The deeper the tree, the more complex the decision rules.
RF -An RF is an ensemble of DTs. To ensure diversity among the DTs, ad-hoc rules are enforced during training. As a result, RFs are typically robust concerning the imbalance of data, allowing better control overfitting and better generalizations. In this work, the number of DTs has been set to 100.

F. PERFORMANCE MEASURES
The classification performance has been measured in terms of accuracy, precision, specificity, sensitivity, F1-score, and Matthews Correlation Coefficient (MCC). Straightforward definitions of these measures for binary classification problems are given hereinafter, followed by their generalizations for multiclass problems.

1) STANDARD DEFINITIONS FOR BINARY CLASSIFICATION PROBLEMS
An example (say e) is characterized by a pair i, t , where i is a list of feature values and t is the assigned category (i.e., target category). A dataset D is defined as a set of examples. When the number of target categories in D is 2, we face a binary problem. In this case, the categories found therein can be called negative and positive. To measure the performance of a binary classifier on a dataset D, each instance occurring therein will be labeled as negative or positive, depending on the classifier's output. According to the classification outcome and the actual target value, an instance will increase one of the following values: • true negatives (TN) -Number of instances belonging to the negative class that have been correctly predicted; • false positives (FP) -Number of instances belonging to the negative class that have been incorrectly predicted.
• false negatives (FN) -Number of instances belonging to the positive class that have been incorrectly predicted; • true positives (TP) -Number of instances belonging to the positive class that have been correctly predicted. According to these quantities, the above-mentioned measures can be described as follows: • Precision -Fraction of positive instances correctly classified among all instances classified as positive: • Sensitivity (or recall R) -Measures the ability of the classifier to predict the positive class against FN (also called true positive rate): • Specificity -Measures the ability of the classifier to predict the negative class against FP (also called true negative rate): • F1-score -Defined as the harmonic mean between precision and recall: • Accuracy -Ratio between the number of instances correctly classified and the total number of instances: Note that a different definition of accuracy may also be adopted when working with unbalanced datasets. It is called balanced or unbiased accuracy, which is defined as the mean between specificity and sensitivity. This alternative measure completely ignores the imbalance. 2 In symbols: In this work, balanced accuracy is used to measure the performance of classifiers, as some experimental settings generate an imbalance among the classes.
• MCC -MCC combines TN, FP, FN, and TP. Ranging from −1 to +1, MCC provides a high score only when the classifier shows a good performance in both categories: 2

) STANDARD DEFINITIONS FOR MULTICLASS CLASSIFICATION PROBLEMS
As pointed out, the cited measures can also be generalized to deal with multiclass classifiers. A straightforward strategy to meet this need is calculating the measures for each category using an OvR approach. Downstream of this process, the average value of each binary measure is calculated, thus yielding an informative value for the multiclass model. Three different averaging methods could be used -i.e., micro, macro and weighted. In this work, macro averaging has been adopted. In summarizing, for a classification problem on K classes, the measures with macro averaging are calculated as follows: • Macro Average Precision (with P k denoting per-class k precision): • Macro Average Sensitivity (with SEN k denoting perclass k sensitivity): • Macro Average Specificity (with SPE k denoting perclass k specificity): • Macro Accuracy (with ACC k denoting per-class k accuracy): • Macro Average F1-score (with P and R denoting macro average precision-recall and macro average recall): Multiclass MCC and BA were not calculated with macro averaging. Hence, their reformulation is separately provided. were grouped together to represent nonmature cheese (i.e., unripened). Images acquired on the 24th day of ripening represent the ideal ripeness (i.e., ideal). Too mature cheese is acquired on the 30th day of ripening (i.e., over-ripened).
• MCC directly takes into account all categories of a multiclass confusion matrix.
where c = K k C kk represents the total instances correctly predicted for each class k; s = K i K j C ij ; represents the total number of instances p k = K i C ki ; represents the total prediction for the class k t k = K i C ik . represents the times that class k truly occurred • Balanced Accuracy is calculated as follows: Note that SEN k denotes the per-class k sensitivity. Hence, multiclass BA coincides with the definition of macro sensitivity.

III. RESULTS
In this section, experimental results are presented. The experimental setup is described in Section III-A, whereas the results achieved for ML-and DL-based strategies are reported in Section III-B and in Section III-C.

A. EXPERIMENTAL SETUP
Experiments have been carried out on a workstation equipped with the following hardware: Intel(R) Core(TM) i9-8950HK @ 2.90GHz CPU, 32 GB RAM, and an NVIDIA GTX1050 Ti GPU with 4GB of memory. All the implementations and experimental evaluations have been realized with MATLAB R2021b. Initially, the preprocessed dataset has been used to build classifier models able to map a cheese image to one of the four ripening times (i. e., 18_Aged, 22_Aged, 24_Aged, and  30_Aged). Then, the class labels were reduced to three to better take into account the needs of the dairy company, for which the most relevant aspects were to identify whether a cheese form is not mature yet (days 18 and 22), mature (day 24), or too mature (day 30) (see Table 3). Finally, the selected ML and DL strategies have been tested for this new classification problem.   Image augmentation has also been applied before training the models to avoid overfitting. Specifically, the training set has been augmented by adding three images for each sample. The added images have been obtained by rotating the original ones with an angle of 90 • , 180 • , and 270 • . 5-fold cross-validation has been adopted as a testing strategy. This strategy repeatedly performs training and testing on the same dataset while ensuring the statistical reliability of the results. In particular, at each step, the dataset is split into 80% for training and 20% for the test set (see Table 4 and Table 5). As for CNNs, their fine-tuning has been performed using the hyperparameters reported in Table 6.

As introduced in Section II-E, five different classifiers have been used to conduct the ML-based experiments.
This section only reports the performance achieved by the best classifier models (i.e., the SVM with RBF kernel). The interested reader can find a complete report of the achieved performance for all the ML algorithms in Appendix . The results obtained with both the four-and three-classes classification tasks are reported in the following subsections.

1) RESULTS FOR THE FOUR-CLASSES CLASSIFICATION TASK
Apart from HM, models trained using the HC features achieved good performance with a low standard deviation for all invariant moments. As for SVM, the best performance has been achieved by training the model with LM features  (see Table 7). In fact, every category of moments produced satisfactory results with a limited amount of features (66 for LM, with order 10, and for both CH, with order 8), except for HM and ZM. The former, with only 7 features, proved insufficiently representative (F1 equal to 32.2%). The second, whose best results were obtained with an order of 7 and 9 repetitions, performed satisfactorily despite performance lower than those measured for LM and CH.
As for the textural features, HAR_ri achieved the best performance, even though it is very far from the results obtained with the invariant moments (in fact, F1 is only 48.6%, which makes the corresponding model hard to use). Finally, the best color feature was the autocorrelogram, with the same issues described for HAR_ri. In addition, HAR_ri and AC have the highest standard deviation, as well as HM. Table 8 reports the performance obtained by training the SVM with the features extracted by the CNNs. The best and second-best results are emphasized in bold. An SVM trained with features extracted by ResNet-101 and DenseNet201 achieved the best results. Although the SVM trained with the features extracted by ResNet-101 reached the highest absolute results, DenseNet201 gets similar (though slightly lower) results. However, it shows a significantly better standard deviation for each performance measure (about half that of ResNet-101). Finally, the number of features considered for the classification task is far higher with respect to the HC categories. In fact, ResNet-101 provided 2,048 features, and DenseNet201 1,920.

2) RESULTS FOR THE THREE-CLASSES CLASSIFICATION TASK
The results reported in Table 10 highlight a behavior consistent with that observed on the four-classes problem. In fact, the models trained using the LM features and CH features (both with order 10) obtained a performance above 90% for all assessed performance measures. More specifically, Legendre moments, with order 10, performed better, with F1 = 95.8%, MCC = 94.0%, and BA = 97.1%, with only 66 features. The models trained with HM and ZM features (best results with order 10 and repetitions 6) performed worse than the other invariant moments. Even with HAR_ri, they significantly worsen the results obtained on the four-classes task. Again, in this case, AC is the best color descriptor. However, it reached 51.5% of F1.
As for the results obtained by training the ML models with the deep features, the performance for this task is also relatively high, although in some cases, some performance measures have a significant drop (e.g., GoogLeNet and ResNet-18 obtained MCC of 77.7% and 75.5%, respectively). However, several architectures performed very well. Among all, the features extracted from ResNet-50 achieved the best performance (i.e., F1 = 98.0% and BA = 98.4%), with a considerably reduced standard deviation (below 2.3% for all measures, but MCC). However, AlexNet gets the second-best performance in terms of accuracy, precision, F1, and MCC, whereas ResNet-101 in terms of sensitivity, specificity, and balanced accuracy. Both best and second-best are emphasized in bold.

C. DEEP LEARNING BASED CLASSIFICATION
Section II-D introduced the CNN architectures investigated in this work. This section reports the results (see Table 11) obtained with a fine-tuning strategy adopted on the dataset under this study.

a: RESULTS FOR THE FOUR-CLASSES TASK
In this task, in which the deep features are used for training, the models perform well, with only a few exceptions. Although all CNN architectures achieved a good performance, three of them (i.e., Inception-ResNetv2, DarkNet-53, and ResNet-18) outperformed the others. Specifically, Inception-ResNetv2 and DarkNet-53 achieved the best accuracy, precision, and specificity, with the former showing a lower standard deviation for the three measures. DarkNet-53 achieved better sensitivity, MCC, and balanced accuracy, while Inception-ResNetv2 obtained the best F1 scores. In general, the performance measures of ResNet-18 were slightly lower than those obtained with Inception-ResNetv2 and DarkNet-53. However, a lower standard deviation has been obtained on all measures.
Supported by the results, ResNet-18 can be a potentially robust solution to the problem.

b: RESULTS FOR THE 3-CLASS TASK
Regarding the results obtained in this task, what immediately emerges is the very high variability among the various folds on which the CNN architectures were tested. The standard deviations have considerably high values. For example, the best architecture, Inception-v3, produced an F1 of 74.3%, with a standard deviation of 23.4%. The results of the remaining networks confirm the same behavior.

IV. DISCUSSION
The experiments showed a different behavior of the trained classifiers for the three-and four-classes tasks. In particular, the results on the four-classes task tend to be similar to those obtained for the three-classes task when approached with ML-based strategies. In fact, considering HC features, invariant moments alone allow for considerably high performance with low standard deviation. Moreover, although no single descriptor emerges as the absolute best for both tasks, it is important to point out how Legendre and Chebychev moments (both types) achieved high performance, very close to the absolute best regardless of the order examined. This aspect makes this category highly versatile, and representative of the classes of the problem studied.
The same trend for both classification tasks can be observed when features extracted by CNNs are used to train the SVM. As a general rule, the features extracted from ResNet-101 seem feasible to represent the classes of this study. More specifically, ResNet-101 features are the absolute best referring to the four-classes task, while they are the second-best in the three-classes one. However, in the last scenario, they permit comparable performance to the best features, i.e., ResNet-50. In any case, a residual network architecture may be suitable to address the problem.
Summing up the ML approach results, as seen from the performance measures presented in Appendix , the SVM is  the most suitable classifier when trained with HC features. The same is true for classifiers trained with features extracted from pretrained CNNs. All of the remaining four classification models can be within the performance of the SVM trained with features extracted from residual networks for both tasks.
In contrast, the results obtained with CNNs in classification significantly differ between the two tasks analyzed. In fact, as expressed above, the results of the CNNs on the four classes tend to be high, albeit with fairly high standard deviations. However, this issue does not plague ResNet-18, which has the best results from a standard deviation perspective and the second-best absolute results on various measures.
The scenario noticeably changes when analyzing the CNNs results on the three classes. In fact, unlike the previous task, standard deviations are very high, making this approach unfeasible with this task and leaving open questions and room for further investigations regarding the significance of the four different classes, how to address the problem derived from the fusion of classes 18_Aged and 22_Aged, and how the conditions of acquisition and subsequent preprocessing may have affected the data processed by CNNs.
Some relevant trends emerge, shifting attention to the overall results obtained by the classifiers used. First, as evidenced by Table 13 and Table 14, the best-performing classifier is the SVM in both configurations. Specifically, in the three-class configuration, the SVM obtains an F1 approaching 95% with both handcrafted and deep features, exceeding by two percentage points the F1 obtained by the best CNN, Inception-ResNetv2. In the four-class-class configuration, the SVM obtains 98.0% with features extracted from ResNet-50, and and 95.8% with Legendre moments, exceeding by more than 20 percentage points the best result obtained by the best CNN, ResNet-101 in this case. The Random Forest classifier, trained with the features extracted from DenseNet201 in the first configuration and with both types of features in the second configuration, also achieved comparable and better results than CNN. Second, Legendre moments proved to be the best handcrafted features for the SVM. In fact, in the first configuration, the performance is comparable. In one case, features extracted from ResNet-101 achieve 94.9% of F1, while Legendre's moments obtain 94.8%. However, the difference in the number of features is important. In the former case, ResNet-101 brings in 2,048 features, while Legendre's moments are just 66, with a significant computational advantage. In the second configuration, the difference is slightly more significant. Specifically, the features extracted from ResNet-50 allow an F1 of 98% with 2,048 features, while Legendre's moments stop at 95.8%, but with just 66 features. This demonstrates the high representative power of the descriptors examined, especially Legendre's moments, for the task.
The proposed method has also been compared with other solutions offered in the literature. Nevertheless, a direct comparison with the methods proposed in the literature cannot be implemented. In fact, to the best of our knowledge, the presented approach is the first designed to evaluate cheese ripening by analyzing images acquired through a photo camera. In contrast, other methods analyze chemical or spectroscopic parameters. Moreover, all the methods proposed in the literature focus on different types of cheese and ripening times.
In any case, some works have reported quantitative performance. More specifically, Del Campo et al. [8] proposed a PCA-based model that discriminates among four ripening stages of Emmental cheeses, using mid-infrared spectroscopy. This model achieved 87% cross-validation accuracy on 14 samples, and 57% test set accuracy on 14 samples. The key difference with our work is that the characterization of ripening was analyzed over a longer time frame. Also, Soto-Barajas et al. [9] proposed an ANN trained with data reporting the fatty acid composition and the NIR spectra of sixteen milk mixtures, obtaining 50% accuracy in discriminating among the types and 100% in classifying unknown cheese samples for the ripening time.
Despite the similarities that exist with other proposals, this work should be considered innovative, also for its ability to reach 98.0% classification accuracy in discriminating between the different phases of ripeness. Being non-invasive and end-to-end automatable, it is also useful for identifying problems that may arise within the cheese storage, for example, an inappropriate temperature.
As for future work, different margins of improvement are left. First, only the top layers of the CNNs were trained to take full advantage of the pretrained models and avoid overfitting issues related to the reduced size of the acquired dataset. VOLUME 10, 2022 TABLE 13. Best performance obtained with all the classifiers involved on the three-classes task. The table reports the performance in terms of accuracy, precision, sensitivity, specificity, F1 score, Matthew's correlation coefficient, and balanced accuracy. For each classifier, results obtained with the best handcrafted feature are shown on the top row. The result produced by using the features extracted from the best CNN are reported on the bottom row. The last row shows the best result produced by CNNs.

TABLE 14.
Best performance obtained with all the classifiers involved on the four-classes task. The table reports the performance in terms of accuracy, precision, sensitivity, specificity, F1 score, Matthew's correlation coefficient, and balanced accuracy. For each classifier, results obtained with the best handcrafted feature are shown on the top row. The result produced by using the features extracted from the best CNN are reported on the bottom row. The last row shows the best result produced by CNNs.
Although the results were inferior to those obtained with ML classifiers, we plan to investigate targeted training on different layers of the CNNs, possibly even from scratch. Second, combining some of the extracted heterogeneous features to train the classifier models may further benefit the classifiers investigated so far. Third, a most informed feature selection strategy could greatly help the generalization process for most of the selected ML techniques. Fourth, adopting Generative Adversarial Networks (GANs) to extend the dataset could expand the dataset's variance and allow the proposed methods to be hardened against differences in texture, illumination, shape, etc. Fifth, another factor that can be taken into account is how much the background of the images may affect both feature extraction and the performance of the classification algorithms. In this way, the system could be tuned to use the most appropriate one for the task at hand.

V. CONCLUSION
The proposed study addressed the issue of implementing a novel non-invasive methodology aimed at monitoring the cheese ripening process in order to automatically detect the maturation status. A novel methodology has been presented, which synergistically makes use of CV and ML to identify the cheese maturation status by analyzing images acquired with an ordinary photo camera. This methodology has been applied to specific Pecorino soft cheese produced by a Sardinia dairy company.
A dataset consisting of images of cheeses with an ideal ripeness, in contrast with other images representing unripened and over-ripened cheeses, has been built to test the proposed methodology. Different classifier models have been built, based on ML and DL strategies. These models allowed us to perform a comparative study on i) heterogeneous handcrafted descriptors, ii) deep features extracted by different pre-trained CNNs, iii) classification performance of DL and ML algorithms. The best classification performance has been obtained with models trained using ML strategies, with particular reference to the SVM classifier.
As it can be easily applied to inspect an entire batch of cheese, the proposed methodology can be beneficial to abandoning the habit of sampling cheese and guessing the status of the rest of the cheese forms. For its non-invasiveness and considering that learned models can produce a result with a time order of milliseconds, it lends itself to easy integration into industrial production. Moreover, being related only to image analysis, it could be applied to different types of cheese with an easy customization process.
Taking into account what has been said so far and expressed in Section IV, we consider the proposed approach suitable for deployment in real-time systems, thanks to the short inference times. In fact, the proposed method can be   successfully used within the context of a dairy industry that needs to keep automated, near real-time checks on the   ripening status of cheeses in storage. In addition, it applies to monitoring the storage situation in order to prevent or VOLUME 10, 2022  verify potential issues, such as maintaining an appropriate temperature for ripening.
However, several areas for improvement remain, ranging from combining heterogeneous features to adopting feature selection techniques. In addition, studying the influence of the background of photos or their illumination could make an important contribution to improving the robustness of the system.
Finally, as the results obtained by the CNNs with the three-classes setup have left open questions, it must be considered that artificial intelligence applications are requested to offer a high level of accountability and transparency. Therefore, explanations for algorithm decisions and predictions can be explored to justify their reliability and provide high interpretability for the end users. Tables 15-22. CECILIA DI RUBERTO received the M.S. degree in computer science from the University of Salerno, Italy, in 1990, and the Ph.D. degree in computer science from the University of Naples, Italy, in 1995. She is currently an Associate Professor in computer science at the Department of Mathematics and Computer Science, University of Cagliari, Italy. Her research interests include computer vision, image retrieval, medical image analysis, pattern recognition, and machine learning. She has been working in microscopic image analysis, in particular in blood smear image analysis for cell counting, malaria parasites detection and classification, and leukemia detection. She is the author of more than 100 scientific papers in peer-reviewed journals and international conference proceedings.

See
GIULIANO ARMANO is currently an Associate Professor in computer science at the Department of Mathematics and Computer Science (DMI), University of Cagliari. With a background on expert systems and machine learning, over the years, he has investigated various kinds of learning strategies and techniques. His current research interests include performance measures for classifier systems, artificial neural networks, complex networks, and classifier ensembles. These topics have been extensively experimented on various application fields, in particular bioinformatics, information retrieval, and text categorization. He is currently teaching ''programming languages for mathematics'' and ''laboratory of big data.'' Previously, his teaching activities include ''elements of bioinformatics,'' ''object-oriented programming languages,'' and ''software engineering.''