SQ-Swin: Siamese Quadratic Swin Transformer for Lettuce Browning Prediction

Enzymatic browning is a major quality defect of packaged "ready-to-eat" fresh-cut lettuce salads. While there have been many research and breeding efforts to counter this problem, progress is hindered by the lack of a technology that identifies and quantifies browning rapidly, objectively, and reliably. Here, we report a deep learning model for lettuce browning score prediction. To the best of our knowledge, it is the first deep learning model for lettuce browning prediction, built on a Siamese Quadratic Swin (SQ-Swin) transformer with several highlights. First, our model includes quadratic features in the transformer, which are more powerful at capturing real-world representations than linear features. Second, a multi-scale training strategy is employed to augment the data and exploit the inherent self-similarity of the lettuce images. Third, the proposed model uses a siamese architecture that learns the inter-relations among the limited training samples. Fourth, the model is pretrained on ImageNet and then trained with the reptile meta-learning algorithm, which exploits higher-order gradients beyond regular training. Experimental results on the fresh-cut lettuce dataset show that the proposed SQ-Swin outperforms traditional methods and other deep learning-based backbones.


I. INTRODUCTION
Lettuce (Lactuca sativa L.) is an important vegetable crop worldwide and a common ingredient in packaged 'ready-to-eat' salads. However, browning discoloration on the cut edges of the leaf ribs significantly reduces vegetable quality and shortens shelf-life [1], [2]. Many studies have shown that the browning of fresh-cut lettuce is caused by the enzymatic oxidation of phenolic metabolites [3]. The tissue injuries sustained during cutting disrupt the natural compartmentation of polyphenol oxidase and phenolics, thus accelerating the browning process [4], [5].
Developing technologies to accurately identify, quantify, and predict browning is critical, yet achieving both reliability and efficiency is challenging. Commonly used colorimeters are impractical because the cut edges are much narrower than the smallest available probes. Visual evaluation approaches are not only time-consuming but also subjective. Digital imaging followed by feature extraction in RGB or Lab color space improves objectivity and speed over the traditional visual method. However, the multiple values required to depict the discoloration make data comparison among lettuce cultivars impractical.
Compared to traditional learning models, deep learning models can be trained end-to-end to learn high-dimensional features incrementally, thereby eliminating the need for domain expertise and manual feature extraction. Owing to these advantages, a plethora of deep learning algorithms have been developed in many science and engineering fields [6], [7], [8], [9], [10], [11], [12]. Very recently, the transformer has been featured as the most popular backbone architecture [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. Dosovitskiy et al. first introduced the transformer to the computer vision (CV) field by mapping 16×16 image patches into a word sequence [23]. Based on this first-of-its-kind vision transformer, Yuan et al. introduced a Token-to-Token process to further enrich the token selection from the image. Wang et al. extended the Token-to-Token method to low-dose CT denoising and obtained more perceptually pleasing CT images [24]. Deng et al. introduced distillation to the transformer by taking a convnet as a teacher and the transformer as a student, thus achieving performance comparable to convnets on ImageNet [25]. Han et al. designed a transformer that considers the attention inside the image patches. Liu et al. further employed hierarchical shifted-window attention on the vision transformer and achieved state-of-the-art performance in image classification, object detection, and semantic segmentation [26].
Transformer models have proved their success in various computer vision tasks. However, the input tokens fed into their inherent self-attention modules are mapped from the image features using simple linear neurons, whereas biological neurons are far more complex and diverse. The quadratic neuron is one such alternative, featuring a quadratic mapping across neural layers. Research has shown that quadratic neurons encode richer features than linear neurons [27], [28], [29], [30]. Therefore, we are motivated, for the first time, to introduce the quadratic neuron to the transformer and prototype a high-order transformer based on Swin transformer modules. Naturally, we are curious whether the second-order SQ-Swin can deliver competitive performance in lettuce browning prediction, hence transforming a theoretical deep learning model into a practical deep learning application.
To the best of our knowledge, the contributions of this paper are fourfold: i) The first deep learning model is developed for lettuce browning prediction. ii) A quadratic transformer model is proposed for the first time to empower the token-mapping neurons with stronger representation ability. iii) A siamese architecture is introduced to the quadratic transformer to learn the inter-relations among the limited samples, and the reptile meta-learning algorithm is adopted to train the proposed model. iv) With the merits of transfer learning, the SQ-Swin is pretrained on ImageNet, then trained and evaluated on the fresh-cut lettuce data. Comprehensive experiments demonstrate the high efficacy and efficiency of the SQ-Swin on the lettuce browning prediction problem.

II. RELATED WORKS
A. QUADRATIC NEURON
The mainstream deep neural network uses linear neurons for feature inference. However, in the real world, biological neurons such as the nerve cells in the human brain are fundamentally more diverse and complex. Fan et al. prototyped a quadratic neural network by replacing the inner product in the traditional artificial neuron with a quadratic operation [27]. The quadratic neuron is the first high-order neuron. It has been explored in a plethora of studies and has shown advantages over the traditional linear neuron in representation power and efficiency [27], [28], [29], [30].
Mathematically, given an input x ∈ R^n, a quadratic neuron q is characterized as

q(x) = σ((w_a^T x + b_a)(w_b^T x + b_b) + w_c^T (x ⊙ x) + b_c),

where w_a, w_b, w_c ∈ R^n are weight vectors, b_a, b_b, b_c are biases, ⊙ denotes the element-wise product, and σ is the activation function. To stabilize training, the quadratic terms are initialized so that the neuron starts as a linear one (w_b = 0, b_b = 1, w_c = 0), assisted by a shrunk small learning rate to prevent magnitude explosion [31].
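A minimal PyTorch sketch of such a quadratic layer may clarify the mapping; the layer names and the linear-start initialization below are our illustrative reading of the formulation in [27], [31], not the exact training code:

```python
import torch
import torch.nn as nn

class QuadraticLayer(nn.Module):
    """Sketch of a quadratic layer after Fan et al. [27]:
    q(x) = (x W_a + b_a) * (x W_b + b_b) + (x * x) W_c + b_c,
    with the product taken element-wise over the output features."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.lin_a = nn.Linear(in_features, out_features)
        self.lin_b = nn.Linear(in_features, out_features)
        self.lin_c = nn.Linear(in_features, out_features)
        # ReLinear-style initialization [31]: the layer starts purely linear
        nn.init.zeros_(self.lin_b.weight)
        nn.init.ones_(self.lin_b.bias)
        nn.init.zeros_(self.lin_c.weight)
        nn.init.zeros_(self.lin_c.bias)

    def forward(self, x):
        return self.lin_a(x) * self.lin_b(x) + self.lin_c(x * x)

layer = QuadraticLayer(8, 4)
out = layer(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 4])
```

With this initialization the layer's output equals its linear branch, so training can begin from a well-behaved linear regime before the quadratic terms are released.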

B. SIAMESE MODEL
The siamese model was proposed to address the few-shot learning problem by learning the inter-relations among images [32], and it has been explored in various applications [33], [34], [35], [36]. Bertinetto et al. adopted a fully-convolutional siamese design in video object detection and achieved high performance on many benchmarks [33]. Guo et al. proposed a simple yet effective siamese fully convolutional classification and regression network for visual tracking and achieved leading performance at real-time speed [37]. Chen et al. developed a siamese transformer to extract non-local features from multitemporal image pairs and demonstrated the model's potential in multitemporal remote sensing interpretation tasks [38].

III. METHODS
The lettuce browning score prediction task can be taken as a typical regression problem in computer vision. Given data (x, y), a general deep regression model aims to minimize the mean square error (MSE) between the predictions and the labels:

L(θ) = (1/N) Σ_{i=1}^{N} ||f(θ; x_i) − y_i||^2,

where f(θ; x) is the model, θ is the collection of model parameters, and N is the number of samples.
To further explore the inter-relations among the limited data, a siamese design is introduced to the proposed SQ-Swin transformer. Given data pairs (x_0, y_0) and (x_1, y_1), suppose the two base quadratic Swin transformers generate features and predictions (f_0, ŷ_0) and (f_1, ŷ_1), respectively. Two losses are employed in the SQ-Swin model during optimization: a prediction loss and a siamese loss. The prediction loss penalizes the margin between the predictions and the true labels. The siamese loss is based on the physical prior that a linear combination of specific features determines the lettuce browning level; therefore, the difference between the features should indicate the difference between the labels. Specifically, the prediction loss is defined as

L_pred = ||ŷ_0 − y_0||^2 + ||ŷ_1 − y_1||^2.

The siamese loss is defined as

L_siam = (||f_0 − f_1|| − |y_0 − y_1|)^2.

The total loss is a combination of the two losses with balance parameters α and β,

L = α L_pred + β L_siam,

where α and β are set to 1 and 0.5, respectively.
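The two-branch loss described above can be sketched as a single function; the exact functional form of the siamese term below (tying the feature-distance to the label gap) is our hedged reconstruction, not a quote of the paper's equation:

```python
import torch

def sqswin_losses(f0, f1, y0_hat, y1_hat, y0, y1, alpha=1.0, beta=0.5):
    """Sketch of the SQ-Swin training objective: an MSE prediction loss
    over both siamese branches, plus a siamese loss that encourages the
    feature difference to track the label difference (form assumed)."""
    pred_loss = ((y0_hat - y0) ** 2).mean() + ((y1_hat - y1) ** 2).mean()
    feat_dist = (f0 - f1).norm(dim=-1)   # distance between branch features
    label_dist = (y0 - y1).abs()         # gap between browning scores
    siam_loss = ((feat_dist - label_dist) ** 2).mean()
    return alpha * pred_loss + beta * siam_loss

# Identical features and identical labels give zero total loss.
f = torch.ones(2, 3)
z = torch.zeros(2)
print(sqswin_losses(f, f, z, z, z, z).item())  # 0.0
```

When the two branches see lettuce pieces with similar browning scores, the siamese term pushes their feature embeddings together; dissimilar scores push the embeddings apart.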
With the introduction of the siamese design, the model's learning process is explicitly engineered to differentiate browning scores. The integration of this prior ensures that the learned features are closely correlated with the actual browning state of lettuce, leading to more accurate and representative feature embeddings.

A. MODEL ARCHITECTURE
This paper proposes a siamese quadratic Swin transformer (SQ-Swin) for lettuce browning prediction. As shown in Fig. 1, the multi-scale patch extraction stage first generates scaled image patches for the subsequent training process. Then, image patches are randomly paired and fed into the SQ-Swin model. Specifically, the SQ-Swin takes the siamese design as its base architecture, which includes two routes of the quadratic Swin transformer (Q-Swin) for feature inference, where the two Q-Swins share identical designs and parameters. Finally, two multilayer perceptrons (MLPs) are attached on top of the Q-Swin for feature extraction and score prediction, respectively.

1) MULTI-SCALE PATCH EXTRACTION
Multi-scale visual patterns are essential in semantic segmentation, fine-grained classification, and object detection [39], [40], [41], [42], [43]. Inspired by these works, we seek to exploit the multi-scale information in the lettuce images. In contrast to common methods that integrate multi-scale features in the network design, we derive the multi-scalability directly from the images by taking advantage of their self-similarities. Specifically, training patches of different sizes are extracted from the large raw lettuce image and then resized to a uniform shape before being fed to the models. By introducing these augmented images alongside the original dataset, the model becomes equipped to handle the scale variations frequently encountered in real-world scenarios, where a lettuce may be photographed from different distances and angles. This strategy endows the model with robustness to scale variations, ensuring it can detect and predict browning irrespective of the scale at which the lettuce is photographed.
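The extract-then-resize procedure can be sketched as follows; the patch counts and the particular crop scales are illustrative choices of ours, not the paper's settings:

```python
import random
import torch
import torch.nn.functional as F

def multiscale_patches(image, n_patches=8, scales=(224, 448, 896), out_size=224):
    """Sketch of multi-scale patch extraction: crop square patches of
    several sizes at random positions from the raw image, then resize
    each to one uniform shape for the network."""
    _, h, w = image.shape
    patches = []
    for _ in range(n_patches):
        # only scales that fit inside the image are eligible
        s = random.choice([sc for sc in scales if sc <= min(h, w)])
        top = random.randint(0, h - s)
        left = random.randint(0, w - s)
        crop = image[:, top:top + s, left:left + s]
        crop = F.interpolate(crop.unsqueeze(0), size=(out_size, out_size),
                             mode="bilinear", align_corners=False)
        patches.append(crop.squeeze(0))
    return torch.stack(patches)  # (n_patches, C, out_size, out_size)

patches = multiscale_patches(torch.randn(3, 2048, 3072), n_patches=6)
print(patches.shape)  # torch.Size([6, 3, 224, 224])
```

Because a crop of size 896 and a crop of size 224 both land on the network as 224×224 inputs, the model repeatedly sees the same lettuce structures at different effective magnifications.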

2) SQ-SWIN TRANSFORMER
Recently, the Swin transformer has been very popular as a strong backbone architecture [26]. We are motivated to extend this work and apply it to the proposed model, referred to as SQ-Swin. Specifically, it incorporates hierarchical attention with shifted windows, as shown in Fig. 2. Given an input image, the patch partition process first splits the image into non-overlapping patches. Then, a linear embedding module maps these patches into tokens fed to the transformer blocks. Several attention blocks are applied for feature inference. To reduce computational complexity and facilitate hierarchical representation, patch merging layers between successive transformer blocks reduce the feature map size by half. Based on the design of the Swin transformer, the proposed quadratic Swin transformer further employs a quadratic transformer block for window attention rather than a regular linear transformer block.
Quadratic Swin Transformer Block: Instead of calculating the attention globally, where the relations among all the tokens from all the patches are computed, the Swin transformer calculates attention within a certain window, as shown in Fig. 2. In a transformer block, a given token sequence T ∈ R^{b×n×d} within a window is first mapped to three matrices of query, key, and value, Q, K, V ∈ R^{b×n×d}. Here b is the batch size, n is the number of tokens, and d is the token embedding dimension. The output of the multi-head attention (MHA) is calculated as

MHA(T) = softmax(QK^T / λ)V,

where λ is a scaling factor determined by the network depth. Instead of using a traditional linear multilayer perceptron (MLP) to calculate Q, K, and V, the proposed quadratic transformer adopts a quadratic multilayer perceptron (QMLP) to obtain the three matrices,

Q = (TW_a^q) ⊙ (TW_b^q) + (T ⊙ T)W_c^q,
K = (TW_a^k) ⊙ (TW_b^k) + (T ⊙ T)W_c^k,
V = (TW_a^v) ⊙ (TW_b^v) + (T ⊙ T)W_c^v,

where the W terms are linear operators. The biases are omitted for simplicity. After MHA, layer normalization (LN) and shortcuts are also applied for further feature inference. As shown in Fig. 1, the quadratic Swin transformer block is characterized as

T′ = LN(MHA(T)) + T,
T_q = LN(QMLP(T′)) + T′,

where T′, T_q ∈ R^{b×n×d}. Hence, the quadratic Swin transformer block focuses on quadratic feature inference, which has been shown to be more powerful in encompassing real-world representations [27], [31]. The introduction of quadratic features allows the SQ-Swin to capture complex, non-linear relationships, a vital capability for modeling the intricate factors contributing to lettuce browning.
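The quadratic token mapping and the within-window attention can be sketched in a few lines; fusing the three projections into one module and the single-head attention are simplifications of ours:

```python
import torch
import torch.nn as nn

class QMLP(nn.Module):
    """Sketch of a quadratic projection producing Q, K, V: each matrix is
    obtained by a quadratic rather than linear mapping of the tokens
    (our illustrative reading of the QMLP)."""
    def __init__(self, dim):
        super().__init__()
        self.wa = nn.Linear(dim, 3 * dim)
        self.wb = nn.Linear(dim, 3 * dim)
        self.wc = nn.Linear(dim, 3 * dim)

    def forward(self, tokens):                     # tokens: (b, n, d)
        qkv = self.wa(tokens) * self.wb(tokens) + self.wc(tokens * tokens)
        return qkv.chunk(3, dim=-1)                # Q, K, V, each (b, n, d)

def window_attention(q, k, v):
    """Standard scaled dot-product attention within one window."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

qkv = QMLP(32)(torch.randn(2, 49, 32))   # a 7x7 window of tokens
out = window_attention(*qkv)
print(out.shape)  # torch.Size([2, 49, 32])
```

Only the token-to-Q/K/V mapping changes relative to the linear Swin block; the attention computation itself is untouched, which is why the extra cost stays modest.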

B. MODEL TRAINING
The SQ-Swin is trained in two steps: the first step is pretraining on ImageNet data, and the second is fine-tuning on lettuce images with reptile meta-learning. ImageNet is a vast and diverse dataset comprising a wide range of images covering various objects and concepts. It has been widely explored as a pretraining dataset for various downstream tasks [44], [45], [46]. Pretraining on such a comprehensive dataset offers numerous advantages for lettuce browning prediction. First, it facilitates feature extraction, allowing the model to capture valuable visual patterns, shapes, and textures essential for recognizing discoloration and browning in lettuce. Second, it dramatically reduces training time and computational resources, as the model has already acquired foundational knowledge from ImageNet. Third, the pretrained model exhibits improved generalizability, making it more robust against variations in lighting conditions, backgrounds, and other factors that may affect lettuce appearance. Moreover, it improves data efficiency, especially when labeled data for the specific task are limited, as the model can adapt its learned features to the downstream task.
After pretraining, the SQ-Swin can be trained regularly like common deep learning models using optimization algorithms such as Adam or stochastic gradient descent (SGD). However, since limited data are available in the lettuce browning prediction task, the model is easily susceptible to over-fitting. To alleviate this problem, we adopt reptile meta-learning, a typical first-order gradient-based meta-learning algorithm for few-shot learning problems [47], in the training process. Specifically, consider a training batch x_b in a step with model parameters φ. A traditional optimizing algorithm such as Adam or SGD simply updates the parameters as

φ′ = U(φ; x_b),

where φ′ denotes the updated parameters and U is the optimizing algorithm. As shown in Fig. 3, reptile instead samples different tasks τ on x_b, and the parameters are updated as

φ ← φ + η (U_τ^k(φ) − φ),

where k stands for the number of steps of the optimizing operator U (e.g., Adam or SGD), and η denotes the updating step size for the new gradients. If k = 1, reptile reduces to the base optimizer. However, when k > 1, the update no longer resembles taking a single gradient step in the base optimizer: it includes terms from second- and higher-order gradients. Thus, reptile converges to a minimum that is very different from that of the base optimizer and fits better on few-shot learning problems.
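One reptile meta-update can be sketched as below; the task sampling, inner loss, and hyperparameters are illustrative stand-ins, not the paper's exact configuration:

```python
import copy
import torch

def reptile_step(model, tasks, inner_steps=4, inner_lr=1e-2, meta_lr=0.6):
    """Sketch of one reptile meta-update: for each sampled task, run k
    inner SGD steps from the current weights, then move the weights toward
    the average of the task-adapted weights."""
    init = copy.deepcopy(model.state_dict())
    deltas = {name: torch.zeros_like(p) for name, p in init.items()}
    for x, y in tasks:
        model.load_state_dict(init)                 # restart from phi
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                # k steps of inner U
            opt.zero_grad()
            ((model(x) - y) ** 2).mean().backward()
            opt.step()
        with torch.no_grad():
            for name, p in model.state_dict().items():
                deltas[name] += (p - init[name]) / len(tasks)
    # phi <- phi + eta * mean_tau( U^k(phi) - phi )
    with torch.no_grad():
        model.load_state_dict({n: init[n] + meta_lr * deltas[n] for n in init})
```

With `inner_steps=1` the update collapses to a scaled base-optimizer step; with more inner steps the averaged displacement carries the higher-order information discussed above.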

FIGURE 3. The update of the model parameters φ using reptile.
In combination, the two-step training approach can potentially boost the model's regression capacity by harnessing the power of transfer learning from a large, diverse dataset and fine-tuning on a specialized dataset with the benefits of meta-learning. This approach not only accelerates training but also enhances the model's generalizability and performance.

IV. EXPERIMENTS AND RESULTS
This section presents the dataset preparation, the experimental details, and the regression results. The SQ-Swin is implemented on the fresh-cut lettuce dataset collected by our group. The proposed model is quantitatively compared with other state-of-the-art models. Moreover, the interpretability of the proposed SQ-Swin is examined for more reliable application.

A. DATASET PREPARATION
The dataset studied in this paper was created from scratch by our group. Specifically, lettuce samples were grown, harvested, and processed according to Teng et al. [48] and Peng et al. [49]. Digital images of the cut lettuce were acquired inside a shooting tent with controlled lighting and shooting distance using a Nikon D1X camera. The browning level of each image was then determined independently by three trained evaluators, and the label of each image is the average of the three scores. There are 200 whole lettuce images of 7360 × 4912 pixels in total with different browning levels, of which a random 80% are used for training and 20% for testing. Experimental results are averaged over five-fold cross-validation. The browning scores range continuously from 0 to 3.0, corresponding to the browning levels indicated in Table 1.

B. EXPERIMENT SETTINGS
All experiments are conducted on Ubuntu 18.04.5 LTS with an Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz, PyTorch 1.7.1 [51], and CUDA 10.2.0, using four NVIDIA 2080TI 11G GPUs. Below are the details of the experiment settings:
• We adopt the default version with {2, 2, 6, 2} layers in the four stages for the numbers of layers of the base Swin transformer modules. Between each stage, there is a downsampling and window-shift operation to facilitate diverse representations.
• The window size for each Q-Swin transformer stage is 7, and the shift is 1/4 of the corresponding feature map.
• After pretraining on ImageNet, an MLP of size 100 is attached on top of the SQ-Swin for feature extraction. Then, a regression head is appended for browning level prediction.
• For reptile meta-learning, the mini-batch size for each sampled task is 32, and the inner epoch is 4 with a learning rate of 0.6.
• In the meta step of the reptile, the model is trained using Adam through 200 epochs with a batch size of 256.Except for the quadratic terms, all other parameters are optimized with an initial learning rate of 1e-4 and scheduled learning rate decay of 0.2 at epochs 100 and 150.
• The training of the quadratic terms utilizes the ReLinear strategy. The quadratic terms are initially set untrainable and begin training at epoch 50. The initial learning rate of the quadratic terms is 1e-6, and it is then decreased to 2e-7 and 4e-8 at epochs 100 and 150, respectively.
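The two learning-rate regimes in the settings above (regular parameters at 1e-4, quadratic terms much smaller and initially frozen) can be sketched with PyTorch parameter groups; tagging quadratic parameters by a "quad" substring in their names is an assumption of ours:

```python
import torch

def build_optimizer(model, base_lr=1e-4, quad_lr=1e-6):
    """Sketch of the two-rate setup: one Adam parameter group for regular
    weights, one (much slower) group for the quadratic terms."""
    quad, regular = [], []
    for name, p in model.named_parameters():
        (quad if "quad" in name else regular).append(p)
    return torch.optim.Adam([
        {"params": regular, "lr": base_lr},
        {"params": quad, "lr": quad_lr},
    ])

def freeze_quadratic(model, frozen=True):
    """Keep quadratic terms untrainable (e.g., until epoch 50)."""
    for name, p in model.named_parameters():
        if "quad" in name:
            p.requires_grad_(not frozen)

class ToyQuadModel(torch.nn.Module):
    """Stand-in with one regular and one 'quadratic' parameter tensor."""
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)
        self.quad_c = torch.nn.Linear(4, 4)

model = ToyQuadModel()
opt = build_optimizer(model)
freeze_quadratic(model)   # frozen during the first training phase
print(opt.param_groups[0]["lr"], opt.param_groups[1]["lr"])  # 0.0001 1e-06
```

The scheduled decays at epochs 100 and 150 would then be applied per parameter group, e.g. by scaling each group's `lr` inside the training loop.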

C. MODEL EVALUATION
Three metrics: mean absolute error (MAE), mean square error (MSE), and Pearson's correlation coefficient (PCC) [52] are employed to measure the prediction performance quantitatively. Given the true labels y and the model predictions ŷ, MAE is expressed as

MAE(y, ŷ) = (1/N) ||y − ŷ||_1,

which gives an intuitive gap between the prediction and the truth and eases the interpretation of the result. Furthermore, to better evaluate the performance on outlier data, MSE is also adopted,

MSE(y, ŷ) = (1/N) ||y − ŷ||^2,

since it penalizes large prediction errors more heavily than small ones, thus magnifying the mean error. Next, PCC is employed to measure the association, or statistical relationship, between the prediction and the truth to make the prediction more reliable. Using covariance in its formulation, it informs us of the linear correlation between the two variables.
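The three metrics can be computed in their standard forms as follows (a straightforward reference implementation, with made-up example scores):

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Compute MAE, MSE, and Pearson's correlation coefficient."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mae = np.mean(np.abs(y - y_hat))          # average absolute gap
    mse = np.mean((y - y_hat) ** 2)           # penalizes outliers more
    pcc = np.corrcoef(y, y_hat)[0, 1]         # Pearson's r
    return mae, mse, pcc

mae, mse, pcc = regression_metrics([0.0, 1.0, 2.0], [0.1, 1.1, 1.9])
print(round(mae, 3), round(mse, 3), round(pcc, 3))  # 0.1 0.01 0.998
```

MAE stays in the units of the browning score (0 to 3.0), which is why it is the most interpretable of the three for this task.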

D. COMPARATIVE ANALYSIS
The SQ-Swin is fairly compared with state-of-the-art deep convolutional neural network and transformer models. The convolution models include VGG11 [53], Ghostnet [54], and Resnet18 [6]. The transformer models include Twins [55], Deit [56], TNT [57], Visformer [58], and the Swin transformer [26]. We reimplement these models on the lettuce data according to their officially disclosed codes. Specifically, the initial parameters of these models are transferred from the pretrained models in PyTorch Image Models [59]. Then, a feature extraction layer and a regression head are appended to these models to match the lettuce browning prediction task.

1) QUANTITATIVE RESULTS
Table 2 shows that all studied models are effective in predicting the lettuce browning level. The other convolution- and transformer-based competitors achieve a low MAE of around 0.12, an MSE of around 0.02, and a PCC of around 0.96. However, with the siamese and reptile designs, the proposed model greatly surpasses the other state-of-the-art deep models on all three metrics, even when utilizing a traditional linear Swin transformer as the base network. When the quadratic transformer is introduced, the performance further surpasses the traditional linear one, with an MAE of 0.0203, an MSE of 0.0018, and a PCC of 0.9976. Therefore, we conclude that the proposed SQ-Swin significantly outperforms the traditional methods and other deep learning-based backbones on these key metrics.
2) QUADRATIC VS. LINEAR TRANSFORMER
We also illustrate the prediction losses and the siamese losses during the training of the quadratic (SQ-Swin) and linear (S-Swin) transformers. As shown in Fig. 6, both losses of the two transformers drop in the same way in epochs 0-50, because the quadratic terms are frozen at the start. Then, the loss of the linear transformer drops faster than that of the quadratic transformer in epochs 50-100. However, when the learning rate decreases at epochs 100 and 150, the quadratic loss drops further, while the linear loss drops but rises again. The zoomed curves of epochs 60-100 also show that the quadratic curve is smoother than the linear one, further indicating a more stable training process.
After training, the feature maps of the quadratic transformer and the linear transformer are visualized, as shown in Fig. 5. In the early layers, the two transformers follow very similar learning patterns; the feature outlines and textures are nearly the same. However, as the networks go deeper, the features gradually become distinct. The last layer indicates that the quadratic transformer can learn more unified and polarized features, which, to some extent, provide more representative features for the browning evaluation, thereby yielding a more accurate prediction with lower MAE. In summary, the quadratic transformer is more powerful than the linear transformer in loss optimization, training stability, and feature representation.

3) INFERENCE TIME AND PARAMETERS
The traditional lettuce image analysis uses the Progenesis QI software (Nonlinear Dynamics, Newcastle, U.K.) for tabular data extraction. It takes about 0.5-16 minutes of inference time and then uses principal component analysis (PCA) to extract principal components [60]. The whole process is rather time-consuming, thus hindering real-world applications. Therefore, we are motivated to study the inference time of the proposed SQ-Swin on a single whole image. During inference, the SQ-Swin evaluates 40 randomly selected multi-scale patches from each 7360×4912 testing image, and the averaged score is used as the prediction. Table 3 shows that the inference time of the two deep learning methods is within one second, which is much shorter than that of the traditional model. Moreover, the siamese architecture further expedites the inference process, since siamese models can in theory process two images simultaneously. Furthermore, the number of parameters and the multiply-accumulate operations (MACs) of the SQ-Swin are also studied. As shown in Table 4, the SQ-Swin has slightly more parameters because the quadratic mapping introduces extra computation, but the computation cost is still comparable to that of the linear transformer (S-Swin). Please note that the MACs are fewer in the SQ-Swin because we rewrote part of the inherent torch code.
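The patch-averaged whole-image inference described above can be sketched as follows; for brevity we use a single crop scale and a stand-in model, whereas the actual procedure samples multi-scale patches:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_whole_image(model, image, n_patches=40, crop=896, out_size=224):
    """Sketch of whole-image inference: score n_patches random crops of
    the large raw image and average the per-patch browning scores."""
    _, h, w = image.shape
    scores = []
    for _ in range(n_patches):
        top = torch.randint(0, h - crop + 1, (1,)).item()
        left = torch.randint(0, w - crop + 1, (1,)).item()
        patch = image[:, top:top + crop, left:left + crop].unsqueeze(0)
        patch = F.interpolate(patch, size=(out_size, out_size),
                              mode="bilinear", align_corners=False)
        scores.append(model(patch).item())
    return sum(scores) / len(scores)

# Stand-in "model" that always predicts a browning score of 1.5.
dummy = lambda patch: torch.tensor(1.5)
score = predict_whole_image(dummy, torch.randn(3, 1000, 1200), n_patches=4)
print(score)  # 1.5
```

Averaging over many random patches both covers the large 7360×4912 frame and smooths out patch-level noise in the per-crop scores.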

E. VISUAL INTERPRETABILITY
In real-world applications, interpretability is a desired property of any proposed model. The more interpretable a model is, the better people can comprehend how its decisions are made. Therefore, to further boost the reliability and applicability of the SQ-Swin, we derive a visual interpretation of the model by leveraging its inherent attention mechanism. Specifically, we extract the attention map from the multi-head attention module and analyze its internal patterns. Visualizing the initial attention map of the pretrained SQ-Swin in Fig. 7 shows that the attention basically focuses on the desired fresh-cut lettuce pieces. Moreover, investigating the lettuce parts on which the attention centers, as indicated by the red dotted circle in Fig. 7, illustrates that the attention tends to focus on the corners or boundaries of the lettuce pieces. Therefore, we conclude that the SQ-Swin largely relies on the corners or boundaries of the lettuce pieces to make a browning prediction decision. This corresponds to the fact that the browning process on a fresh-cut vegetable piece usually starts from the corners and boundaries.

F. ABLATION STUDY
In the SQ-Swin transformer, several important methods are employed to boost the model performance: the multi-scale patch extraction, pretraining, the siamese architecture, reptile meta-learning, and the quadratic transformer. In this part, we intensively investigate the impacts of these methods. MAE is utilized to quantitatively evaluate the prediction performance since it informs us of the intuitive gap between the prediction and the truth. Table 5 indicates that all studied methods effectively boost model performance. For example, adopting pretraining, the siamese architecture, and the reptile learning method improves the performance by an MAE reduction above 0.4. Following up, the multi-scale patch extraction further decreases the prediction error by 0.0337. The quadratic transformer then surpasses the scores obtained under these four methods with a further improvement of 0.0077. In conclusion, all investigated methods in the SQ-Swin are valid and effective for the lettuce browning prediction problem.

V. DISCUSSIONS
A. SQ-SWIN FOR LARGE DATASETS OR REAL-TIME APPLICATIONS
The proposed SQ-Swin demonstrates its capacity on the lettuce data, which is a small dataset with a limited number of samples. However, real-world applications usually entail processing an extensive volume of images in very large datasets or real-time scenarios, such as CoCo [61], Amazon Customer Reviews [62], autonomous driving [63], and online video streaming [64] data. Naturally, we are interested in whether the SQ-Swin model can potentially be employed in these settings. When the suitability of a model for large-scale datasets or real-time applications is evaluated, factors including GPU memory consumption, model training speed, and throughput matter most on the model development side. Regarding GPU memory, our model can be trained on four RTX 2080TI 11GB GPUs with an image size of 224 × 224 and a batch size of 256, approximately equivalent to a batch size of 64 per 11GB of memory. In contrast, recent advances in diffusion models [65] and large language models, such as LLaMA [66], demand substantial resources, including months of training on thousands of A100 GPUs. The memory consumption and computational overhead of the SQ-Swin remain minimal compared to these extensive models. In terms of training speed, we employ the ReLinear strategy, ensuring rapid convergence. Empirically, the SQ-Swin takes 0.19 seconds per iteration, while the linear version takes 0.18 seconds per iteration; the training speed of the quadratic transformer is thus very close to that of the linear version. Regarding inference speed, our results show that the proposed SQ-Swin can process a single whole image in 0.62 seconds, which is faster than or comparable to other general deep learning methods at 0.85 seconds. Consequently, the model's modest memory consumption, training speed, and throughput position it as a viable choice for applications involving large-scale datasets or real-time requirements.

B. COMPLEX IMAGE SETTINGS
This study primarily focuses on evaluating the SQ-Swin model's performance under ideal lettuce conditions. Indeed, real-world scenarios introduce various variations, including fluctuations in lighting conditions, the presence of different lettuce cultivars, and the existence of other objects within images. Theoretically, the proposed model is designed with several mechanisms, including pretraining, the siamese design, and the multi-scale patch extraction procedure, to enhance its robustness across these diverse image settings. First, the model benefits from pretraining on ImageNet, a dataset that encompasses a wide spectrum of data settings, including varying lighting conditions, object categories, and image perturbations. This pretraining process equips the SQ-Swin with a foundational understanding of these variations. Moreover, the siamese design and the multi-scale patch extraction process force the model to learn more robust and representative features highly associated with lettuce discoloration, which in turn endows the model with more resilience to the aforementioned variations. Although our current data collection lacks these specific variations for performance validation, we recognize their significance and plan to address them in future data collection efforts. To further enhance the model's robustness and adaptability to these challenges, we intend to employ a combination of preprocessing techniques and robust model architectures. Additional data augmentation methods, such as brightness and contrast adjustments, can help the model generalize across varying lighting conditions. Training the model on a diverse dataset that includes different lettuce cultivars and other objects can improve its ability to recognize a wide range of visual patterns. Additionally, regularization techniques like dropout and batch normalization can further enhance model robustness by reducing overfitting.

C. INTERPRETABILITY ENHANCEMENT
In this study, the innate self-attention module is utilized to investigate the inner workings of the proposed SQ-Swin. To further enhance interpretability in the context of lettuce browning detection, several other methods can potentially be employed. Feature importance analysis, visualizations like class activation maps (CAM) [67], and techniques such as local interpretable model-agnostic explanations (LIME) [68] and SHapley Additive exPlanations (SHAP) [69] can uncover which aspects of the input data affect the model's predictions. Ensemble models and rule-based systems can potentially complement these explanations with more transparency. In the future, we will explore these methods to boost the model's explainability.

D. MODEL DEPLOYMENT
Experimentally, our model is trained and deployed on four NVIDIA RTX 2080 GPUs with PyTorch to optimize resource utilization and reduce training time. Moreover, our model can also run inference on a single GPU, or on a CPU when a GPU is unavailable, which boosts the model's applicability. To relax the hardware requirements for practical use, mixed-precision training in PyTorch can be utilized to reduce memory demands while maintaining model accuracy. For real deployment, we can select an appropriate deployment framework such as FastAPI [70], serialize our model accordingly, and then design an API to expose the model's functionality, considering input and output formats and ensuring scalability to handle varying loads. Subsequently, we can host the serialized model on a cloud instance such as AWS, a server, or a container with appropriate resources. Next, we can implement monitoring, security, and thorough testing, and automate the deployment process with CI/CD pipelines.

E. QUANTIFYING UNCERTAINTY
Like conventional deep learning models, the proposed SQ-Swin currently delivers point estimates for lettuce browning prediction. Real-world applications, however, often demand a deeper level of insight, specifically the ability to quantify prediction uncertainty. To address this in the context of lettuce browning detection with the SQ-Swin, several strategies can be explored in future research. Two promising techniques are Bayesian neural networks (BNNs) [71] and Monte Carlo dropout [72]. These methods extend traditional deep learning models to provide probabilistic predictions rather than point estimates for test samples. BNNs quantify uncertainty by learning a distribution over model weights and outputs; Monte Carlo dropout keeps dropout enabled during inference so that the model generates multiple predictions for each input, from which prediction uncertainty can be estimated. The uncertainty can then be communicated through metrics such as a prediction interval, which provides a range within which the true value is likely to fall. Additionally, ensemble methods such as bagging and boosting can further enhance robustness and quantify uncertainty. By integrating these techniques into the SQ-Swin, it is possible not only to identify lettuce browning but also to report the confidence associated with each prediction, enhancing the model's applicability in real-world scenarios and decision-making processes. Our future work will incorporate these techniques to quantify uncertainty in lettuce browning predictions.
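A minimal sketch of Monte Carlo dropout, assuming a small stand-in regressor rather than the actual SQ-Swin: dropout is kept stochastic at inference, and the spread of repeated predictions serves as the uncertainty estimate.

```python
import torch
import torch.nn as nn

# Stand-in regressor with a dropout layer (illustrative only).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                      nn.Dropout(p=0.2), nn.Linear(32, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                      # keep dropout active during inference
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # Mean is the point estimate; std quantifies predictive uncertainty.
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(4, 16)                 # 4 hypothetical feature vectors
mean, std = mc_dropout_predict(model, x)
```

From `mean` and `std`, an approximate prediction interval such as mean ± 1.96·std can be reported alongside each browning score.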

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose a Siamese quadratic Swin transformer for lettuce browning prediction. To the best of our knowledge, this is the first deep learning model of its kind in the lettuce browning detection field. Compared with a typical linear transformer, the proposed SQ-Swin is the first second-order transformer, which is more powerful in capturing real-world representations at comparable computational complexity. Experimental results validate its advantages in loss optimization, training stability, and feature representation. In addition, the SQ-Swin achieves high throughput during inference, making it highly applicable in real-world settings where rapid and accurate lettuce browning assessment is essential; this attribute makes the model a valuable tool for the fresh-cut lettuce industry. Furthermore, our experiments visualize the model's attention maps, which provide valuable insight into its inner workings. This interpretability is critical for understanding the model's decision-making process and for building trust in its predictions. In the future, we will further explore applications of the SQ-Swin in other food science areas.
where w_r, w_g, w_b ∈ R^n are weights and b_r, b_g, b_b ∈ R^n are biases, σ(·) is a regular nonlinear activation function such as ReLU, and ⊙ denotes the element-wise product. We refer to w_r and b_r as the linear terms, and to w_g, b_g, w_b, and b_b as the quadratic terms. Notably, this definition of the quadratic neuron requires only O(3n) parameters, far fewer than the O(n(n+1)/2) of a general quadratic form. Furthermore, to facilitate the training of the quadratic parameters and guarantee model convergence, Fan et al. introduced the ReLinear strategy, which applies a special initialization to the quadratic terms: w_g = 0, b_g = 1, w_b = 0, and b_b = 0.
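Under the definitions above, a quadratic layer with the ReLinear initialization can be sketched in PyTorch as follows. The class name and the layer-wise formulation are our own illustration, and the activation σ is assumed to be applied outside the layer:

```python
import torch
import torch.nn as nn

class QuadraticLinear(nn.Module):
    """Quadratic layer: (W_r x + b_r) ⊙ (W_g x + b_g) + W_b (x ⊙ x) + b_b."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.r = nn.Linear(in_dim, out_dim)  # linear terms w_r, b_r
        self.g = nn.Linear(in_dim, out_dim)  # quadratic terms w_g, b_g
        self.b = nn.Linear(in_dim, out_dim)  # quadratic terms w_b, b_b
        # ReLinear initialization: w_g = 0, b_g = 1, w_b = 0, b_b = 0,
        # so the layer starts out exactly equal to the linear branch r(x).
        nn.init.zeros_(self.g.weight); nn.init.ones_(self.g.bias)
        nn.init.zeros_(self.b.weight); nn.init.zeros_(self.b.bias)

    def forward(self, x):
        return self.r(x) * self.g(x) + self.b(x * x)
```

This keeps the O(3n) parameter count per output unit: three weight vectors and three biases, instead of the full n(n+1)/2 pairwise coefficients of a general quadratic form.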
Fig. 4 also shows some example images with different browning scores. In the multi-scale patch extraction stage, patches are randomly cropped from the raw 7360 × 4912 images at sizes of 1024 × 1024, 2048 × 2048, 3072 × 3072, or 4096 × 4096, with 40 patches extracted per image. All extracted patches are then resized to 224 × 224 with bilinear interpolation. Benefiting from multiple instance learning [50], this yields a total of 6400 images for training and 1600 for testing. Furthermore, two image transformations, flipping (up/down, left/right) and rotation (90°, 180°, or 270°), are applied to augment the dataset.
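The multi-scale patch extraction described above can be sketched as follows; the function and variable names are illustrative, not taken from the paper's implementation:

```python
import random
from PIL import Image

# The four crop scales used in the multi-scale extraction stage.
SCALES = [1024, 2048, 3072, 4096]

def extract_patches(image, n_patches=40, out_size=224):
    """Randomly crop square patches at multiple scales, resize to out_size."""
    w, h = image.size  # raw images are 7360 x 4912 in this dataset
    patches = []
    for _ in range(n_patches):
        s = random.choice(SCALES)
        x = random.randint(0, w - s)   # top-left corner of the crop
        y = random.randint(0, h - s)
        patch = image.crop((x, y, x + s, y + s))
        # Bilinear interpolation down to the transformer input size.
        patches.append(patch.resize((out_size, out_size), Image.BILINEAR))
    return patches
```

With 40 patches per image, the 160 training and 40 testing photographs would yield the 6400/1600 patch counts reported above.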

FIGURE 4. Some fresh-cut lettuce examples with different browning scores.

FIGURE 5. Representative feature maps of the linear transformer and the quadratic transformer.

FIGURE 6. The Siamese and prediction losses of the linear transformer and the quadratic transformer.

FIGURE 7. Attention maps for different lettuce patches.

TABLE 1. Browning scores and their descriptions.

TABLE 2. Quantitative evaluation results of different methods on the testing data. The proposed model is marked with a star, and the best results are in bold.

TABLE 3. Inference time of the traditional model, DenseNet121, and the SQ-Swin on a single lettuce image.

TABLE 4. Parameters of the SQ-Swin and S-Swin.

TABLE 5. Quantitative evaluation results of different methods on the testing data. Bold text indicates the new method added to the model in the previous row; bold numbers indicate the best results.