Near-Infrared Spectroscopy Combined With Support Vector Machine Model to Realize Quality Control of Ginkgolide Production

Chinese traditional medicine (CTM) has a long-standing history and plays a crucial role in complementary and alternative medicine. However, ensuring the quality and safety of CTM products has been a persistent concern due to the lack of effective quality control methods. This study addresses this concern by leveraging chemometric models, specifically partial least squares (PLS), support vector machine (SVM), and random forest, in conjunction with near-infrared spectroscopy (NIRS) data. These models are applied to establish a comprehensive quality control framework for ginkgolide production. This framework includes predicting terpenolactones content at three key production stages of ginkgolide product development. We collect extensive NIRS data throughout the ginkgolide production process and develop chemometric models using PLS, SVM, and random forest algorithms. These models are rigorously validated through cross-validation and independent testing to assess their accuracy and precision in predicting chemical content and classifying product stages. The results reveal that the SVM model, when applied to NIRS data, demonstrates outstanding performance in terms of accuracy and precision. It excels in predicting chemical content and effectively classifying the various stages of ginkgolide production.

have garnered significant attention due to their pharmacological significance. Ginkgolides, a group of terpene lactones found primarily in Ginkgo leaves, have been shown to possess anti-inflammatory, neuroprotective, and vasodilatory effects, making them valuable components in the pharmaceutical and nutraceutical industries [1], [2], [3]. However, ensuring the consistent quality of ginkgolide production is a crucial challenge faced by manufacturers and researchers.
The application of spectroscopic techniques in pharmaceutical and herbal medicine quality control has gained prominence due to their non-destructive and rapid nature. Near-Infrared Spectroscopy (NIRS), a vibrational spectroscopy method, has emerged as a powerful tool for assessing the chemical composition of complex mixtures, including plant extracts and pharmaceutical products [4], [5], [6]. NIRS offers several advantages, such as minimal sample preparation, cost-effectiveness, and the ability to provide real-time data. Consequently, it has found applications in the quality control of various natural products, including herbal medicines. Traditional analytical methods such as thin-layer chromatography (TLC) [7], [8], gas chromatography (GC) [9], [10], high-performance liquid chromatography (HPLC) [11], [12], gas chromatography-mass spectrometry (GC-MS) [13], [14], and DNA molecular marker techniques [15], [16] have also been extensively employed for quality control in traditional Chinese medicine (TCM). For instance, DNA sequencing has been utilized to discern Cordyceps from its counterfeits at the molecular level [17]. However, these methods often necessitate the use of destructive pretreatment procedures for medicinal materials, involving a variety of instruments and chemical reagents, which can be both expensive and time-consuming.
Recently, there has been a growing interest in leveraging near-infrared spectroscopy (NIRS) as a non-destructive and expeditious analytical technique to oversee the manufacturing of Chinese Traditional Medicine (CTM). In ref. [18], NIRS was employed to evaluate the quality of ginseng, a widely utilized herb in CTM. The findings indicated that NIRS was a dependable method for assessing ginseng quality and was capable of detecting adulterants effectively. In ref. [19], NIRS was explored as a nondestructive qualitative and quantitative approach for analyzing hard-shell capsules. That study revealed that NIRS combined with chemometrics had the potential to forecast the content of saponins and identify adulterants. Ref. [20] investigates the feasibility of using near-infrared reflectance spectroscopy to predict the total phosphorus (P) and phytate-P content of common poultry feed ingredients. By systematically evaluating the predictive capabilities of NIRS against established wet chemical methods, it offers insights into the potential of NIRS as a rapid and non-destructive analytical technique for assessing feed quality parameters. Nevertheless, there remains a need for further research regarding the application of NIRS in the context of quality control within the CTM production process.
This paper is directed towards achieving real-time quality control of ginkgolide production through the development of a chemometric model based on Near-Infrared Spectroscopy (NIRS) data. The integrated system comprises a Fourier-transform near-infrared (FT-NIR) spectrometer, a unit for collecting and preparing samples, and software for data acquisition and chemometric analysis. This system has been meticulously designed to capture near-infrared spectra of Chinese Traditional Medicine (CTM) samples at various stages of the production process, spanning from the raw materials to the final products. Subsequently, it extracts pertinent chemical information from these spectra through chemometric analysis. The spectra were subjected to comprehensive chemometric methods aimed at pinpointing characteristic spectral features that correspond to the chemical constituents of CTM. The outcomes of this study demonstrate that the chemometric model, integrating Support Vector Machine (SVM) analysis with NIRS data, serves as an efficacious tool for the continuous monitoring of the CTM production process and the assurance of its quality.

A. The Working Principle of NIRS
Near-infrared spectroscopy (NIRS) is a non-destructive analytical technique that measures the interaction of near-infrared light with matter. The working principle of NIRS is based on the Beer-Lambert law, which states that the absorbance of a sample is proportional to the concentration of the analyte in the sample. In NIRS, a near-infrared light source is used to irradiate the sample, and the transmitted or reflected light is measured by a detector. The absorption spectrum of the sample is then obtained by analyzing the differences in the intensity of the transmitted or reflected light at different wavelengths. The near-infrared region of the electromagnetic spectrum is used in NIRS because it contains spectral information related to the vibrational modes of chemical bonds in organic molecules. This allows for the identification and quantification of various chemical components in the sample. The general steps involved in NIRS model establishment include the selection of appropriate samples, the collection of spectral data, the selection of variables of interest, the development of a calibration model, and the validation of the model using an independent set of samples.
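For reference, the Beer-Lambert relationship invoked above is commonly written as follows. The symbols are introduced here for illustration only and are not taken from the original text: A is the absorbance, I_0 and I are the incident and transmitted intensities, ε is the molar absorptivity, c is the analyte concentration, and l is the optical path length.

$$A = \log_{10}\!\left(\frac{I_0}{I}\right) = \varepsilon\, c\, l$$

For measurements in reflectance mode, an apparent absorbance is often computed analogously as $A = \log_{10}(1/R)$, where R is the measured reflectance. Whether this conversion was applied in the present study is not stated.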

B. Sample Collection and Preparation Process
The proper collection and preparation of samples constitute foundational steps in the utilization of the NIRS monitoring system for maintaining the quality control of the Chinese traditional medicine (CTM) production process at Shanghai Shangyao Xingling Technology Pharmaceutical Co., Ltd. These essential procedures are imperative for guaranteeing the precision, uniformity, and repeatability of the spectral data acquired from CTM samples. In the context of this research, rigorous adherence to established protocols for sample collection and preparation was upheld.
Our approach encompassed the acquisition of diverse samples from various sources and regions, encompassing raw materials, intermediate products, and final products within the CTM production pipeline. Special attention was paid to collecting an ample quantity of samples to ensure their representativeness vis-à-vis the respective batch or lot from which they originated. Subsequently, these samples underwent meticulous processing within the preparation unit of the NIRS monitoring system.
To foster homogeneity and reproducibility, a key facet of sample quality, the collected samples were meticulously ground into a fine powder. This step was instrumental in minimizing variations within the samples and ensuring that the subsequent spectral data accurately reflected the chemical composition and characteristics of the CTM samples under investigation.
Pretreatment of NIRS data and spectral preprocessing stand as pivotal stages in the development of a dependable and precise NIRS model. These pretreatment steps are devised to augment the spectral quality while eliminating undesirable variations or noise inherent in the data. Data pretreatment encompasses several essential steps, including normalization, baseline correction, and derivative transformation.
Normalization: Normalization serves the purpose of scaling the spectra in a manner such that the intensity of each spectrum correlates proportionally with the concentration of the analyte. This procedure ensures that the spectra are aligned correctly with the target variables, enabling accurate analysis.
Baseline Correction: Baseline correction eradicates baseline drift, which can arise from factors such as instrument noise, temperature fluctuations, or sample non-uniformity. This correction process ensures that the spectral data accurately represent the analyte's characteristics without interference from extraneous factors.
Derivative Transformation: Derivative transformation is employed to bolster spectral resolution and accentuate subtle distinctions within the spectra. This enhancement aids in the identification of key features and patterns in the data.
Spectral pretreatment complements data pretreatment and encompasses techniques like smoothing, multiplicative scatter correction, and standard normal variate transformation:
Smoothing: Smoothing is utilized to diminish noise levels within the spectra, thereby enhancing the signal-to-noise ratio. This results in cleaner and more discernible spectral data.
Multiplicative Scatter Correction: Multiplicative scatter correction is employed to rectify unwanted scatter effects stemming from differences in particle size, sample density, or surface roughness. It ensures that the spectra reflect the analyte's properties rather than interference from physical characteristics.
Standard Normal Variate Transformation: This transformation serves to eliminate systematic variations within the spectra caused by differences in scattering and path length. It standardizes the spectral data to a common baseline, facilitating more accurate and robust analysis.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Baseline correction with a fitted polynomial can be represented by the equation:
$$X_{\mathrm{corr}} = X - P$$
where $X$ is the original spectrum, $P$ is the polynomial baseline curve, and $X_{\mathrm{corr}}$ is the corrected spectrum.
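The baseline and scatter corrections described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the polynomial degree, and the synthetic spectra are all hypothetical, and the baseline is fitted to the whole spectrum, which is a simplification.

```python
import numpy as np
from numpy.polynomial import Polynomial


def polynomial_baseline_correct(wavelengths, spectrum, degree=2):
    """Subtract a fitted polynomial baseline P from a spectrum X (X_corr = X - P)."""
    # Polynomial.fit rescales the wavelength axis internally, which keeps the fit stable.
    baseline = Polynomial.fit(wavelengths, spectrum, degree)(wavelengths)
    return spectrum - baseline


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        # Regress each spectrum on the reference: row ≈ intercept + slope * ref
        slope, intercept = np.polyfit(ref, row, 1)
        corrected[i] = (row - intercept) / slope
    return corrected


# Hypothetical data: one absorption band seen with different gains and offsets.
wavelengths = np.linspace(900, 1700, 200)
peak = np.exp(-0.5 * ((wavelengths - 1450) / 30) ** 2)
spectra = np.array([a * peak + b for a, b in [(1.0, 0.0), (1.5, 0.2), (0.7, -0.1)]])

spectra_snv = snv(spectra)
spectra_msc = msc(spectra)
flat = polynomial_baseline_correct(wavelengths, 0.001 * wavelengths + 0.3, degree=1)
```

In this toy case every spectrum is an exact linear function of the mean spectrum, so MSC maps all three rows onto the same curve. Real spectra would only be brought closer together.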
For spectral pretreatment, one common method is derivative transformation, which calculates the rate of change of absorbance with respect to wavelength. This can enhance spectral resolution and remove baseline effects. In this transformation, $Y$ is the original spectrum, $n$ is the order of differentiation, $W$ is the window size, and $d^{n}Y/dx^{n}$ is the nth derivative spectrum.
The Savitzky-Golay method is a widely employed technique for spectral pretreatment in analytical chemistry, particularly in the field of spectroscopy. This method aims to enhance spectral resolution and remove baseline effects by applying a polynomial smoothing function to the spectral data. The fundamental principle underlying the Savitzky-Golay method involves fitting successive subsets of adjacent data points with a low-degree polynomial and then evaluating the smoothed values at the central point of the subset. The Savitzky-Golay smoothing process can be represented by the following equation:
$$y_j^{(m)} = \sum_{k} c_k\, y_{j+k}$$
where $y_j^{(m)}$ represents the smoothed value at data point $j$ obtained by fitting a polynomial of degree $m$; $c_k$ denotes the coefficients of the polynomial; $y_{j+k}$ signifies the original data points within the subset centered at data point $j$; and the sum runs over the offsets $k$ of the points in that subset.
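To make the weighted-sum form of the Savitzky-Golay filter concrete, the sketch below derives the weights $c_k$ from a local least-squares polynomial fit and applies them as a moving window, for smoothing (derivative order 0) or for derivative spectra. It is a minimal NumPy illustration: the function names, the window and polynomial settings, and the test signal are assumptions made here, not values reported in the study.

```python
import numpy as np
from math import factorial


def savgol_coefficients(window, polyorder, deriv=0, delta=1.0):
    """Weights c_k (k = -h..h) so that sum_k c_k * y[j+k] estimates the deriv-th
    derivative of the local least-squares polynomial at the window centre."""
    if window % 2 != 1 or window <= polyorder:
        raise ValueError("window must be odd and larger than polyorder")
    half = window // 2
    k = np.arange(-half, half + 1)
    vander = np.vander(k, polyorder + 1, increasing=True)  # columns: k^0, k^1, ...
    # Row `deriv` of the pseudo-inverse maps the window samples to the fitted
    # coefficient a_deriv; the derivative at k = 0 is deriv! * a_deriv / delta^deriv.
    return np.linalg.pinv(vander)[deriv] * factorial(deriv) / delta ** deriv


def savgol_apply(y, window, polyorder, deriv=0, delta=1.0):
    """Apply the filter; edge points without a full window are dropped."""
    c = savgol_coefficients(window, polyorder, deriv, delta)
    # np.convolve flips its kernel, so reverse c to obtain sum_k c_k * y[j+k].
    return np.convolve(y, c[::-1], mode="valid")


# Hypothetical check on a quadratic signal sampled every 0.5 units.
x = np.arange(20) * 0.5
y = 2.0 + 3.0 * x + 0.5 * x ** 2
smoothed = savgol_apply(y, window=5, polyorder=2)
first_derivative = savgol_apply(y, window=5, polyorder=2, deriv=1, delta=0.5)
```

Because the test signal is itself a quadratic, a degree-2 filter reproduces it and its derivative exactly at the interior points. Production code would more likely call an existing implementation such as `scipy.signal.savgol_filter`, which also handles the spectrum edges.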

A. Description of the NIRS Monitoring System
The NIRS monitoring system developed for quality control of the Chinese traditional medicine (CTM) production process is a sophisticated analytical tool that combines a high-performance Fourier-transform near-infrared (FT-NIR) spectrometer, a sample collection and preparation unit, and data acquisition and chemometric analysis software, as shown in Fig. 1.
The NIR spectrometer (USA-VIAVI, model patux) is the primary component of the system and is used for the acquisition of near-infrared spectra of CTM samples. It is equipped with a deuterium halogen lamp as the light source and an InGaAs detector. The instrument has a spectral resolution of 4 cm⁻¹, which allows for high-resolution spectral data acquisition. The spectrometer operates in reflectance mode, with a wavelength range of 900-1700 nm. This range covers the fingerprint region of the near-infrared spectrum, which contains valuable information about the chemical composition of the CTM samples.
The High-Performance Liquid Chromatography (HPLC) system utilized for quantifying the terpenolactones content in the CTM samples at the outflow comprises a state-of-the-art Agilent 1290 system, manufactured by Agilent Technologies, USA. This sophisticated analytical instrument is equipped with a range of advanced components and features to ensure precise and reliable chromatographic analysis. The HPLC system is equipped with a diode array detector (DAD), which enables the simultaneous acquisition of absorbance spectra across a broad wavelength range. This allows for the detection and quantification of terpenolactones and other compounds in CTM samples based on their unique UV-visible absorption profiles. The use of a DAD enhances detection sensitivity and selectivity, enabling accurate and reliable quantification of target analytes even at low concentrations.
The sample collection and preparation unit was responsible for the collection and preparation of CTM samples for NIRS analysis. In research settings aiming for comprehensive characterization of Chinese Traditional Medicine (CTM) samples, a sample size ranging from 50 to 100 samples is often considered suitable for achieving statistically meaningful results. This sample size allows for the exploration of sample variability, identification of trends, and establishment of calibration models with sufficient predictive power. The samples are collected from different sources and regions and are prepared according to standardized protocols to ensure consistency and reproducibility. The sample preparation involves grinding the samples to a fine powder and packing them into sample cups for spectral data acquisition.

B. SVM Chemometric Model
The chemometric models were developed using a representative set of CTM samples, and the models were validated using statistical parameters such as root mean square error of prediction (RMSEP), coefficient of determination (R²), and cross-validation error. A chemometric model using support vector machine (SVM) analysis for Chinese traditional medicine (CTM) typically involves the following steps:
Data acquisition: CTM samples are analyzed using near-infrared spectroscopy (NIRS) to obtain spectral data. The spectral data are preprocessed to remove noise and background signals.
Training dataset preparation: A set of CTM samples with known values of the target variable (e.g., active compound content, impurity level) is selected to develop the SVM model. The spectral data from these samples are used as input variables, and the target variable values are used as output variables.
Input variable selection: A subset of the input variables (i.e., wavelengths) is selected using feature selection methods to optimize the SVM model performance.
SVM model development: The selected input variables and output variables are used to train the SVM model using a suitable kernel function. The SVM model parameters are optimized to achieve the best performance.
Model validation: The performance of the SVM model is evaluated using statistical metrics such as root mean square error (RMSE), coefficient of determination (R²), and cross-validation.
Prediction: The developed SVM model can be used to predict the target variable values in new, unseen CTM samples based on their NIRS spectral data.
The specific details of the SVM model, such as the kernel function used and the SVM parameters, depend on the nature of the CTM samples and the target variable. The SVM model can be optimized using various methods such as grid search, genetic algorithms, and simulated annealing.
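The workflow above (training set preparation, model development with grid search, validation, prediction) can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions rather than the configuration used in the study: the spectra are synthetic, the RBF kernel and the parameter grid are arbitrary choices, and the wavelength-selection step is omitted for brevity.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for NIRS data: one band whose height tracks the concentration.
rng = np.random.default_rng(42)
n_samples, n_wavelengths = 80, 120
wavelengths = np.linspace(900, 1700, n_wavelengths)
concentration = rng.uniform(0.2, 1.4, n_samples)
band = np.exp(-0.5 * ((wavelengths - 1450) / 40) ** 2)
spectra = concentration[:, None] * band + 0.02 * rng.normal(size=(n_samples, n_wavelengths))

# Hold out an independent test set for the final validation step.
X_train, X_test, y_train, y_test = train_test_split(
    spectra, concentration, test_size=0.25, random_state=0
)

# Support vector regression with an RBF kernel; hyperparameters tuned by
# 5-fold cross-validated grid search on the training set only.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {
    "svr__C": [1, 10, 100],
    "svr__gamma": ["scale", 0.001, 0.01],
    "svr__epsilon": [0.01, 0.05],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

# Validation on unseen samples: RMSEP and coefficient of determination.
y_pred = search.predict(X_test)
rmsep = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = float(r2_score(y_test, y_pred))
print(search.best_params_, rmsep, r2)
```

Wrapping the scaler and the regressor in one pipeline keeps the scaling statistics inside each cross-validation fold, so the held-out fold does not leak into the fit.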
The SVM model seeks to find a hyperplane that separates the samples with different target variable values. Given a set of CTM samples with N spectral data points and a target variable y, the hyperplane is defined by a set of weights w and a bias term b. The weights and bias term are learned during the training phase using a suitable kernel function.
The SVM model seeks to solve the following optimization problem:
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i}\xi_i \quad \text{subject to} \quad y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0$$
where: $\lVert w\rVert^{2}$ is the squared L2 norm of the weight vector $w$, which represents the margin of the hyperplane; $C$ is a hyperparameter that controls the trade-off between the margin size and the classification error; $x_i$ is the spectral data vector for the i-th CTM sample; $y_i$ is the target variable value for the i-th CTM sample; $\xi_i$ is the slack variable that allows for some misclassification errors in the training data.
The optimization problem is solved using quadratic programming to obtain the optimal values of w and b that maximize the margin and minimize the classification error. The kernel function is used to transform the input data into a higher-dimensional space, where the hyperplane can separate the samples more effectively.
Fig. 2. Spectral data collection at key CTM production stages: (a) spectra measured after the Ginkgo biloba eluted with 18% of active components, (b) spectra measured after the Ginkgo biloba eluted with 30% of active components, (c) spectra measured after the Ginkgo biloba eluted with 50% of active components.
The prediction of the SVM model for a new CTM sample $x^{*}$ is given by:
$$f(x^{*}) = \operatorname{sign}\!\Big(\sum_{i}\alpha_i\, y_i\, K(x^{*}, x_i) + b\Big)$$
where: $\alpha_i$ is the Lagrange multiplier obtained during the training phase; $y_i$ is the target variable value for the i-th CTM sample; $K(x^{*}, x_i)$ is the kernel function that measures the similarity between the new sample $x^{*}$ and the training sample $x_i$; $b$ is the bias term; and sign() is the sign function that maps the output of the SVM model to the binary classes (e.g., positive and negative).
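The kernel-based prediction rule can be evaluated directly once the multipliers, labels, and support vectors are known. The sketch below does this with an RBF kernel. Everything numeric in it (the two support vectors, the multipliers, the bias, and the kernel width `gamma`) is a made-up value for illustration, and the choice to map a score of exactly zero to the positive class is a convention assumed here.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


def svm_predict(x_new, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Evaluate sign(sum_i alpha_i * y_i * K(x_new, x_i) + b)."""
    score = sum(
        alpha * y * kernel(x_new, x_i)
        for alpha, y, x_i in zip(alphas, labels, support_vectors)
    ) + bias
    return 1 if score >= 0 else -1


# Hypothetical trained model with one support vector per class.
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.array([-1, 1])
alphas = np.array([1.0, 1.0])
bias = 0.0

near_positive = svm_predict([1.9, 2.1], support_vectors, labels, alphas, bias)
near_negative = svm_predict([0.1, -0.1], support_vectors, labels, alphas, bias)
```

A sample close to the positive support vector receives a kernel weight near 1 from it and a weight near 0 from the other, so the sum is dominated by the nearby training sample.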

A. Data Collection
In this study, data processing was conducted using a combination of the open-source NumPy library and custom-coded algorithms tailored to the specific analytical requirements of the research. The spectral data, as illustrated in Fig. 2, were collected at the first phase of column chromatography. The first phase of column chromatography consists of three main steps: 1) elution with 18% active component; 2) elution with 30% active component; 3) elution with 50% active component. To address potential variability among samples originating from different sources, standardized protocols were implemented for sample preparation and spectral acquisition. This involved uniform grinding of samples to a fine powder and meticulous packing into sample cups to minimize variations in sample presentation. A notable feature of these spectral data is the presence of prominent reflectance peaks falling within the wavelength range of 1329 nm to 1700 nm. These peaks signify specific spectral features that are indicative of the chemical composition or changes occurring in the CTM during its production process. It is important to highlight that after the 50% processing step, these peaks exhibit a slight reduction in intensity when compared to the spectral data from the preceding two stages (18% and 30%). This reduction suggests alterations in the chemical constituents or properties of the CTM as it progresses through the production stages. Additionally, around the wavelength 1187 nm, a smaller peak is observed. The characteristics of this peak differ notably during the 50% processing stage when compared to the earlier two stages (18% and 30%). This variation may hold significance in understanding the transformation of specific chemical components or the emergence of new compounds during the later stages of CTM production.
Fig. 3 shows the concentration distribution of the modeling data at the critical stages of the Chinese Traditional Medicine (CTM) production process. During the stage in which Ginkgo biloba is eluted with 18% of active components, the concentration profiles of CTM components exhibit a relatively even distribution within the range of 0.2 to 1.4. At this stage, a substantial proportion of CTM samples have concentrations that fall within the mid-range. This even distribution might indicate a balanced composition of CTM components during this phase of the production process. At the stage in which Ginkgo biloba is eluted with 30% of active components, we note a higher degree of uniformity in concentration values, with a notable concentration cluster occurring around the value of 1.0. This concentration pattern is indicative of a higher level of overall quality during this particular stage. The consistency in concentration suggests that the CTM components are present in a well-defined and uniform manner, potentially signifying a desirable product quality. However, at the stage in which Ginkgo biloba is eluted with 50% of active components, concentration variations become markedly pronounced, particularly around the peak values within the spectra. This observation points towards complex chemical transformations or interactions occurring during the later stages of CTM production, underscoring the importance of monitoring and controlling the production process to ensure the desired product quality.

B. Model Validation
To further explore spectral features, we built upon the initial data analysis by subjecting the original spectral bands to secondary operations such as first-order derivatives and modulus transformations. This exploration revealed a significant enhancement in the correlation between the first-order derivative-transformed data and the concentration data. Subsequently, we conducted modeling experiments using Partial Least Squares (PLS), Support Vector Machine (SVM), and Random Forest (RF) regression techniques, with both the original spectra and the derivative-transformed spectra, as shown in Figs. 4-6. These models were then rigorously validated using an independent validation dataset to assess their performance and predictive capabilities.
Fig. 4. Results of the PLS model: the first row shows the results for the original spectra, and the second row shows the results for the spectra with derivative.
Fig. 5. Results of the SVM model: the first row shows the results for the original spectra, and the second row shows the results for the spectra with derivative.

C. Final Comparison
In a comparative analysis, it was found that the Support Vector Machine (SVM) model for detecting Chinese Traditional Medicine (CTM) components, constructed using first-order derivative-transformed spectral data (SVM+DIFF), exhibited clear predictive advantages in terms of detection accuracy and stability across different production stages, as shown in Tables I-III. This suggests that there are distinct spectral characteristics in the rate of change between spectral bands, which become evident when different concentrations are involved. SVM is a widely used method known for its effectiveness in solving small-sample problems and achieving optimal solutions. In this study, we conducted comparative modeling using Partial Least Squares (PLS), Support Vector Machine (SVM), and Random Forest (RF). This evaluation provided further insight into the regression fitting performance of the SVM regression model, as depicted in Fig. 5. The results clearly indicate that the SVM+DIFF model developed in this study holds a distinct advantage in the context of CTM quality detection.
Fig. 6. Results of the RF model: the first row shows the results for the original spectra, and the second row shows the results for the spectra with derivative.

V. CONCLUSION
This paper has presented a comprehensive exploration into the application of Near-Infrared Spectroscopy (NIRS) coupled with Support Vector Machine (SVM) modeling for enhancing the quality control of Chinese Traditional Medicine (CTM) production. While there are challenges associated with the complexity of CTM samples and the requirement for skilled personnel, the potential benefits in terms of enhanced quality control and long-term cost savings make these techniques highly worthy of continued exploration and refinement in the realm of traditional medicine production. The central contribution of this research lies in the successful implementation of NIRS combined with SVM analysis as a promising approach for elevating the quality control standards within CTM production. By harnessing the capabilities of SVM, a robust chemometric model has been developed for analyzing CTM samples. While it is accurate that the model excels in upholding stringent quality control measures, particularly through the analysis of offline samples collected at various stages of CTM production, it is essential to acknowledge that real-time process monitoring is not directly facilitated by the model in its current implementation. Further research can delve into refining the chemometric models, making them even more adept at handling the complexity of CTM samples. In addition, the integration of additional data sources, such as environmental and production variables, could provide a more comprehensive understanding of the CTM production process, leading to better quality control strategies.

Fig. 1. (a) Block diagram of the system and (b) NIRS monitoring system for traditional Chinese medicine production.

TABLE I. PERFORMANCE FOR EACH METHOD FOR STAGE 18%

TABLE II. PERFORMANCE FOR EACH METHOD FOR STAGE 30%

TABLE III. PERFORMANCE FOR EACH METHOD FOR STAGE 50%