Learning Inference-Time Drift Sensor-Actuator for Domain Generalization

In machine learning tasks, models trained in the source domain often suffer from performance degradation in the target domain due to domain drift or distribution shift. In this paper, we explore the concept of sensor-actuator design in adaptive control to address this domain drift problem and develop a new approach, called learning inference-time drift sensor-actuator (LIDSA) for domain generalization. The drift sensor network consists of a constraint network and a data converter. The constraint network is learned to extract a set of constraints in the source domain and sense the domain drift by detecting the deviation from these constraints, called constraint error, which is correlated with the classification error. The data converter network then maps this constraint error into an effective guidance signal, which can guide the actuator network to adjust the feature to achieve improved discrimination power and better generalization performance. Our extensive experimental results demonstrate that the proposed LIDSA approach improves the performance of domain generalization over the baseline method.


INTRODUCTION
Deep neural networks have achieved remarkable success in various computer vision tasks, but the performance of network models degrades significantly when there is a distribution shift between the source domain where the model is trained and the target domain where the model is tested. This problem is referred to as the domain drift problem [1,2]. As an alternative approach, domain generalization (DG) [3,4] aims to train network models in the source domain to achieve better generalization capability without accessing target-domain samples. Specifically, DG considers an accessible source dataset Ds = {(x, y) ∼ ps(x, y)} and an inaccessible target dataset Dt = {(x, y) ∼ pt(x, y)}, where x is an image and y ∈ Y is its associated label from the set of source classes Y. Covariate shift is one type of domain drift often observed in DG tasks [3]. Under covariate shift, ps(y|x) = pt(y|x) while ps(x) ≠ pt(x); that is, the data distribution in the target domain differs from that in the source domain, but the labeling rule remains the same [5]. In the practical deployment of machine learning models, covariate shift is difficult to avoid and challenging to address, leading to poor generalization performance [6].
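To make the covariate-shift definition concrete, the toy numpy sketch below (illustrative only, not part of the paper's method) keeps the labeling rule p(y|x) fixed while shifting the input distribution p(x) between two synthetic one-dimensional domains:

```python
import numpy as np

rng = np.random.default_rng(5)

# Shared labeling rule p(y|x): the class depends only on the sign of x.
def label(x):
    return (x > 0).astype(int)

# Covariate shift: p_s(x) != p_t(x) across the two domains ...
x_source = rng.normal(loc=-1.0, scale=1.0, size=10_000)
x_target = rng.normal(loc=+2.0, scale=1.0, size=10_000)

# ... while p(y|x) -- the rule mapping inputs to labels -- is unchanged.
p_pos_source = label(x_source).mean()  # few positives in the source domain
p_pos_target = label(x_target).mean()  # mostly positives in the target domain
assert p_pos_source < 0.5 < p_pos_target
```

A classifier fit only to the source data therefore sees very different inputs at test time even though the ground-truth rule never changed, which is exactly the situation DG must cope with.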
To address the domain drift problem, a number of DG methods have been developed in the literature based on learning domain-invariant representations, data augmentation, learning strategies, and test-time optimization. Domain-invariant representation methods learn invariant features across source domains with the belief that these features are also robust to target-domain drift. Statistical metrics minimization [7,8] and contrastive learning [9,10,11] methods pull positive samples of the same class together while pushing away negative groups from different classes across source domains. Adversarial learning methods [12,13] aim to fool the domain discriminator while learning a domain-invariant feature extractor. Data augmentation methods [14,15,16] attempt to increase the diversity of training data distributions or to simulate domain drift on the source domains. To improve generalization capability, advanced learning strategies such as meta-learning, which exposes models to domain drift during training [17,18,19], model averaging [20,21,22], and test-time optimization [23,24] have been explored and show promising results.
In contrast to most existing works on DG that attempt to learn generalizable representations, our method aims to design a new inference approach that has the capability of sensing domain drift and adjusting the feature embedding accordingly during the inference stage. We are inspired by the sensor-actuator design in adaptive control, where an intelligent system is able to sense environmental drift and adapt its behavior [25]. With this type of environmental drift sensing and actuating function, the system achieves improved adaptation performance and generalization power in a new environment. Specifically, sensors are designed to detect perturbations or changes in the current environment. The actuator receives the guidance signal from the sensor and adjusts its action accordingly.
Within the context of domain generalization, the major research questions become: How can we design a sensor to detect domain drift? How can we convert the sensed drift errors into an effective guidance signal? How can we use this guidance signal to adaptively adjust the feature embedding for the test sample? To answer these questions, we propose the so-called learning inference-time drift sensor-actuator (LIDSA) method. As illustrated in Fig. 1, we first extract a set of constraints in the source domain. The proposed LIDSA method is able to sense changes ei caused by the domain drift by detecting the deviation from these constraints through the constraint networks; these deviations are referred to as constraint errors. The magnitudes of these constraint errors are highly correlated with the classification error. Then, using a learned data converter network, the sensed constraint errors are mapped into an effective guidance signal s_g, which guides the actuator network ΦAct to adjust the original feature f into a more discriminative feature f′ to achieve better generalization.
Major contributions of our work can be summarized as follows: (1) Inspired by the sensor-actuator design in adaptive control, we develop a new sensor-actuator approach for domain generalization. (2) We introduce two types of constraints and learn networks to verify whether these constraints are satisfied so as to sense the domain drift. Based on the domain drift sensor information, we design an actuator network to adaptively adjust the feature at inference time. (3) Our extensive experimental results on benchmark datasets demonstrate that the proposed method is able to significantly improve the performance of domain generalization (DG) tasks.

METHOD
The goal of domain generalization (DG) is to utilize multiple source domains to train a generalizable model for unseen target samples. In this section, we present our learning inference-time drift sensor-actuator (LIDSA) method for DG, which consists of a sensor to detect the domain drift and an actuator to adaptively adjust the feature based on the sensor signals for improving inference performance.

Learning to Sense the Domain Drift
Sensors are important modules in control systems designed to detect changes or events in the environment, based on which the system can take action accordingly. Within the context of DG, domain drift degrades the performance of classification models. We design a sensor to capture the domain drift and encode it into a guidance signal s_g. In our proposed LIDSA approach, the sensor for domain drift detection consists of two major components. The first component is the constraint network, which detects the domain drift by verifying whether the constraints learned in the source domain are satisfied in the target domain. The second component of the sensor is a data converter network that maps the constraint errors into a guidance signal.
Our central idea is to introduce a set of constraints Ω = {Ωi | 1 ≤ i ≤ NΩ} and learn a constraint network in the source domain to verify whether the output feature satisfies each of these constraints. Since the baseline network ΦF and the constraint network are trained in the source domain, the feature f will naturally satisfy the constraints Ω in the source domain. In the target domain, however, due to domain drift, the feature of a test sample generated by the baseline network ΦF may no longer satisfy the constraints Ω. In other words, there are errors ei in satisfying the constraint Ωi. We refer to this error as the constraint error. In this work, we introduce the following two constraints to sense the domain drift from two different perspectives: a structural constraint and a distribution constraint.
(1) Structural constraint. We learn a network ΦG to project the original feature f into f^p = ΦG(f) ∈ R^D, where D is the dimension of the feature, and then impose a structural constraint on f^p. When defining the structural constraint, for each sample class c, we randomly select a subset of indices Λ+_c ⊂ {1, ..., D} and denote the remaining indices by Λ−_c. We then define the following bipolar feature constraint vector:

γ_sc[i] = 1 if i ∈ Λ+_c, and γ_sc[i] = 0 if i ∈ Λ−_c. (1)

Then, the deviation of the projected feature f^p from this constraint is defined to be the structural constraint error:

e_sc = f^p − γ_sc. (2)

With γ_sc, the projected features of in-distribution samples have significantly larger values at the indices from Λ+_c, while values at the indices from Λ−_c are close to 0. Once there is domain drift for a target sample, this polarity of the projected feature vanishes. Thus, the more serious the domain drift is, the less the projected feature satisfies the structural constraint, leading to a larger norm of e_sc.
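A minimal numpy sketch of the structural constraint, assuming γ_sc is a binary vector with ones on Λ+_c and zeros on Λ−_c, and e_sc is the elementwise deviation f^p − γ_sc (the feature dimension, subset size, and noise levels below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, num_pos = 16, 4  # feature dimension and |Lambda+_c| (illustrative sizes)

# Bipolar constraint vector for one class c: 1 on indices in Lambda+_c, 0 elsewhere.
lambda_pos = rng.choice(D, size=num_pos, replace=False)
gamma_sc = np.zeros(D)
gamma_sc[lambda_pos] = 1.0

def structural_constraint_error(f_p, gamma):
    """Deviation of the projected feature f^p from the bipolar constraint."""
    return f_p - gamma

# An in-distribution projected feature satisfies the constraint well ...
f_p_source = gamma_sc + 0.01 * rng.standard_normal(D)
# ... while a drifted feature loses the bipolar structure.
f_p_target = 0.5 * np.ones(D) + 0.1 * rng.standard_normal(D)

e_src = np.linalg.norm(structural_constraint_error(f_p_source, gamma_sc))
e_tgt = np.linalg.norm(structural_constraint_error(f_p_target, gamma_sc))
assert e_src < e_tgt  # stronger drift -> larger constraint-error norm
```

The sketch shows the intended monotonicity: the more the projected feature departs from the bipolar pattern, the larger the norm of e_sc.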
(2) Distribution constraint. The second constraint is about the feature distribution. In the source domain, sample features are well clustered. In the target domain, however, these clustered feature distributions are perturbed and some features move away from the class centers due to domain drift. In the source domain, let O = {O1, O2, ..., OC} be the feature cluster centers of all classes. In the target domain, the deviations of the feature f from these centers capture information about the domain drift. The distribution constraint error e_dc is defined as the weighted summation of the feature differences from these cluster centers:

e_dc = Σ_{c=1}^{C} Φcorr(f, Oc) · (f − Oc), (3)

where Φcorr is a learned distribution constraint network that predicts the correlation between the sample feature and each cluster center.
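The weighted summation can be sketched as follows; here stand-in softmax weights over negative distances replace the learned network Φcorr, and the sizes are illustrative:

```python
import numpy as np

def distribution_constraint_error(f, centers, weights):
    """e_dc = sum_c w_c * (f - O_c): weighted sum of the feature's
    deviations from every class center; w_c stands in for Phi_corr(f, O_c)."""
    diffs = f[None, :] - centers                  # (C, D) deviations
    return (weights[:, None] * diffs).sum(axis=0)

rng = np.random.default_rng(1)
C, D = 3, 8
centers = rng.standard_normal((C, D))            # class centers O_1..O_C

f = centers[0] + 0.05 * rng.standard_normal(D)   # feature near center O_1
# Stand-in weights: softmax over negative distances (the paper learns these).
d = np.linalg.norm(f[None, :] - centers, axis=1)
w = np.exp(-d) / np.exp(-d).sum()

e_dc = distribution_constraint_error(f, centers, w)
assert e_dc.shape == (D,)
```

In the actual method the weights come from the learned Φcorr rather than a fixed distance heuristic, which is what lets e_dc be shaped into a useful guidance signal.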
Here, we do not use the direct statistical correlation because we wish the distribution constraint error e_dc to be learnable, so that it provides more effective guidance for the subsequent feature adjustment.
To provide effective guidance for the actuator to adjust the feature, we design a data converter network, which is jointly learned with the actuator network. It converts the sensed constraint errors e_sc and e_dc into a guidance signal for the actuator network:

s_g = ΦCvt(e_sc, e_dc). (4)

Actuator Network Design for Feature Adjustment
In a control system, an actuator responds to the guidance signal by taking an action to maximize the performance objective. Within the context of domain generalization, the feature generated by the baseline network ΦF is perturbed due to the domain drift. To achieve successful domain generalization, we wish to learn the capability of adaptively adjusting the feature to maximize its discriminative power and achieve better classification performance. To this end, we propose to learn an actuator network ΦAct to perform adaptive adjustment of the feature:

f′ = f + ΦAct(f, s_g). (5)

Note that this adaptive feature adjustment capability should be learned in the source domain, just like a person learning an adaptation skill in school and applying this learned skill when moving to a new environment; such a person will have better adaptation performance. This implies that the actuator network needs to be learned in the source domain, without accessing target-domain samples. To learn the actuator network ΦAct in the source domain, we can simply perturb the original feature,

f̃ = f + ϵ(f), (6)

and ask the sensor-actuator network to adjust the perturbed feature f̃ so that it achieves the same classification output as the original feature, i.e., ΦC(f̃ + ΦAct(f̃, s_g)) ≈ ΦC(f), where ΦC(•) is the baseline classification head. One way to perturb the original feature f is to alter the batch-wise statistics of the original features. We use the feature stylization method in [10] to re-scale the variance of the feature distribution with batch-channel-wise statistics and sample new style vectors µsty and σsty from the re-scaled distribution:

f̃ = σsty · (f − µ)/σ + µsty, (7)

where µ and σ are the batch-wise mean and variance statistics of the original features. In our experiments, we find that simply applying this feature stylization as an augmentation technique for network training yields negligible improvement or even degrades the performance. However, the performance is improved when it is combined with our sensor-actuator design for feature adjustment.
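A numpy sketch of this stylization-based perturbation follows; the exact scheme for sampling µsty and σsty follows [10] and is simplified here into Gaussian sampling around the batch statistics (the `scale` factor standing in for the re-scaled variance is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

def stylize(feats, scale=1.0):
    """Perturb a batch of features by swapping in new style statistics:
    f_tilde = sigma_sty * (f - mu) / sigma + mu_sty."""
    mu = feats.mean(axis=0, keepdims=True)             # batch-wise mean
    sigma = feats.std(axis=0, keepdims=True) + 1e-6    # batch-wise std
    # Sample new style vectors around the batch statistics; 'scale' mimics
    # the re-scaled variance of the style distribution (assumed form).
    mu_sty = mu + scale * sigma * rng.standard_normal(mu.shape)
    sigma_sty = np.abs(sigma * (1.0 + scale * rng.standard_normal(sigma.shape)))
    return sigma_sty * (feats - mu) / sigma + mu_sty

batch = rng.standard_normal((32, 16))  # a batch of 16-dim features
perturbed = stylize(batch)
assert perturbed.shape == batch.shape
```

The perturbed features keep the same shape and (roughly) the same content, but carry shifted first- and second-order statistics, mimicking a domain drift during training.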

LIDSA Network Design and Training
The overall network design is illustrated in Fig. 2. First, the baseline network ΦF extracts the original feature f, and the source class centers O = {O1, O2, ..., OC} are computed on the original features. We then perform feature stylization on the original feature f to generate the perturbed one, f̃. Second, the structural constraint network ΦG generates the structural constraint error e_sc from f̃ and γ_sc, and the distribution constraint network Φcorr predicts the distribution constraint error e_dc from f̃ and O. The data converter network ΦCvt then converts these sensed constraint errors into a guidance signal s_g. With the perturbed feature f̃ and the guidance signal s_g as inputs, the actuator network ΦAct adjusts the feature to f′ = f̃ + ∆f. The new feature f′ is then passed to the baseline classifier ΦC to produce the final classification result.
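The sense-convert-actuate pipeline above can be sketched end to end; random linear maps and uniform weights stand in for the learned networks ΦG, Φcorr, ΦCvt, and ΦAct, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
D, C = 16, 3

# Random linear maps stand in for the learned networks (illustration only).
W_G   = rng.standard_normal((D, D)) / np.sqrt(D)          # Phi_G
W_cvt = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)  # Phi_Cvt
W_act = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)  # Phi_Act
centers = rng.standard_normal((C, D))                     # class centers O_c
gamma = (rng.random(D) > 0.5).astype(float)               # bipolar vector

def lidsa_adjust(f):
    # 1. Sense: structural and distribution constraint errors.
    e_sc = f @ W_G - gamma
    w = np.ones(C) / C                       # stand-in for Phi_corr weights
    e_dc = (w[:, None] * (f[None, :] - centers)).sum(axis=0)
    # 2. Convert: map the sensed errors into a guidance signal s_g.
    s_g = np.concatenate([e_sc, e_dc]) @ W_cvt
    # 3. Actuate: predict the residual and adjust, f' = f + delta_f.
    delta_f = np.concatenate([f, s_g]) @ W_act
    return f + delta_f

f = rng.standard_normal(D)
f_adj = lidsa_adjust(f)
assert f_adj.shape == (D,)
```

Note the residual form of step 3: the actuator predicts an additive correction ∆f rather than a replacement feature, so an unperturbed feature needs only a small adjustment.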
During the training stage, the baseline network ΦF and the baseline classification head ΦC are fixed. We first train the structural constraint network ΦG, which aims to project the feature into a new space that matches the structural constraint in (1). The training loss function is given by

L_ST(f) = ∥ΦG(f) − γ_sc∥². (8)

Once the structural constraint network is trained, it is fixed during the training of the following three networks: the distribution constraint network Φcorr, the data converter network ΦCvt, and the actuator network ΦAct. These three networks are trained end-to-end in the whole LIDSA system using the following loss:

L = L_CE(ΦC(f′), y) + λ · L_ST(f′). (9)

Here, L_CE is the cross-entropy loss measuring the final classification performance of the adjusted feature f′ under the baseline classifier ΦC. The second loss, L_ST(f′), is the structural constraint loss for the adjusted feature f′; with this loss, we encourage the adjusted feature to better satisfy the structural constraint.
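The combined objective can be sketched numerically; the squared-error form of L_ST is an assumed sketch consistent with the bipolar constraint, and λ = 2.0 matches the value used in the experiments:

```python
import numpy as np

def cross_entropy(logits, y):
    """L_CE for a single sample (numerically stable log-softmax)."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[y])

def structural_loss(f_p, gamma):
    """L_ST: squared deviation of a projected feature from gamma_sc."""
    return float(((f_p - gamma) ** 2).sum())

def lidsa_loss(logits, y, f_p_adjusted, gamma, lam=2.0):
    """Total loss L = L_CE + lambda * L_ST(f'); lambda = 2.0 as in the
    experiments, L_ST form assumed for illustration."""
    return cross_entropy(logits, y) + lam * structural_loss(f_p_adjusted, gamma)

logits = np.array([2.0, 0.5, -1.0])       # Phi_C(f') for one sample
gamma = np.array([1.0, 0.0, 1.0, 0.0])    # bipolar constraint vector
f_p = np.array([0.9, 0.1, 1.1, 0.0])      # projection of the adjusted feature
loss = lidsa_loss(logits, y=0, f_p_adjusted=f_p, gamma=gamma)
assert loss > 0.0
```

A well-adjusted feature keeps both terms small: its logits favor the true class (small L_CE) and its projection stays close to the bipolar pattern (small L_ST).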

EXPERIMENT
In this section, we present experimental results and analysis to demonstrate the performance of our LIDSA method.

Datasets and Implementation Details
We conduct performance comparisons on the PACS [26], OfficeHome [27] and VLCS [28] benchmark datasets, each of which contains images from 4 distinct domains. For each dataset, we choose one domain as the target domain for inference, and the other three are used as source domains for training the model. We use Resnet-50 [29] as the backbone network. The structural constraint network ΦG and the converter network ΦCvt are each constructed with a single fully connected layer. The distribution constraint network Φcorr is built with two fully connected layers. The actuator network ΦAct is a 2-layer MLP containing a ReLU layer and a dropout layer with a dropout probability of 0.5. We follow the training and evaluation protocol in [30], including the dataset splits, model selection and simple data augmentation [21]. Following SWAD [21] and PCL [9], we use the Adam optimizer with a learning rate of 5e-5, and λ in (9) is set to 2.0.
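The layer counts above can be made concrete in a shape-level sketch; the input/hidden widths for each head are assumptions (only the layer counts, the ReLU, and the dropout rate are stated in the text):

```python
import numpy as np

rng = np.random.default_rng(4)
D = 2048  # Resnet-50 feature dimension

def linear(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

# Phi_G and Phi_Cvt: a single fully connected layer each (widths assumed).
W_G, W_cvt = linear(D, D), linear(2 * D, D)
# Phi_corr: two fully connected layers (hidden width assumed).
W_corr1, W_corr2 = linear(2 * D, D), linear(D, 1)
# Phi_Act: 2-layer MLP with ReLU and dropout (p = 0.5, training only).
W_act1, W_act2 = linear(2 * D, D), linear(D, D)

def actuator(x, train=False, p=0.5):
    h = np.maximum(x @ W_act1, 0.0)                    # ReLU
    if train:                                          # inverted dropout
        h *= (rng.random(h.shape) > p) / (1.0 - p)
    return h @ W_act2

out = actuator(rng.standard_normal(2 * D))
assert out.shape == (D,)
```

All of these heads are tiny relative to the Resnet-50 backbone, so the sensor-actuator machinery adds little inference cost.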

Performance Comparisons
In Table 1, we provide quantitative comparisons with existing methods on the PACS, OfficeHome and VLCS datasets with the Resnet-50 backbone. We can see that our LIDSA method improves upon the current best method PCL [9] by 0.3% on the PACS dataset. On the task of Cartoon (C), Photo (P), Sketch (S)→Art-painting (A), our method outperforms the second-best result by 1.2%. For the Art-painting, Cartoon, and Photo→Sketch task, we also achieve a performance gain of 1.2%. On the OfficeHome dataset, our LIDSA method outperforms the state-of-the-art DG methods by a large margin of 1.2%, showing superior performance. Our method significantly outperforms the second-best result on the Art (A) task by 2.2%. We also obtain the best performance on the Product and Real tasks, improving the performance by 1.2% and 1.3%, respectively. On the VLCS dataset, we improve the performance by 1.1%. For the task of C, L, S→V, we achieve the highest accuracy of 80.4%. We perform t-tests on the performance improvements over the baseline method. At a significance level of 0.05, the p-values for the three datasets are 0.0061, 0.0003 and 0.0011, respectively, suggesting that our method achieves significantly higher accuracy than the baseline method.

Ablation Studies and Analysis
We conduct a series of ablation studies on the PACS dataset with Resnet-50 and provide analysis to further understand our method. Ablation studies. In Table 2, the first row shows the result of the baseline method developed in [21]. The second row shows the result of naive training of the sensor-actuator network: specifically, we construct four networks with the same structures as ΦG, Φcorr, ΦCvt and ΦAct, respectively, and train them with the same loss but without the elaborate design of constraint errors. The third and fourth rows show the results achieved with only e_sc and only e_dc, respectively. The last row shows the result of the complete LIDSA method with both constraint errors. We can see that LIDSA improves the accuracy of naive training by about 0.6%. This suggests that sensing the domain drift and adjusting the features at inference time is more effective than simply training the same networks with the same loss functions. Moreover, with both constraint errors included, our sensor-actuator design improves the performance of the baseline by about 1%.
Visualization of feature adjustment. In Fig. 3(a), we provide a t-SNE visualization on the Art-painting domain of the PACS dataset to show how the proposed actuator network adaptively adjusts the features to improve the overall classification performance. The left figure visualizes the features before adjustment; the right one shows the features after adjustment by the actuator network. Colors represent classes. After the adjustment by the actuator network, the sample features are much better clustered. We also compute the mean and variance of each cluster for all four domains. After feature adjustment, the inter-domain distance between cluster means decreases by 49.3%, 47.9% and 49.5% for A-C, A-P and A-S, respectively, and the distance between cluster variances decreases by 52.5%, 51.3% and 51.4%, indicating the improved generalization performance of our method.
Source-target domain divergence. To verify our claim that the adjusted features have better generalization capabilities, we follow [33] to compute the H-divergence, which measures the distribution divergence of features between mixed source and target domains, for both the baseline method and our method on the PACS dataset. We also follow [33] in approximating the H-divergence by the Proxy A-distance (PAD) 2(1 − 2ϵ) [34], where ϵ is the test error of a binary domain classifier. Fig. 3(b) shows that our method has lower PAD values than the baseline approach for all four source-target feature pairs on the PACS dataset. This suggests that our method better mitigates the distribution discrepancy between the source domains and the target domain than the baseline method, showing better generalization capabilities.
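The PAD formula is simple enough to sanity-check directly; the sketch below shows its two extremes (this only evaluates the formula, not the domain classifier itself):

```python
def proxy_a_distance(eps):
    """PAD = 2 * (1 - 2 * eps), where eps is the test error of a binary
    domain classifier trained to separate source from target features."""
    return 2.0 * (1.0 - 2.0 * eps)

# A classifier at chance (eps = 0.5) gives PAD = 0: the domains are
# indistinguishable. A perfect classifier (eps = 0) gives the maximum of 2.
assert proxy_a_distance(0.5) == 0.0
assert proxy_a_distance(0.0) == 2.0
assert proxy_a_distance(0.25) == 1.0
```

Lower PAD therefore means the domain classifier struggles to tell adjusted source features from adjusted target features, i.e., a smaller residual distribution gap.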

CONCLUSION
In this work, we explore the concept of sensor-actuator design in adaptive control to address the domain drift problem and develop a new approach, called learning inference-time drift sensor-actuator, for domain generalization. We have developed the drift sensor network, which consists of a constraint network and a data converter. The constraint network is learned to sense the domain drift by detecting the deviation from two constraints: the structural and distribution constraints. The data converter network then maps these constraint errors into an effective guidance signal, which is able to guide the actuator network to adjust the feature to achieve significantly improved discrimination power and better generalization performance. Experimental results on benchmark datasets demonstrate that our proposed LIDSA approach improves the domain generalization performance over the baseline method.
Fig. 3. Further analysis of our proposed LIDSA method: (a) t-SNE visualization of feature adjustment; (b) Proxy A-distance.

Correlation between constraint error and classification error. For further insight into the constraints, we investigate the reliability and effectiveness of both constraints in detecting domain drift by analyzing the correlation between the constraint errors and the classification errors of target samples. To evaluate this correlation, we use the cross-entropy loss as the classification error. For the structural constraint error, we measure its magnitude |e_sc| as |e_sc| = 1 − (f^p · γ_sc)/(∥f^p∥ · ∥γ_sc∥). For the distribution constraint error, we measure its magnitude |e_dc| with the L1-norm, |e_dc| = ∥e_dc∥₁. In Fig. 4, we can see that both |e_sc| and |e_dc| are highly correlated with the cross-entropy loss; the correlation coefficients are 0.8488 and 0.797, respectively. This strong correlation suggests that it is reasonable and effective to use the structural constraint error e_sc and the distribution constraint error e_dc to capture the domain drift and provide effective guidance signals for feature adjustment.
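The two magnitude measures translate directly into code (the example vectors are illustrative):

```python
import numpy as np

def structural_error_magnitude(f_p, gamma):
    """|e_sc| = 1 - (f^p . gamma_sc) / (||f^p|| * ||gamma_sc||)."""
    cos = float(f_p @ gamma) / (np.linalg.norm(f_p) * np.linalg.norm(gamma))
    return 1.0 - cos

def distribution_error_magnitude(e_dc):
    """|e_dc| = L1 norm of the distribution constraint error."""
    return float(np.abs(e_dc).sum())

gamma = np.array([1.0, 0.0, 1.0, 0.0])
# A projected feature aligned with gamma_sc has near-zero magnitude ...
assert structural_error_magnitude(gamma, gamma) < 1e-9
# ... while an orthogonal (fully drifted) projection reaches 1.
assert structural_error_magnitude(np.array([0.0, 1.0, 0.0, 1.0]), gamma) == 1.0
assert distribution_error_magnitude(np.array([0.5, -0.5])) == 1.0
```

Both magnitudes grow as the sensed drift grows, which is what makes them usable as scalar proxies when correlating against the cross-entropy loss.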

Fig. 4. Correlations between the constraint errors and the classification error on the Art-painting domain of the PACS dataset.