Effects of Loss Function Choice on One-Shot HSI Target Detection With Paired Neural Networks

Implementing reliable few-shot-capable classifiers and detectors is no trivial task and often requires searching a large space of hyperparameters and training-routine choices to find the best fit. One such choice is the loss function itself. In this effort, we study the validation and test performance of paired neural network (PNN) architectures trained with contrastive, hard-triplet, and semihard-triplet losses. We test these by training multiple models to perform one-shot target detection on a custom synthetic hyperspectral image (HSI) dataset with and without reflectance calibration. We find that no single loss function is superior across all data treatments, and standard scoring metrics can even disagree on the best loss function across differing train, validation, and test split choices. We additionally analyze differences in detection map quality for selected test examples, illustrating that while most maps are useful, some models yield more intuitive detection thresholds. Our work suggests that multiple loss functions should be considered each time a new dataset and task are encountered when training PNNs for HSI target detection. These findings indicate significant variability in one-shot target detection performance depending on the combination of training loss and data treatment, but suggest that the semihard-triplet loss, combined with a relatively simple reflectance calibration of the imagery, tends to generalize best across the common set of target materials studied.

Prior work has shown that paired neural networks (PNNs) can outperform the adaptive cosine estimator (ACE) when the target library is limited [1]. This modeling approach's success is important because typical matched-filter methods, such as ACE, rely on increasingly large spectral libraries for template matching, which may not always be available for targets of interest. A thorough example of a hyperspectral image (HSI) processing system based on ACE detection can be seen in [2], where a great deal of care is given to building a target signature library and augmenting its representation scene-by-scene. By designing a model that does not assume a fixed target set and generalizes over many sources of spectral variability, the library scheme can, thus, be simplified. However, the work by Anderson et al. [1] raises multiple questions regarding which training routines and hyperparameters result in a model that is well-conditioned across a range of targets and scene conditions. Subsequent work illustrated the utility of combining uncertainty quantification and semisupervised learning techniques to suppress false positives and leverage unlabeled data, respectively [3]. While these are novel combinations for few-shot HSI target detection, they only increase the dimensionality of the hyperparameter space that must be searched to find an optimal target detection model and do not address consistency in target performance.
Originally, contrastive learning began as a supervised technique in which one constructs pairs or triplets of labeled training examples to teach the network which examples are similar or dissimilar. These techniques saw success in areas such as image classification and facial recognition [4], [5]. More recent studies have aimed to perform unsupervised contrastive learning for image classification and detection, utilizing instance discrimination and clustering to derive meaningful pairs or comparisons from unlabeled data [6], [7], [8], [9], and there are recent examples of contrastive HSI classification utilizing methods with varying degrees of supervision [10], [11], [12], [13]. Most of these methods focus on broad terrain classification over a limited (or sometimes single) set of scenes, as opposed to rare target detection and classification across many scenes with varied conditions, which we study here. For example, Chen et al. [14] propose a novel approach utilizing a linear combination of a spectral and a clustering contrastive loss on augmented data from a single scene. However, the testing in this case is only applied to a single target class in that same scene. It is unclear how this and similar approaches perform when applied to different, unknown scenes and targets. Similar approaches raising the same generalization questions can be seen in [13], [15], and [16]. Others have applied contrastive HSI approaches to multiple targets, albeit for terrain classification where no class was held out for testing as a few-shot example [17]. In general, we find that detecting new targets in data collected under conditions not observed during model training is an understudied yet practically relevant problem that should be addressed.
As a subdiscipline of machine learning (ML) that aims to transform examples in the data into an abstract, learned embedding space or representation, contrastive learning with PNNs allows new, previously unseen examples to be compared, enabling few-shot learning and classification. This is a desirable trait for HSI target detection and identification, as the combinatorial space of target variance, scene conditions, and the exponential number of target materials is far too large to capture by training any single fixed-class classifier. However, unsupervised learning with HSI poses challenges with respect to ensuring valid pairs can be sampled from pixels within the same image. Thus, we continue these studies with the supervised approach. In addition, to focus our study on loss functions and their effects across data domains, we do not employ previously utilized semisupervised techniques, such as exponential average adversarial training [3], [18]. Our study's contributions focus on PNN one-shot and few-shot detection sensitivity when adjusting the following major training variables: 1) PNNs trained with three different contrastive loss variants under different hyperparameter samples; 2) PNNs trained with data collected over the complete set of atmospheres and times of day versus incomplete or limited data; and 3) PNNs trained on raw at-sensor radiance data versus reflectance-calibrated data. This presents novel findings on the following: 1) PNN generalization to multiple one-shot detection targets in new scene conditions; 2) PNN generalization to these targets under different data treatment scenarios; and 3) an effective model selection routine resulting in PNNs with reliable one-shot detection performance. These findings extend the practical applications of PNNs as HSI target detection tools by testing their sensitivity to training loss, training data, data calibration, and different targets in varied HSI scene conditions.

A. PNN Architecture
With the aim of using some of the more advanced components available for designing neural networks, we developed a modular architecture based on the schematic shown in Fig. 1. Here, we primarily utilize dropout and layer normalization with learned affine parameters for all models [19], [20]. With this architecture, two examples from the data can be processed in parallel and have their output embeddings compared, typically via some distance or similarity metric.
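For concreteness, a minimal PyTorch sketch of one embedding branch and its paired wrapper is given below. The block sizes, dropout rates, and the EmbeddingNet/PairedNet names are illustrative assumptions rather than our exact implementation; the intent is only to show the fully connected block structure with layer normalization (learned affine parameters) and dropout.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """One branch of the PNN: fully connected blocks with layer normalization
    (learned elementwise affine parameters) and dropout, plus an output block."""
    def __init__(self, n_bands=107, hidden=(256, 128, 64), embed_dim=64,
                 dropouts=(0.625, 0.5, 0.375)):
        super().__init__()
        layers, d_in = [], n_bands
        for d_out, p in zip(hidden, dropouts):
            layers += [nn.Linear(d_in, d_out),
                       nn.LayerNorm(d_out, elementwise_affine=True),
                       nn.ReLU(),
                       nn.Dropout(p)]
            d_in = d_out
        layers += [nn.Linear(d_in, embed_dim)]  # output/embedding block
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class PairedNet(nn.Module):
    """Processes two spectra in parallel with shared weights and returns both embeddings."""
    def __init__(self, branch):
        super().__init__()
        self.branch = branch

    def forward(self, x1, x2):
        return self.branch(x1), self.branch(x2)

# Example: compare a batch of pixel spectra against library spectra.
model = PairedNet(EmbeddingNet())
z_pix, z_lib = model(torch.randn(8, 107), torch.randn(8, 107))
distance = torch.norm(z_pix - z_lib, dim=1)  # Euclidean distance between embeddings
```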

B. Loss Objective Functions
To train a PNN in a supervised manner, one must select an appropriate loss objective function for gradient backpropagation. We focus on some of the more common implementations, although we do note the existence of many variants in the literature [21], [22], [23], [24].
1) Contrastive Loss: Early implementations based on image classification utilized an intuitive objective often referred to as the contrastive loss [25], wherein pairs of samples are considered for training

L_c = y D_e^2 + (1 - y) [max(0, m - D_e)]^2    (1)

where D_e is the Euclidean distance between two embedded examples, m is a margin parameter, and y is a binary indicator that the training pair is either the same (y = 1) or different (y = 0). This loss directly trains the model to recognize like and unlike examples. For our HSI workflows, we sample pixels and pair them with signature library members as the prototypes.
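A minimal sketch of this loss, following the convention above (y = 1 for same-class pairs), is shown below; it assumes the embeddings come from a paired network such as the one sketched earlier, and the function name is illustrative.

```python
import torch

def contrastive_loss(z1, z2, y, margin=1.0):
    """Contrastive loss: y = 1 for same-class pairs, y = 0 for different pairs.
    z1, z2: (N, D) embeddings; y: (N,) float tensor of {0, 1} indicators."""
    d_e = torch.norm(z1 - z2, dim=1)                              # Euclidean distance D_e
    pos = y * d_e.pow(2)                                          # pull same-class pairs together
    neg = (1.0 - y) * torch.clamp(margin - d_e, min=0.0).pow(2)   # push different pairs past margin m
    return (pos + neg).mean()
```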
2) Triplet Loss: First introduced in the context of facial recognition tasks, the triplet loss is another objective function that simultaneously maximizes the distance between anchor-negative pairs and minimizes the distance between anchor-positive pairs [5]. Triplet loss considers three samples at a time for training: 1) a query point; 2) a positive anchor; and 3) a negative anchor. Thus, the loss function on any anchor (x_a), positive (x_p), and negative (x_n) triplet combination is

L_t = max(0, ||f(x_a) - f(x_p)||^2 - ||f(x_a) - f(x_n)||^2 + α)    (2)

where f(x_i) is the embedding for example x_i and α is a margin factor.
3) Hard- and Semihard-Triplet Loss: While the triplet loss is a robust objective function, it is not always helpful to calculate it for every sampled triplet, as there are exponentially many example pairs or triplets within a dataset. Instead, we utilize online mining strategies during training to reduce the number of sampled triplets. In this way, one can more effectively reduce the number of trivial combinations given a particular model embedding. One such algorithm is the hard-triplet loss, which only computes L_t for triplets in the minibatch that produce the largest anchor-positive distance and smallest anchor-negative distance, positing that these triplets produce gradients that result in a more discriminatively powerful model. This approach, however, tends to disregard much of the data in the training set, as many examples do not fall into this category. A "middle ground" approach is the semihard-triplet mining strategy. Here, all anchor-positive pairs are selected for the loss, whereas chosen anchor-negative pairs satisfy the following:

||f(x_a) - f(x_p)||^2 < ||f(x_a) - f(x_n)||^2 < ||f(x_a) - f(x_p)||^2 + α.    (3)

These anchor-negative pairs, while not explicitly being the hardest, are often referred to as semihard.
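A hedged sketch of the triplet loss (2) together with a simple per-anchor semihard negative selection satisfying (3) is shown below; the batch layout and the fallback to the hardest candidate when no semihard negative exists are illustrative assumptions about one possible online mining implementation.

```python
import torch

def triplet_loss(z_a, z_p, z_n, alpha=1.0):
    """Triplet loss on embeddings of anchors, positives, and negatives (squared distances)."""
    d_ap = (z_a - z_p).pow(2).sum(dim=1)
    d_an = (z_a - z_n).pow(2).sum(dim=1)
    return torch.clamp(d_ap - d_an + alpha, min=0.0).mean()

def semihard_negatives(z_a, z_p, z_cand, alpha=1.0):
    """For each anchor, pick a candidate negative that is farther than the positive
    but still inside the margin: d(a, p) < d(a, n) < d(a, p) + alpha."""
    d_ap = (z_a - z_p).pow(2).sum(dim=1, keepdim=True)   # (N, 1) anchor-positive distances
    d_an = torch.cdist(z_a, z_cand).pow(2)               # (N, M) anchor-candidate distances
    semihard = (d_an > d_ap) & (d_an < d_ap + alpha)
    # Mask out non-semihard candidates; fall back to the closest negative if none qualify.
    masked = torch.where(semihard, d_an, torch.full_like(d_an, float("inf")))
    idx = torch.where(semihard.any(dim=1), masked.argmin(dim=1), d_an.argmin(dim=1))
    return z_cand[idx]
```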

A. Hyperspectral Dataset and Treatment
1) DIRSIG Megascene Hyperspectral Image Rendering:
For a common dataset with HSI target detection domain relevance, we utilize the Rochester Institute of Technology's Digital Imaging and Remote Sensing Image Generation (DIRSIG) software to synthesize a set of synthetic HSI containing pixelwise abundances and labels for intentionally placed materials in the scene. For all experiments, we utilize DIRSIG Megascene, a large model of a Rochester, New York suburb, as the underlying scene design and place solid reflective disk targets as custom additions [1], [26]. We simulate the HSI collection under three times of day (1200, 1430, and 1545) and three MODTRAN atmospheres, mid-latitude summer (MLS), subarctic summer (SAS), and tropical (TROP), to generate a total of nine images or scenes [27]. Details on reflectance calibration as well as other scene and target parameters are described in [1]. Utilizing the imagery both with and without reflectance calibration provides two imagery sets, which we may use to test model generalization and training characteristics.
2) Target Signature Representation: As a signature library to represent prototypes (contrastive loss), anchors (triplet losses), and our one-shot examples for testing, we utilize "pure" DIRSIG-rendered spectra for each material, with an applied reflectance calibration for reflectance-based models, or convert these reflectance spectra to radiance assuming the MLS atmosphere, 1200 time-of-day, and the latitude and longitude associated with Megascene. Similar to the derivation of the calibration factor in [1], we derive a simple at-sensor radiance factor that could be computed at inference in a practical setting

K_r = E_toa t_1 t_2    (4)

where E_toa is the top-of-atmosphere radiance in W·m^-2·μm^-1·sr^-1, t_1 is the sun-to-ground transmissive attenuation, and t_2 is the ground-to-sensor transmissive attenuation. While no angular reflective attenuation is considered here, it can easily be added; we surmise that its effect is negligible given that it is just a scaling constant based on the assumed time-of-day and nadir sensor position. The resulting radiance library spectrum for a given material is then a simple product of K_r and its reflectance interpolated to the Megascene dataset bands.
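As a simple illustration of this conversion, the sketch below interpolates a reflectance library spectrum to the scene's band centers and scales it by a per-band factor K_r. The array names and example values are assumptions for illustration only, not our exact pipeline.

```python
import numpy as np

def library_to_radiance(wl_lib, refl_lib, wl_scene, k_r):
    """Interpolate a 'pure' reflectance library spectrum to the scene band centers
    and scale it by the per-band at-sensor radiance factor K_r."""
    refl_on_bands = np.interp(wl_scene, wl_lib, refl_lib)   # resample to scene bands
    return k_r * refl_on_bands                               # radiance-domain library spectrum

# Illustrative usage with made-up wavelengths (micrometers) and factor values.
wl_lib = np.linspace(0.4, 2.5, 2101)       # library wavelength grid
refl_lib = np.random.rand(2101)            # placeholder reflectance spectrum
wl_scene = np.linspace(0.4, 2.5, 107)      # 107 scene bands, matching the dataset
k_r = np.random.rand(107) * 100.0          # placeholder per-band scaling factor
lib_rad = library_to_radiance(wl_lib, refl_lib, wl_scene, k_r)
```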

B. Hyperparameter Sampling and Training
1) Data Sampling Parameters:
To train PNNs, one must derive pairs or triplet tuples from the data. We sample these in an online fashion every 10 epochs, with a fixed number of samples (N_m = 4096) per training class or material. For contrastive models, we sample N_m positive and N_m negative pairs for each training material. For triplet loss models, we sample N_m anchor, positive, and negative tuples for each material. In either case, batch sizes were set to 8192 tuples.
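A sketch of one possible per-material pair sampler is given below; the array layout (per-pixel material labels and a library dictionary keyed by material) and the function name are assumptions, but the sampled quantities follow the N_m convention above.

```python
import numpy as np

def sample_contrastive_pairs(pixels, labels, library, material, n_m=4096, rng=None):
    """Sample N_m positive and N_m negative (library prototype, pixel) pairs for one material.
    pixels: (P, bands) array; labels: (P,) material IDs; library: dict of prototype spectra."""
    rng = rng or np.random.default_rng()
    pos_idx = np.flatnonzero(labels == material)
    neg_idx = np.flatnonzero(labels != material)
    pos = pixels[rng.choice(pos_idx, n_m, replace=True)]
    neg = pixels[rng.choice(neg_idx, n_m, replace=True)]
    anchor = np.repeat(library[material][None, :], n_m, axis=0)   # prototype spectrum
    x1 = np.concatenate([anchor, anchor])                         # library side of each pair
    x2 = np.concatenate([pos, neg])                               # pixel side of each pair
    y = np.concatenate([np.ones(n_m), np.zeros(n_m)])             # 1 = same, 0 = different
    return x1, x2, y
```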
2) Training, Validation, and Testing Splits: An appropriate design for assessing one- to low-shot performance in the context of HSI detection is not trivial. We must consider several axes of the data: scene conditions, spatial extent, and materials (see Fig. 2). For materials, we follow the split in Table I. It is worth noting that the validation materials do appear as background but are not explicitly sampled as a positive pair. Thus, we generally consider these examples to be low-shot examples. For testing, all materials but Black Oak Leaf appear only in the right half of the images and are, thus, considered one-shot test classes that are never seen during model training. This natural placement of materials is fortuitous, as these materials also sample a large amount of in-scene variance.
3) Metrics and Selection Criteria: Because we treat each pixel as an example to detect and identify, and are not utilizing spatial information in our models for tasks such as segmentation, we rely on two commonly used classification metrics for tracking validation performance: 1) area under the receiver operating characteristic curve (AUROC) and 2) area under the precision-recall curve (AUPRC). We calculate each metric with a "one-versus-rest" approach for each validation material at each epoch and then track the average AUROC and AUPRC to generate the best model checkpoints. Epochs where an AUROC checkpoint occurs are marked by the dashed lines in Fig. 3 for two selected models.
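The sketch below shows one way to compute these one-versus-rest averages with scikit-learn; the dictionary of per-material similarity scores is an assumed interface, not our exact pipeline.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def validation_scores(similarities, labels, val_materials):
    """Average one-versus-rest AUROC and AUPRC over the validation materials.
    similarities[m]: per-pixel similarity scores to material m's prototype."""
    aurocs, auprcs = [], []
    for m in val_materials:
        y_true = (labels == m).astype(int)        # one-versus-rest labels for material m
        y_score = similarities[m]
        aurocs.append(roc_auc_score(y_true, y_score))
        auprcs.append(average_precision_score(y_true, y_score))
    return float(np.mean(aurocs)), float(np.mean(auprcs))
```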
4) Model Hyperparameters: Due to the data sampling routines used when training our contrastive learning models, we employ a simple random grid search over hyperparameters of interest. This is favored over more advanced tuning choices due to the observed fluctuations in validation scores over the course of training for some of our models. Early stopping routines, such as the asynchronous successive halving algorithm, will terminate runs prematurely when tracking validation scores from our models [28]. An example of how scores can fluctuate over 1000 epochs is shown in Fig. 3.
Thus, we set our maximum epochs to be high enough (at least 1000) to ensure that we effectively sample enough of the training data and obtain a maximum validation score. The authors note that, in many cases, little trend is noticed for most hyperparameters except for utilizing learned elementwise affine parameters. As a general rule, we train our models with three fully connected blocks and an output block (see Fig. 1), with layer normalization and dropout values set to p_i ∈ [0.625, 0.5, 0.375, 0.25] for each succeeding layer i. Fully connected layer configurations were sampled from fixed hidden layer size sequences for each model as follows: 1) [512, 384, 256, 512]; 2) [512, 256, 128, 64]; 3) [128, 64, 32, 32]; and 4) [256, 128, 64, 64]. Each list indicates the sequential node numbers for each subsequent hidden layer, where the last entry is the output layer or embedding size produced by the model. We do note that for the MLS-1200 reflectance training set models, we held a fixed layer configuration of [256, 128, 64, 64]. Due to the low impact of model capacity (total nodes) observed across all experiments, we expect this had little to no effect on the outcomes presented here. Sampling across the different hyperparameters and three loss functions, we obtain 48 experiments each for the four dataset-experiment combinations. Additional parameters with no observed performance dependence included the following: 1) loss margin ∈ {1.0, 1.5, 2.0}; 2) learning rate ∈ {10^-3, 5 × 10^-4, 10^-4, 10^-5}; and 3) L2 weight decay ∈ {10^-2, 10^-3, 10^-4, 0}.
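The grids above translate directly into a simple random search; the sketch below draws one configuration per run and is an illustrative outline rather than our exact tooling.

```python
import random

# Hyperparameter grids listed in the text; the sampler itself is a sketch.
LOSSES = ["contrastive", "hard_triplet", "semihard_triplet"]
MARGINS = [1.0, 1.5, 2.0]
LEARNING_RATES = [1e-3, 5e-4, 1e-4, 1e-5]
WEIGHT_DECAYS = [1e-2, 1e-3, 1e-4, 0.0]
LAYER_CONFIGS = [[512, 384, 256, 512], [512, 256, 128, 64],
                 [128, 64, 32, 32], [256, 128, 64, 64]]

def sample_config(rng=random):
    """Draw one random hyperparameter configuration for a training run."""
    return {
        "loss": rng.choice(LOSSES),
        "margin": rng.choice(MARGINS),
        "lr": rng.choice(LEARNING_RATES),
        "weight_decay": rng.choice(WEIGHT_DECAYS),
        "layers": rng.choice(LAYER_CONFIGS),
    }

configs = [sample_config() for _ in range(48)]  # e.g., 48 runs per dataset treatment
```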

5) Training and Inference Runtime:
To run a single model through our experimental training, validation, and testing pipeline, the elapsed time is 3.6 ± 0.5 days for 10 000 epochs on an Nvidia V100 32-GB GPU. It should be noted that this figure is obtained from elapsed times recorded in our parallel runs, which train upward of 16 models at a time across 8 × V100s on a single node. During inference, for a small spectral library (n = 5) and an image chip of spatial size 115 × 80 with 107 bands, we generate detection maps every 13.3 ± 1.5 ms on an Intel Xeon Gold 6248 CPU and every 1.2 ± 0.1 ms on an Nvidia V100.

A. Aggregated Performance Analysis
To begin, we examine the aggregated performance across all scenes and materials using a single metric, the averaged AUROC. We discuss each experiment individually and then make comparisons across experiments.
For the reflectance experiment (see Table II), the semihard-triplet loss attains the highest reproducible average score on the validation and test sets for the two data-splitting paradigms, except in the case of the validation set across all scenes. Here, the contrastive loss holds a slight advantage, though the semihard-triplet loss tends to be more consistent, with a lower standard deviation between runs. What is also notable is that both triplet losses' models tend to generalize better to the test set on average, whereas the contrastive models exhibit a performance decrease. Because backpropagation in the triplet networks is driven by both a positive and a negative example for each batch member, we expect this to be the case: notionally, these triplet loss models should learn a more general and discriminative set of spectral features than the contrastive case, where each batch member is based solely on a positive or negative pair. Another notable observation is that the use of more scene data does not appear to aid performance on the test set.
The radiance experiments offered some differences (see Table III), likely due to the higher variability in the background compared with the reflectance-calibrated data. In these experiments, the contrastive loss emerges as the highest-performing loss on average. However, in most cases, generalization to the test set is much poorer, apart from the hard-triplet loss models trained on all available scene data. In that case, the models tended to increase performance from validation to test, indicating that, in a much noisier data space, hard-triplet losses can benefit from the straightforward inclusion of more data.
Nevertheless, it is apparent that the applied reflectance calibration improves performance for all model types. The contrastive loss, as we have implemented it here, seems to be a more general choice for either data domain, indicating that it is a good starting point for early experiments. However, the flip from semihard-triplet to contrastive and the better test performance of the hard-triplet models in the all-scenes radiance experiment indicate that no loss choice should be eliminated when tuning models on these tasks. Simple changes in dataset design and treatment can have drastic impacts on performance across the loss functions. The choice of the loss function itself under these one- to low-shot paradigms appears to behave more like a hyperparameter than a broadly applicable ML design choice.

B. Validation-Test Generalization per Material
To analyze how reliably PNNs generalize to multiple new targets, we look at the average AUROC on a per-material basis between the validation and test sets. For these, we focus on the better performing, MLS-1200-only training set runs for radiance and reflectance. We select the best 10 models for each experiment by sorting on the best validation AUROC average score. This gives a more appropriate view of how selected or checkpointed models would perform. We then compute the mean and standard deviation of those models' performance on a per-material basis for both the validation and test materials. Interestingly, there is not much disparity in performance on the validation materials (see Fig. 4). However, when generalizing to the one- to low-shot examples in the test set, reflectance-calibrated data hold the clear advantage with higher and more consistent (lower standard deviation) performance. The exception is the gold roof shingle test target, for which the radiance experiment models hold a slight advantage; this may be due to the reflectance calibration washing out minor features, making detection more difficult in that space.

C. PD at CFPR Analysis on Test Results
As a better diagnostic tool for understanding how a model may perform when mitigating false positives, we take the recall value at the detection threshold corresponding to a fixed false-positive rate on the ROC curve as our probability of detection (PD) at constant false-positive rate (CFPR) metric. The differences in detection performance under this constraint are stark (see Fig. 5). Except for the gold roof shingle material at a 5% CFPR detection threshold [see Fig. 5(a)], the reflectance-trained and tested models provide a far better capability under these constraints. As the detection threshold becomes more stringent at 1% CFPR [see Fig. 5(b)], the reflectance models exhibit superior PD scores across all targets, whereas the radiance models lose nearly all capability for some of the test targets. This underscores how important the reflectance calibration is for these models to generalize well, and how comparisons across many kinds of materials are needed to fully understand the performance of models with varying hyperparameters.
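Computing this metric from per-pixel scores is straightforward; a minimal scikit-learn sketch is shown below, with the function name and score layout assumed for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

def pd_at_cfpr(y_true, y_score, fpr_max=0.05):
    """Probability of detection: the largest true-positive rate (recall) among ROC
    operating points whose false-positive rate does not exceed fpr_max (e.g., 5% or 1%)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    allowed = fpr <= fpr_max
    return float(tpr[allowed].max()) if allowed.any() else 0.0
```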

D. Detection Map Quality for Synthetic Yellow-Paint Targets
To better understand how detections are made, we select a small image chip from the test half of the MLS-1200 scene containing two yellow-paint disk targets in a somewhat cluttered residential setting with many other reflective objects. The single-band 1.05-μm reflectance image is shown in Fig. 6 along with ground truth and similarity maps obtained from the best MLS-1200-trained reflectance models for each loss type. We note that, while we have some understanding of an appropriate similarity threshold for detecting known targets, we cannot know this threshold at inference for unknown targets. Thus, in the absence of automatic methods, we must rely on the somewhat subjective choice of a similarity threshold t_s on the interval [0, 1]. In Fig. 6, we see that all loss functions produce a reasonable amount of spatial contrast in the similarity map. The semihard-triplet model appears to do the best job of driving the similarity of nontarget pixels toward zero while elevating the paint disks' appearance. The contrastive model appears to elevate similarities to other more reflective materials, such as nearby rooftops, and the hard-triplet model's map highly elevates the paint disk similarities but exhibits a distribution of scores for all other materials, retaining much of the spatial information of the original image. Nevertheless, a precision score of 1.0 can be obtained for all three models with careful tuning of t_s for this limited example (see Fig. 7). In this detection plot, the lowest value of t_s is used to obtain a precision score of unity. At the minimum t_s, the semihard-triplet model appears to have the fewest false negatives, whereas the hard-triplet model has the most. Again, this suggests that the semihard-triplet loss produces a more discriminative model.
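For a labeled chip like this one, the lowest t_s that yields unit precision can be found with a simple scan over candidate thresholds, as sketched below; the function name and step size are assumptions for illustration.

```python
import numpy as np

def lowest_threshold_for_unit_precision(sim_map, truth_mask, step=0.001):
    """Scan similarity thresholds t_s on [0, 1] and return the lowest value for which
    every detected pixel (sim >= t_s) is a true target pixel (precision = 1).
    sim_map: (H, W) similarity scores; truth_mask: (H, W) boolean target mask."""
    for t_s in np.arange(0.0, 1.0 + step, step):
        detected = sim_map >= t_s
        if detected.any() and not (detected & ~truth_mask).any():  # detections, no false positives
            return float(t_s)
    return None  # no threshold achieves unit precision with at least one detection
```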

V. CONCLUSION
To conclude, this study sought to determine the sensitivity of PNN one- to low-shot target detection performance when varying training loss functions, data domains, and training data size. Foremost, we note that because shifts in the best-performing loss depend on the size of the training data and the data domain, each loss should be considered whenever the desired training, validation, and testing experimental design changes and/or the data domain changes. For the losses studied here, this is straightforward to parameterize when sampling training tuples on the fly, as in our model development pipeline. For the general case of performance on our dataset, the best test performance model (selected via its validation AUROC) was a semihard-triplet loss model trained only with MLS-1200 reflectance-calibrated data. For relevant performance scoring, as shown in Fig. 5, calibrating to reflectance space provides the best, most consistent performance across all test materials and models on average. However, this is a highly aligned test dataset. It would be useful to extend this study to a larger array of new targets in other synthetic or real HSI to obtain a broader view of generalization and to reason statistically about the features or bands with which the models may struggle. Regardless, this work exhibits promising generalization and performance across multiple held-out targets in new scene conditions, a task not typically studied by most HSI target detection routines.
For detection performance with these models, our selected scenario in Fig. 7 demonstrates that high performance can be achieved on new targets if a good threshold can be determined. In general, the automatic selection of this threshold would be based on something like an ROC curve for a target known a priori. But for the scenario of one-shot detection considered here, we assume no known, labeled target data are available; thus, a particular novel target's detection threshold cannot be known. Future work should investigate better automatic thresholding measures, such as data-driven measures leveraging similar known targets or statistical analysis of the similarity distributions for each new target. In addition, these models rely on spectral information only to make a classification. Utilizing architectures that leverage spatial as well as spectral information could improve target generalization, as the model would not be so reliant on spectral information alone.

Fig. 1. Schematic illustrating the different modes of operation for contrastive loss training, paired inference, and triplet loss training. Note that for contrastive loss L_c, y ∈ {0, 1} for different or same pixel x and library a. For triplet loss L_t in this schematic, library spectrum a corresponds to the anchor, pixel x corresponds to the positive w.r.t. the anchor's class, and pixel k corresponds to the negative w.r.t. the anchor's class.

Fig. 2. Spatial and scene condition splits. All right halves are always used for testing. The MLS-1200 left half is always used for training and validation. The left halves of all other scenes are optionally used for training and validation.
TABLE I TRAINING MATERIAL SPLITS BETWEEN TRAINING, VALIDATION, AND TEST IMAGERY

Fig. 3. Validation curves tracking average AUROC across all materials for a regular (blue curve) and an irregular (red curve) model. The epoch where the best score was achieved is marked with a dotted line. Selected curves are taken from models trained on the reflectance-calibrated data.

Fig. 4. (a) Per-material validation and (b) test AUROC average comparisons for the top 10 MLS-1200-trained reflectance and radiance experiment models.

Fig. 5. (a) PD at 5% CFPR and (b) PD at 1% CFPR detection comparison for test materials. Bar heights are the mean PD values across the 10 top-performing models on the validation set, and error bars are the standard deviation.

Fig. 6. Single-band image chip from the MLS-1200 test set (left), ground truth, and similarity maps for each of the best models trained with each loss for detecting yellow paint disks in MLS-1200 reflectance-calibrated data. All images are shown in grayscale, with black representing small values and white representing larger values.

TABLE II COMPARISON OF THE REFLECTANCE EXPERIMENTS' AVERAGED AUROC ACROSS SCENES AND MATERIALS AND ALL MODELS TRAINED
TABLE III COMPARISON OF THE RADIANCE EXPERIMENTS' AVERAGED AUROC ACROSS SCENES AND MATERIALS AND ALL MODELS TRAINED