PrivacyProber: Assessment and Detection of Soft-Biometric Privacy-Enhancing Techniques

Soft-biometric privacy-enhancing techniques represent machine learning methods that aim to: i) mitigate privacy concerns associated with face recognition by suppressing soft-biometric attributes in facial images (e.g., gender, age, ethnicity) and ii) make unsolicited extraction of sensitive personal information infeasible. Because such techniques are increasingly used in real-world applications, it is imperative to understand to what extent the privacy enhancement can be inverted and how much attribute information can be recovered from the privacy-enhanced images. While these aspects are critical, they have not been investigated in the literature so far. In this paper, we therefore study the robustness of state-of-the-art soft-biometric privacy-enhancing techniques to attribute recovery attempts. We propose PrivacyProber, a high-level framework for restoring soft-biometrics from privacy-enhanced images and apply it for attribute recovery in comprehensive experiments on three public face datasets (LFW, MUCT, and Adience). Our experiments show the proposed framework is able to restore a considerable amount of suppressed information, regardless of the privacy-enhancing technique used (e.g., adversarial perturbations, conditional synthesis, etc.), and that there are significant differences between the considered privacy models. These results point to the need for novel mechanisms to improve the robustness of existing techniques and secure them against adversaries trying to restore suppressed information. We also demonstrate that PrivacyProber can be used to detect privacy enhancement (under black-box assumptions) with high accuracy.


INTRODUCTION
Facial images represent a rich source of information, from which a multitude of attributes can be extracted automatically using contemporary machine learning models, including gender, age, ethnicity, affective state, or even body mass indices [1], [2], [3], [4], [5], [6], [7], [8]. However, with the rapid proliferation of automatic recognition techniques in facial analytics, privacy-related concerns have also emerged and now represent a key challenge with regard to the trustworthiness of the technology [9], [10].
To address these concerns, researchers are increasingly looking into privacy-enhancing mechanisms capable of ensuring a trade-off between the utility of the data for facial analytics, on the one hand, and privacy protection, on the other [11], [12], [13], [14], [15], [16]. Examples of such mechanisms include: (i) deidentification techniques that try to conceal the identity of the subjects in the data, while retaining other useful attributes and information [13], [17], [18], but also (ii) so-called soft-biometric privacy-enhancing techniques, which aim to perturb (transform, modify) facial images in a way that makes it difficult (or impossible) for automatic machine learning techniques to infer sensitive attributes (e.g., gender, age, ethnicity), while making only minimal changes to the visual appearance of the images. While a considerable amount of research has been done on deidentification technology, soft-biometric privacy-enhancing techniques represent a relatively new research topic that has so far been underexplored.
The importance of soft-biometric privacy-enhancing techniques is highlighted through various application scenarios where concealing soft-biometric attributes, such as demographics, is critical. For example, when sharing personal images online (e.g., on social media), soft-biometric privacy enhancement can help to avoid automatic user profiling, attribute extraction and other unsolicited forms of processing. Additionally, in face verification systems, where only the identity information is needed, such techniques can conceal information on other attributes that are not required for verification purposes. Such privacy-enhancing mechanisms represent pivotal tools for addressing societal expectations about the appropriate use of personal data, but also for meeting safeguards and standards defined in data-protection legislation and privacy acts, such as GDPR [19], CCPA [20], BIPA [21] and others [13]. As noted in recent surveys on biometric privacy [13], [22], other types of techniques exist that deal with different aspects of privacy (e.g., encryption, model training, synthetic data, etc.). However, the focus of this paper is exclusively on soft-biometric privacy enhancement.
Fig. 1: The performance of soft-biometric privacy models, ψ, is commonly evaluated by comparing attribute classification performance over the original (reference) face images, I_or, and their privacy-enhanced counterparts, I_pr. In this paper we study the robustness of existing privacy models beyond such vanilla evaluation scenarios and propose an attribute recovery framework, called PrivacyProber, that allows for more comprehensive evaluations using images with reconstructed soft-biometric information, I_re. The bottom part of the figure visualizes the idea through illustrative gender predictions (i.e., probabilities of being male) and the match scores between I_or and the transformed images, I_pr and I_re. Note the difference in gender predictions for the different tasks with and without attribute recovery.

Several powerful solutions have been proposed in the literature for ensuring soft-biometric privacy and preventing automatic extraction of facial attributes, including
synthesis-based techniques [23], auto-encoder based models [11], [24] or adversarial perturbations [12], [25]. While these solutions have been shown to successfully obscure soft-biometric information for selected attribute classifiers, experimental performance evaluations are typically limited to vanilla (or zero-effort) evaluation scenarios, where no attempt is made to reconstruct the obscured attributes. Due to this practice, it is not clear if the privacy levels reported in the literature generalize to real-world applications, where a potential adversary may exploit additional knowledge and invest considerable effort and resources to recover the concealed information. Such zero-effort evaluations, hence, raise questions about the reliability of existing privacy models and their sensitivity to attribute recovery attempts. Comprehensive reliability studies are, therefore, critical for better understanding the capabilities of contemporary privacy-enhancing techniques and have implications for their deployment in practice. However, to the best of our knowledge, such studies are largely missing from the literature.
In this paper, we address this gap and explore possibilities for recovering obscured (concealed, suppressed) information from privacy-enhanced facial images with the goal of assessing the reliability and robustness of existing soft-biometric privacy models. (The term privacy model is used as a synonym for biometric privacy-enhancing technique in this paper for brevity.) To facilitate the study, we develop an attribute recovery framework, called PrivacyProber, which uses various types of image transformations (learned over clean, non-tampered images) to reconstruct suppressed information from privacy-enhanced facial images. The proposed framework is based on minimal (black-box) assumptions and requires no examples of privacy-enhanced images to restore soft-biometric information, which makes it applicable to a wide variety of (conceptually different) soft-biometric privacy models. To demonstrate the feasibility of our framework, we conduct extensive experiments with multiple state-of-the-art privacy models over three publicly available face datasets. The results of our experiments suggest that PrivacyProber is able to recover a considerable amount of suppressed attribute information and that sensitivity to reconstruction attacks remains a considerable issue with existing privacy-enhancing techniques. As an additional contribution, we also show that the proposed framework can be used to detect privacy enhancement in facial images, pointing to another threat vector with respect to existing privacy models that allows for flagging tampered images and treating them differently from non-tampered data, e.g., using manual screening.
As part of our reliability study (illustrated in Fig. 1), we make the following main contributions:
• We conduct, to the best of our knowledge, the first comprehensive investigation into the robustness of soft-biometric privacy-enhancing techniques with respect to attribute recovery attempts. We show that despite recent progress in this area, a considerable amount of concealed information can still be recovered from (most) privacy-enhanced images, but also that there are significant differences in terms of robustness between the tested privacy models.
• We propose PrivacyProber, a high-level framework for attribute recovery from privacy-enhanced facial images. The framework relies on dedicated reconstruction schemes built around inpainting, denoising, and face parsing models as well as self-supervised adversarial defenses, and requires no access to the evaluated privacy-enhancing techniques or prior knowledge about their inner workings.
• We present a novel methodology to evaluate the robustness of soft-biometric privacy models. Specifically, we introduce an attribute-recovery robustness (ARR) score that reflects the robustness of privacy-enhancing techniques by comparing attribute-classification accuracy over reference and attribute-recovered images. We demonstrate the value of ARR scores through extensive experimentation with multiple privacy models and datasets.
• We show that PrivacyProber can be used to detect privacy enhancement in facial images and propose an original detection approach, called APEND (Evidence Aggregation for Privacy-Enhancement Detection), that consolidates differences between the probability predictions of an attribute classifier applied to facial images before and after a series of attribute recovery attempts. We show that APEND facilitates the detection of privacy-enhanced images, while ensuring highly encouraging performance when compared to related solutions from the literature.

BACKGROUND AND RELATED WORK
In this section, we provide background information on soft-biometric privacy-enhancing techniques and survey the most relevant prior work. For an in-depth coverage of the broader topic of visual privacy and privacy protection of facial images, we refer the reader to some of the excellent and comprehensive recent surveys in this area, e.g., [13], [22], [26], [27], [28], [29].

Problem Definition and Model Taxonomy
Assume an original bona-fide face image, I_or ∈ R^{w×h}, and an attribute classifier ξ_a : R^{w×h} → {a_1, a_2, ..., a_N}, with the attribute labels {a_i}_{i=1}^N corresponding to classes {C_1, C_2, ..., C_N}. Soft-biometric privacy enhancement, ψ, aims to produce privacy-enhanced images, I_pr = ψ(I_or), from which the class labels a_i cannot be correctly predicted by ξ_a with high confidence. In general, the goal of ψ is to obscure attribute information from machine learning models, but also to ensure that the appearance of the perturbed image is as close to the bona-fide one as possible, so that the visual content appears similar to human observers, i.e., min ||I_pr − I_or||_{L_p}, where ||·||_{L_p} is an L_p norm. These characteristics are illustrated on the left part of Fig. 1.
As suggested in [30], existing soft-biometric privacy models can in general be grouped into: (i) techniques that try to induce incorrect attribute predictions, and (ii) techniques that try to generate approximately equal class probabilities for all attributes. Solutions from the first group aim to enhance privacy by inducing misclassifications, i.e., ξ_a(I_pr) ≠ ξ_a(I_or), through various mechanisms, e.g., adversarial perturbations and related strategies. With these techniques, an incorrect attribute label is typically predicted from I_pr with high probability. Techniques from the second group, on the other hand, commonly rely on (input-conditioned) synthesis methods that enhance privacy by altering image characteristics in a way that makes attribute predictions unreliable, i.e., p(C_1|I_pr) ≈ p(C_2|I_pr) ≈ ... ≈ p(C_N|I_pr), where p(C_i|I_pr) represents the posterior of the attribute class C_i, computed by ξ_a, given the privacy-enhanced image I_pr. Recent techniques from both groups are discussed below.
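The success criteria of the two groups can be made concrete with a minimal sketch. The function names and the entropy-based flatness measure below are our own illustration (not part of any cited technique): group (i) succeeds when the predicted label flips, while group (ii) succeeds when the posteriors approach the uniform distribution 1/N.

```python
import numpy as np

def misclassification_success(p_or, p_pr):
    """Group (i): privacy succeeds if the label predicted from the
    privacy-enhanced image differs from the original prediction."""
    return int(np.argmax(p_pr) != np.argmax(p_or))

def posterior_flatness(p_pr):
    """Group (ii): normalized entropy of the posteriors; 1.0 means
    perfectly uniform (i.e., maximally unreliable) predictions."""
    p = np.clip(np.asarray(p_pr, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)) / np.log(len(p)))
```

For a binary gender classifier, for example, flipping the posterior from (0.9, 0.1) to (0.2, 0.8) satisfies the first criterion, while pushing it toward (0.5, 0.5) satisfies the second.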

Soft-biometric Privacy Models
A significant amount of research has been presented in the literature recently that addresses different problems related to soft-biometric privacy [11], [14], [16], [25], [31], [32], [33], [34]. Mirjalili and Ross [35], for example, were among the first to explore privacy-enhancing techniques that perturb gender information in facial images. The solution, developed by the authors, first performs Delaunay triangulation (over face feature points) to decompose faces into a set of triangles that can be manipulated with the goal of privacy protection. Next, the texture within the triangles is optimized, such that a selected gender classifier generates unreliable predictions, while the input image is perturbed ever so slightly. The authors showed that their solution results in image manipulations that obscure gender information well, while having only a minimal impact on facial appearance and, consequently, on verification performance.
Another notable technique for soft-biometric gender privacy was introduced by Mirjalili et al. in [24]. This work described so-called Semi-Adversarial Networks (SANs), i.e., machine learning models that rely on conditional image synthesis to conceal gender information in facial images. Conceptually, SANs represent convolutional auto-encoders paired with two distinct discriminators that steer the synthesis process - one for enforcing gender privacy and the second for retaining verification accuracy (i.e., appearance similarities). The proposed SAN models were demonstrated to be capable of efficiently suppressing gender information, while retaining the data utility for identity verification tasks. To improve the generalization capabilities of SANs to unseen attribute classification models, FlowSAN models [11] were introduced in follow-up work by the same authors. The main idea behind FlowSANs was to employ multiple SAN transformations one after the other, with the goal of making the privacy enhancement applicable with arbitrary gender classifiers. The FlowSAN models were demonstrated to generalize better than their predecessors, SANs, while providing a good trade-off between privacy enhancement and utility preservation.
In [14], Mirjalili et al. introduced PrivacyNet, an extended SAN model built around Generative Adversarial Networks (GANs). Unlike competing solutions, PrivacyNet was shown to be capable of soft-biometric privacy enhancement with respect to different (non-binary, continuous) facial attributes, including race, age and gender. The model, hence, generalized the fundamental idea behind SAN models to arbitrary soft-biometric attributes (as well as their combinations). We note at this point that the SAN-based family of algorithms is not based on adversarial perturbations, but relies on facial synthesis facilitated by auto-encoders driven by a number of competing discriminators.
An adversarial approach for privacy enhancement of k facial attributes (k-AAP) was described in [12]. The proposed approach tries to infuse facial images with adversarial noise with the goal of obscuring selected soft-biometric attributes, while preserving others. k-AAP relies on the established Carlini-Wagner L2 attack [36] and was demonstrated to achieve competitive results with attribute classifiers considered during the construction of the adversarial noise. Like related adversarial methods, k-AAP results in image manipulations that effectively conceal attribute information from machine learning models, while being barely detectable by human observers. Much like the original SAN models, k-AAP performs best with known attribute classifiers, but struggles somewhat with unseen classification models. A conceptually similar idea involving adversarial noise was also investigated in [25], where the Fast Gradient Sign Method (FGSM) [37] was utilized to explore the robustness of facial feature perturbations.

Evaluation of Soft-biometric Privacy Models
Quantifying the level of privacy enhancement is a challenging task and requires well-defined evaluation methodologies and corresponding performance scores that provide insight into the characteristics of the tested privacy models. To quantify performance, the majority of prior work in this area exploits automatic recognition techniques trained for extracting various facial attributes. Recognition experiments are then performed on the original and privacy-enhanced images, and differences in the observed classification accuracies are used for performance reporting [11], [14], [16], [33], [34].
Additionally, several scalar performance scores have also been proposed in the literature. Othman and Ross [23], for example, introduced gender suppression levels, a performance score for measuring the success of a synthesis-based privacy-enhancing technique with respect to the induced misclassification of gender. Dhar et al. [38] studied how state-of-the-art recognition models process soft-biometric information and explored how much sensitive information is encoded in different layers of a deep face recognition model. They proposed expressivity as a measure of how much information a given representation carries about a selected face attribute. Terhörst et al. proposed the privacy-gain identity-loss coefficient (PIC), which measures the gain of privacy with respect to a chosen facial attribute (e.g., gender) but takes the retained verification utility into account as well. The same authors also proposed the correct overall/female/male classification rates (COCR/CFCR/CMCR) [34] to score the success of privacy-enhancing algorithms. More recently, the authors of [32] proposed a set of evaluation protocols with associated performance measures to enable reproducible research on soft-biometric privacy.
While the evaluation methodologies reviewed above provide initial estimates of the performance of biometric privacy-enhancing techniques, they only assume zero-effort evaluation scenarios. In this study, we build on the presented work and propose a more comprehensive evaluation methodology that also considers attribute reconstruction attempts when scoring privacy models. We introduce a novel performance measure that captures the robustness of the models and offers insight into the difficulty of recovering concealed attribute information.

Detection of Privacy Enhancement
Soft-biometric privacy-enhancing techniques introduce changes to the visual characteristics of facial images and can, hence, be seen as a form of image tampering. While a significant amount of work has been described in the literature to detect such tampering, e.g., [39], [40], [41], such detection techniques have mainly been studied only within the information-forensics community. The problem of detecting privacy enhancement, on the other hand, is new and, except for the PREM work in [30], where a detection model based on face super-resolution and prediction divergence was presented, has not been studied widely in the open literature. Different from [30], we demonstrate in this paper that the process of attribute recovery, facilitated by the proposed PrivacyProber, can also be exploited to develop efficient privacy-enhancement detectors and that the information aggregated through different instantiations of PrivacyProber leads to highly robust performance.
Finally, we note that because some privacy models are based on adversarial perturbations, the problem of detecting soft-biometric privacy enhancements is also partially related to adversarial attack detection methods [42], [43], [44], [45], [46]. However, because soft-biometric privacy models also include synthesis-based methods (among others) - as discussed in Section 2.2 - the problem of detecting such image modifications is considerably broader.

PRIVACYPROBER
In this section, we describe the proposed PrivacyProber, a framework for the recovery of suppressed soft-biometric attribute information from privacy-enhanced facial images. We discuss multiple contributions, including novel schemes for recovering attribute information under minimal assumptions as well as a novel approach for the detection of privacy enhancement using PrivacyProber.

Overview of PrivacyProber
Existing evaluation schemes for soft-biometric privacy models typically compare the performance of an attribute classifier ξ_a(·) (common attributes considered in the literature include gender, age, or ethnicity), when applied to the original, ξ_a(I_or), and privacy-enhanced images, ξ_a(I_pr). While the observed performance difference provides a first estimate of the level of privacy ensured by the privacy models, it also assumes that no attempt is made to recover the information initially contained in the input image I_or. As a result, the privacy levels empirically determined with such vanilla evaluation methodology may be overestimated and not representative of the actual capabilities of existing privacy models. The proposed PrivacyProber, described below, tries to address this problem by facilitating performance assessments beyond zero-effort evaluation experiments. Specifically, PrivacyProber seeks to recover concealed image information by transforming privacy-enhanced images I_pr in such a way that the privacy enhancement is reversed, or formally, such that the classifier ξ_a(·) generates the same predictions for the original, I_or, and recovered images, I_re, i.e.:

ξ_a(I_re) ≈ ξ_a(I_or),

where I_re = χ(I_pr) and χ(·) is the transformation applied by PrivacyProber. By estimating the level of privacy through performance differences of the attribute classifier ξ_a(·) applied to the original, ξ_a(I_or), and attribute-recovered images, ξ_a(I_re), a more informative estimate of privacy-enhancement performance (or robustness) can be obtained. The main idea is presented in Fig. 1. While, in general, the transformation χ(·) could be learnt in a data-driven manner by defining a loss penalizing the difference in classification outputs between ξ_a(I_or) and ξ_a(χ(I_pr)), such an approach (i) is model-specific and not universally applicable, (ii) requires access to training data for each considered privacy model, and (iii) warrants prior knowledge about the internal mechanism governing the privacy-enhancement procedure. In this paper we, therefore, follow a more general approach and design PrivacyProber under the following (minimal) assumptions:
• Black-box privacy models: We assume no information about (or access to) the privacy models is available. We only exploit the fact that the privacy models aim to suppress soft-biometric information, while making minimal changes to the appearance of the images.
• Target domain: We assume that privacy enhancement is applied within a fixed image domain, i.e., on facial images, and not on images of arbitrary scenes.
Thus, we construct PrivacyProber based on a set of (i) generative and (ii) domain-specific transformations, which can be used separately or in sequence, as illustrated in Fig. 2. In the next sections we propose several possibilities for recovering suppressed soft-biometric attributes from privacy-enhanced images using such transformations.
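Since the individual recovery transformations can be used separately or chained, χ itself reduces to a sequential composition. A minimal sketch of this idea (helper names are our own):

```python
def compose(*transforms):
    """Build chi as a sequential composition of recovery transforms;
    e.g., compose(chi_in, chi_br) applies inpainting first and then
    background removal, while a single transform can be used as-is."""
    def chi(image):
        for t in transforms:
            image = t(image)
        return image
    return chi
```

Each generative or domain-specific transformation described in the following sections is then simply one candidate operation to plug into such a composition.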

Generative transformations
Soft-biometric privacy-enhancing techniques modify or partially corrupt the visual content in the original images I_or with the goal of reducing the utility of the data for some targeted attribute classification task. The use of generative transformations for mitigating such image tampering is motivated by the fact that prior information about (clean, non-tampered) original face images can be incorporated efficiently into generative models (without the need for examples of privacy-enhanced images) and exploited for attribute recovery. Similar approaches have also proven useful for protection against adversarial attacks [47], [48]. We propose three generative transformations for implementing PrivacyProber using dedicated schemes based on (i) inpainting, (ii) denoising, and (iii) image reconstruction.

Attribute Recovery Through Inpainting
Inpainting models typically aim to fill in missing pixels in a damaged image and to restore the corrupted content. Our goal, on the other hand, is to recover a complete image I_re from I_pr with attribute information restored, and not just a small portion of the data. We, therefore, design a novel inpainting-based attribute-recovery scheme for this task that sequentially inpaints small parts of the image at a time and then aggregates the results to recover the complete image. The basic assumption behind this attribute-recovery strategy is that inpainting can restore (clean) non-tampered content by inferring pixels from contextual information, even if this information was tampered with by a soft-biometric privacy-enhancing model.
The proposed recovery procedure, illustrated in Fig. 3, starts with the privacy-enhanced image I_pr ∈ R^{w×h} and a set of N binary masks B_i ∈ R^{w×h}, where w and h again denote the width and height of the image, respectively, and i ∈ {1, 2, ..., N}. The binary masks are initialized as matrices of all ones. Next, a chess-like pattern composed of multiple square regions (of size d × d, where d ≪ w, h) is constructed and placed into the initialized masks, with the pattern shifted between consecutive masks, such that all positions are traversed in both the horizontal and the vertical direction. To facilitate inpainting, the constructed binary masks B_i are then used to remove pixels from the privacy-enhanced image, i.e.:

I_pr^(i) = B_i ⊙ I_pr,

where ⊙ represents the Hadamard product and I_pr^(i) denotes the input image masked by B_i. In the next step, all pixels set to zero are reconstructed using a predefined inpainting model, which, given the set of masked images {I_pr^(i)}_{i=1}^N, generates a set of corresponding (partially) recovered images {I_re^(i)}_{i=1}^N. Finally, the complete attribute-recovered image I_re is reconstructed from the inpainted regions only:

I_re = ⟨ Σ_{i=1}^N B̄_i ⊙ I_re^(i) ⟩,

where B̄_i is an inverted version of B_i used to exclude the original (non-inpainted) areas from I_re^(i) and ⟨·⟩ represents an averaging operation over non-zero pixels. This type of averaging is needed due to the overlap in the masked regions between binary masks. We denote the presented attribute-recovery procedure as χ_in : I_pr → I_re hereafter and implement it using an off-the-shelf inpainting model.
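The masking-and-aggregation logic above can be sketched as follows (numpy-only; the inpaint argument stands in for the off-the-shelf inpainting model, and for simplicity this sketch uses four non-overlapping mask offsets, so the averaging step degenerates to a plain sum):

```python
import numpy as np

def chess_masks(w, h, d):
    """Four binary masks (ones = keep) with shifted chess-like patterns
    of d x d zero squares, so every pixel is masked exactly once.
    Assumes w and h are multiples of 2 * d."""
    masks = []
    for oy, ox in [(0, 0), (0, d), (d, 0), (d, d)]:
        B = np.ones((h, w))
        for y in range(0, h, 2 * d):
            for x in range(0, w, 2 * d):
                B[y + oy:y + oy + d, x + ox:x + ox + d] = 0.0
        masks.append(B)
    return masks

def recover_by_inpainting(I_pr, masks, inpaint):
    """chi_in: mask the image, inpaint the missing pixels, and rebuild
    the recovered image from the inpainted regions only."""
    num = np.zeros_like(I_pr, dtype=float)
    cnt = np.zeros_like(I_pr, dtype=float)
    for B in masks:
        I_hat = inpaint(B * I_pr, B)  # fill in the zeroed pixels
        B_inv = 1.0 - B               # keep only the inpainted regions
        num += B_inv * I_hat
        cnt += B_inv
    return num / np.maximum(cnt, 1.0)  # average where masks overlap
```

In practice, inpaint would wrap a pre-trained inpainting model, and overlapping mask patterns would make the final averaging over non-zero pixels meaningful.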

Attribute Recovery Through Denoising
The second proposed approach to attribute recovery is based on denoising. Soft-biometric privacy-enhancing techniques typically aim to make only minute changes to the input images, I_or, and alter their visual characteristics as little as possible. The changes introduced can, therefore, be well accounted for by the high-frequency part of the privacy-enhanced images I_pr. We model these high-frequency alterations as noise and try to remove them using a denoising procedure. Such denoising strategies have proven useful for the removal of adversarial noise [47], and are, therefore, also expected to be useful for reversing soft-biometric privacy enhancement that shares characteristics with techniques based on adversarial examples. As in the previous section, we denote the denoising-based attribute-recovery procedure as χ_d : I_pr → I_re hereafter and implement it using an off-the-shelf denoising model.
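As an illustration of χ_d, even a plain median filter can act as a stand-in for the off-the-shelf denoiser (the framework itself would use a learned denoising model; this numpy-only sketch merely shows how isolated high-frequency perturbations are suppressed):

```python
import numpy as np

def chi_denoise(I_pr, k=3):
    """chi_d sketch: sliding-window median filter that removes
    isolated high-frequency perturbations from the input image."""
    pad = k // 2
    P = np.pad(I_pr, pad, mode="edge")  # replicate borders
    h, w = I_pr.shape
    out = np.empty_like(I_pr, dtype=float)
    for y in range(h):
        for x in range(w):
            out[y, x] = np.median(P[y:y + k, x:x + k])
    return out
```

A single strongly perturbed pixel in an otherwise smooth region is removed entirely, which mirrors the intuition that sparse, high-frequency privacy perturbations can be filtered out.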

Attribute Recovery Through Adversarial Defenses
As soft-biometric privacy-enhancing techniques often rely on adversarial noise, the robustness of privacy enhancement can also be evaluated against existing adversarial defense methods (e.g., [49]). This means that the attribute-recovery component in the proposed PrivacyProber framework (whose main goal is to evaluate the robustness of privacy-enhancing methods) can also be implemented as an arbitrary adversarial defense algorithm.
Here, we consider a self-supervised approach that relies on an auto-encoder for image reconstruction, as illustrated in Fig. 4. The auto-encoder, optimized for the removal of adversarial noise from an input image, consists of an encoder E and a decoder D. The privacy-enhanced image I_pr is provided as input to E, which compresses the image into a latent representation z = E(I_pr). The latent representation z is then decoded using D to produce the reconstructed image I_re, i.e., I_re = D(z). The result of this process is, therefore, an image I_re that is typically free of the high-frequency (adversarial) noise that often serves as the means for privacy enhancement. We denote the reconstruction-based attribute-recovery procedure as χ_a : I_pr → I_re hereafter.
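The encode-decode step can be illustrated with a linear stand-in for the convolutional auto-encoder (the basis would in practice be learned over clean face images, e.g., via PCA; all names here are our own and the actual model is a trained neural network):

```python
import numpy as np

def chi_ae(I_pr, basis, mean):
    """chi_a sketch: z = E(I_pr) projects the image onto a
    low-dimensional latent space, I_re = D(z) reconstructs it,
    discarding components (e.g., high-frequency noise) that the
    learned basis does not span."""
    x = I_pr.ravel() - mean
    z = basis @ x                  # encoder E
    x_re = basis.T @ z + mean      # decoder D
    return x_re.reshape(I_pr.shape)
```

Components of the input outside the span of the basis are simply dropped during reconstruction, which is the mechanism by which the auto-encoder strips adversarial perturbations.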

Domain-specific transformation
Instead of building on generative models, trained on clean non-tampered data, to restore facial attributes, another possibility is to base attribute recovery on specifics of the targeted image domain. With facial images, for example, automatic attribute inference should be based solely on the facial region and ignore other contextual information, e.g., the background. With domain-specific transformations we, hence, aim to incorporate information on facial semantics into the attribute-recovery procedure and exclude image regions irrelevant for attribute classification from the data. We propose one domain-specific transform in this work that relies on face parsing. Specifically, given an arbitrary face parser f_p, we extract facial-part information from the privacy-enhanced image I_pr and aggregate all part labels that correspond to the facial region into a binary mask B.
The labeled facial parts produced by the face parser and the corresponding binary mask are illustrated in Fig. 5. Once the binary mask is constructed, it is utilized to exclude background pixels from the image with the goal of making attribute inference less susceptible to artifacts generated by the privacy enhancement:

I_re = B ⊙ I_pr,

where ⊙ is again the Hadamard product. This type of attribute recovery is denoted as χ_br : I_pr → I_re hereafter.
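A sketch of χ_br, under the assumption that the face parser returns a per-pixel label map (the label values and part names below are hypothetical):

```python
import numpy as np

def chi_background_removal(I_pr, part_labels, face_parts):
    """chi_br sketch: aggregate face-parser part labels into a binary
    mask B and keep only the facial region (Hadamard product B * I_pr)."""
    B = np.isin(part_labels, list(face_parts)).astype(I_pr.dtype)
    return B * I_pr
```

Any real face parser that outputs a label map (with whatever label convention it uses) can be plugged in, since only the set of labels treated as "face" needs to be specified.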

Beyond Zero-Effort Evaluation Scenarios
PrivacyProber, χ, can in general be implemented using any combination of the transformations discussed above, i.e., {χ_in, χ_d, χ_a, χ_br}, and utilized to explore the robustness of a given soft-biometric privacy model to attribute recovery attempts. A general high-level framework for using the proposed PrivacyProber in evaluation scenarios that go beyond zero-effort recognition experiments is given in Algorithm 1.
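A high-level sketch of such a beyond-zero-effort evaluation loop is given below (ψ is the privacy model, χ the chosen PrivacyProber instantiation, and ξ the attribute classifier; the ARR computation shown, an accuracy gap between reference and attribute-recovered images, is only one plausible instantiation of the score, assumed for illustration):

```python
def evaluate_beyond_zero_effort(images, labels, psi, chi, xi):
    """Compare zero-effort and attribute-recovery accuracies for a
    given privacy model psi and recovery transformation chi."""
    def accuracy(imgs):
        return sum(int(xi(I) == y) for I, y in zip(imgs, labels)) / len(labels)

    acc_or = accuracy(images)                    # reference accuracy
    privacy_enhanced = [psi(I) for I in images]
    acc_pr = accuracy(privacy_enhanced)          # vanilla (zero-effort)
    acc_re = accuracy([chi(I) for I in privacy_enhanced])  # after recovery
    arr = abs(acc_or - acc_re)  # assumed form: a small gap means the
    return acc_or, acc_pr, acc_re, arr  # privacy model was easily reversed
```

A large drop from acc_or to acc_pr with acc_re back near acc_or would indicate strong zero-effort privacy that is nevertheless not robust to attribute recovery.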

Detecting (Soft-Biometric) Privacy Enhancement
The main idea behind PrivacyProber is to reconstruct facial attribute information obscured by soft-biometric privacy models. Thus, for a given privacy-enhanced input image, I_pr, a selected attribute classifier ξ_a is expected to generate different posterior probabilities p(C_k|I_pr) than for the image processed through the proposed PrivacyProber, p(C_k|I_re), where C_k denotes the k-th attribute class (as defined in Section 2.1). By comparing the posteriors, it is therefore possible to determine whether an image has been tampered with or not. Based on this insight, we develop in this section a novel (proof-of-concept) approach for detecting privacy enhancements with the proposed PrivacyProber framework.

Fig. 6: The main idea behind the (learning-free) detection approach is to compare the posterior distributions before and after processing with different versions of PrivacyProber (PP) and exploit the (aggregated) differences in the generated distributions for privacy-enhancement detection.
As illustrated in Fig. 6, the proposed approach Aggregates evidence from multiple PrivacyProbers for Privacy-Enhancement Detection (APEND) and consists of the following steps:
• Step 1: Attribute Recovery. The input image I is processed with n different instantiations of PrivacyProber, and the attribute classifier ξ_a is applied to the input image as well as to each of the n recovered images, yielding the posterior distributions p(C|I) and {p(C|I_re^(i))}_{i=1}^n.
• Step 2: Posterior Comparison. To quantify the differences in the computed distributions, we use the Chi-square distance, which is an established measure for comparing histograms [50]:

d_i = Σ_{k=1}^N (p(C_k|I) − p(C_k|I_re^(i)))^2 / (p(C_k|I) + p(C_k|I_re^(i))),

where i ∈ {1, 2, ..., n}. See the right part of Fig. 6 for an illustration.
• Step 3: Evidence Aggregation (EA). Finally, we aggregate the evidence generated through the n attribute recovery attempts into the final detection score d_fin using a weighted linear combination, i.e.,

d_fin = Σ_{i=1}^n w_i d_i,

where w_i are balancing weights. The main motivation for the aggregation operation is to consider (complementary) recovery evidence for a more reliable detection score.
It is worth noting that APEND does not require training to facilitate tampering detection. Unlike the majority of modern detection schemes, which are trained in a discriminative manner with examples of bona-fide and tampered images, APEND is training-free and knowledge-driven, i.e., designed around the characteristics of soft-biometric privacy models.
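The comparison and aggregation steps can be sketched as follows (the 1/2 factor in the Chi-square distance is one common convention, and the concrete n, weights, and decision threshold are deployment choices, not values prescribed by the framework):

```python
import numpy as np

def chi_square_distance(p, q, eps=1e-12):
    """Chi-square distance between two discrete posterior distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * float(np.sum((p - q) ** 2 / (p + q + eps)))

def apend_score(p_input, p_recovered, weights):
    """d_fin = sum_i w_i * d_i over the n attribute recovery attempts."""
    d = [chi_square_distance(p_input, p_re) for p_re in p_recovered]
    return float(np.dot(weights, d))
```

A large d_fin indicates that attribute recovery changed the classifier's posteriors substantially, i.e., that the input image was likely privacy-enhanced.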

EVALUATION OF PRIVACY MODELS
In this section, we evaluate several state-of-the-art privacy-enhancing techniques using standard vanilla methodology as well as the proposed PrivacyProber. The goal of these experiments is to provide insight into the performance, but more importantly the robustness, of biometric privacy models, and to demonstrate the importance of experimental evaluations that go beyond zero-effort recognition experiments. Furthermore, we also demonstrate how the proposed PrivacyProber can be used to detect privacy enhancement (tampering) in facial images - something that (to the best of our knowledge) has not been attempted widely in the open literature so far.

Considered Privacy Models
Four recent (soft-biometric) privacy-enhancing techniques are implemented for the experiments, i.e., the k-AAP method from [12], the FGSM-based technique from [51], the FlowSAN approach from [11], and PrivacyNet from [14]. Because FlowSAN can strike a balance between the level of privacy protection ensured and the preserved utility of the facial images, two different versions of the model are considered: i) one with three SAN models arranged sequentially one after the other (FlowSAN-3 hereafter), and ii) one with five sequential SAN models (FlowSAN-5). All five models are trained to obscure gender information, which is also the most frequently considered attribute in research addressing soft-biometric privacy [11], [16], [23], [34]. The techniques are selected for the experiments because of their state-of-the-art performance and the fact that they rely on different mechanisms for ensuring soft-biometric privacy.
As such, they serve as a representative cross-section of existing techniques and contribute towards demonstrating the importance of evaluating biometric privacy models beyond zero-effort recognition experiments.

Datasets, Setup, and Performance Measures
To assess the performance of the considered privacy models, three publicly available face datasets are used, i.e., Labeled Faces in the Wild (LFW) [52], MUCT [53], and Adience [54]. The datasets ship with the needed gender labels and contain challenging facial images captured in a wide variety of imaging conditions. Moreover, they represent standard datasets for assessing the performance of existing privacy-enhancing techniques (e.g., [11], [14], [15], [16]) and are, therefore, also used in this work.
All images are roughly aligned prior to the experiments, such that the faces are cropped to exclude background pixels, and then rescaled to a standard size of 224 × 224 pixels. The preprocessed images are subjected to privacy enhancement and used in the experimental assessment to evaluate the following aspects of the privacy models:

(a) Level of soft-biometric privacy enhancement: The performance of the evaluated privacy models is measured through gender (g) classification experiments with predefined classifiers on the original (o) and privacy-enhanced (p) images. The level of privacy enhancement achieved is determined by comparing ROC curves generated from the two image sets. Additionally, the gender suppression rate (SR) [31] is also reported, defined in this work as:

SR = f((AUC_go - AUC_gp) / (AUC_go - 0.5)),

where AUC_go and AUC_gp denote the area under the ROC curve before and after privacy enhancement, respectively, and the normalization function f(·) corresponds to:

f(x) = min(max(x, 0), 1).

The above definition allows us to report performance for (gender) privacy models that aim to induce misclassifications (i.e., invert classifier predictions for binary problems and, hence, invert ROC curves) as well as for models trying to induce random gender-classification probabilities (i.e., targeting a random AUC score of 0.5) using a single performance measure. An SR value of 1 indicates perfect attribute (gender) suppression, whereas a value of 0 implies that the suppression has no effect. Evaluating privacy models with SR scores and the approach presented above corresponds to the zero-effort (vanilla) evaluation strategies commonly seen in the literature [11], [24].

(b) Utility preservation: Soft-biometric privacy-enhancing techniques aim to retain as much of the original image information as possible, while altering the visual appearance of the input images only slightly. In accordance with standard evaluation methodology [11], [14], [23], [34], utility preservation is, therefore, assessed through verification experiments on the original and privacy-enhanced images, where minimal differences in performance are expected. Because there exists a trade-off between utility preservation and attribute suppression, a quantitative measure taking both tasks into account is reported for the experiments. Specifically, a modified version of the privacy-gain identity-loss coefficient (PIC) is used in this work [16], [31], [32], i.e.:

PIC = SR · (1 - IL),

where the identity loss IL is defined through the degradation in verification performance after the privacy enhancement:

IL = g((AUC_vo - AUC_vp) / (AUC_vo - 0.5)),

with AUC_vo and AUC_vp denoting the verification AUC scores before and after privacy enhancement.

(c) Robustness to attribute recovery: To quantify the robustness of the privacy models to recovery attempts, the attribute recovery robustness (ARR) score is computed from images processed with PrivacyProber:

ARR = g((AUC_go - AUC_gr) / (AUC_go - 0.5)),

where the subscripts indicate that the AUC score was computed from the recovered (r), privacy-enhanced (p) or original (o) images, and g(x), i.e.,

g(x) = min(max(x, 0), 1),

ensures that robustness is measured for all privacy models on the scale [0, 1]. Thus, ARR scores serve as a measure of robustness to attribute recovery attempts and take a value of 0 if after the recovery the same performance is achieved as with the original images, and a value close to 1 if no information can be inferred from the privacy-enhanced images.

To facilitate the evaluation described above, the experimental datasets are split into training and testing parts.
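To make the AUC-based measures described above concrete, the following sketch computes SR, IL, PIC and ARR from scalar AUC values. The clipping normalization and the exact form of the PIC combination are plausible reconstructions consistent with the properties stated in the text; the paper's precise definitions may differ slightly:

```python
import numpy as np

def clip01(x):
    """Normalization assumed here for f(.) and g(.): restrict scores to [0, 1]."""
    return float(np.clip(x, 0.0, 1.0))

def suppression_rate(auc_go, auc_gp):
    """SR: 1 for perfect suppression (inverted or random AUC), 0 for no effect."""
    return clip01((auc_go - auc_gp) / (auc_go - 0.5))

def identity_loss(auc_vo, auc_vp):
    """IL: degradation in verification AUC after privacy enhancement."""
    return clip01((auc_vo - auc_vp) / (auc_vo - 0.5))

def pic(sr, il):
    """Privacy-gain identity-loss trade-off (higher is better)."""
    return sr * (1.0 - il)

def arr(auc_go, auc_gr):
    """ARR: 0 if recovery restores original performance, 1 if nothing is recoverable."""
    return clip01((auc_go - auc_gr) / (auc_go - 0.5))
```

With a near-perfect original classifier (AUC_go ≈ 1), both an inverted classifier (AUC_gp ≈ 0) and a randomized one (AUC_gp ≈ 0.5) yield SR ≈ 1, which is exactly the behavior the single-measure definition is meant to provide.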
The training parts are used to train gender classifiers for each dataset and matchers for the verification experiments. Because the datasets are unbalanced with respect to gender, the number of male and female subjects in the training and testing sets is (approximately) balanced by randomly excluding images of the more represented gender. It is also made sure that at least two images per identity are present in each set, so that genuine comparisons are possible in the verification experiments.
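The balancing step described above can be sketched as follows; the record layout (image id, subject id, gender) and the function name are hypothetical conveniences for this illustration:

```python
import random
from collections import defaultdict

def balance_by_gender(samples, seed=0):
    """Approximately balance (image_id, subject_id, gender) records by
    randomly dropping samples of the over-represented gender."""
    rng = random.Random(seed)
    by_gender = defaultdict(list)
    for record in samples:
        by_gender[record[2]].append(record)
    # Keep as many samples per gender as the least represented one has.
    n_min = min(len(v) for v in by_gender.values())
    balanced = []
    for items in by_gender.values():
        items = items[:]
        rng.shuffle(items)
        balanced.extend(items[:n_min])
    return balanced
```

In the actual protocol, the balancing is applied at the subject level rather than per image, but the principle of randomly excluding the over-represented gender is the same.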

Baseline Evaluation of Privacy Models
The first series of experiments studies the performance of the considered privacy models using standard vanilla evaluation methodology. The goal of these experiments is to establish the baseline performance of the models and explore their characteristics.

Privacy vs. Utility
From an operational point of view, a critical aspect of (soft-biometric) privacy enhancement is the trade-off between privacy protection and utility preservation the models ensure. The first part of our evaluation, therefore, looks at this trade-off through a series of recognition experiments.

1) Baseline performance. To establish the baseline performance of the privacy models in terms of attribute suppression rates, a gender classifier ξ_g (a VGG16-based model with a two-class softmax at the top) is trained for each of the datasets and used to steer the privacy enhancement. In accordance with standard evaluation methodology [12], [25], the same classifier is also used to evaluate gender recognition performance with the enhanced images on each dataset. Similarly, a ResNet-50 face recognition model [56] is learned for the utility preservation experiments on images from the VGGFace2 dataset. Here, the output of the last fully connected layer of the learned model is utilized as the feature representation of the input face images. The computed representations are then matched with the cosine similarity measure in verification experiments. Fig. 7 shows the ROC curves of the experiments on the three experimental datasets before and after privacy enhancement. Note that confidence intervals are not included to keep the plots uncluttered. Instead, standard errors are reported for the (scalar) performance scores in Fig. 8. As can be seen, the two privacy models that aim to induce misclassifications, k-AAP and FGSM, result in close to ideal gender suppression rates (SR) of 1 on all datasets, except for LFW, where the Carlini-Wagner attack used with k-AAP is not successful on several test images using our optimization parameters. The suppression rates for the FlowSAN models (which aim to produce random gender classification probabilities for each input image), on the other hand, depend on the number of SAN models used in the sequence. FlowSAN-3, for example, generates lower SR scores than FlowSAN-5, but is, in turn, retaining more identity information, as evidenced by the lower identity losses (ILs) in Fig. 8. PrivacyNet improves on both FlowSAN models in terms of gender suppression as well as identity loss, and overall ensures the best privacy-utility trade-off among the evaluated synthesis-based privacy models. When comparing k-AAP and FGSM to the FlowSAN and PrivacyNet models, we observe that the former two models ensure higher overall PIC scores on most datasets. However, this is a consequence of the fact that the FlowSAN and PrivacyNet models do not simply invert classifier probabilities and, therefore, target a more challenging problem, which makes these models applicable to a wider range of application domains. Especially successful in terms of the privacy-utility trade-off is FGSM, which achieves PIC scores of close to 1 on all three datasets.
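The cosine-similarity matching used in the verification experiments can be sketched as follows; the `verify` helper and its threshold are illustrative placeholders, since in practice the decision threshold is set on validation data:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(feat_a, feat_b, threshold=0.5):
    """Decide whether two (deep) face embeddings, e.g. from the last fully
    connected layer of a ResNet-50, belong to the same identity."""
    return bool(cosine_similarity(feat_a, feat_b) >= threshold)
```

Sweeping the threshold over the similarity scores of genuine and impostor pairs produces the ROC curves reported in Fig. 7.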
While most of the findings discussed above hold for all considered datasets, there are slight differences with the Adience dataset. Here, the verification as well as the gender recognition performance with the original images is lower compared to the other two datasets. This is due to the characteristics of the dataset, which features real-world images with extreme appearance variations that differ significantly from those present in LFW and MUCT. As a result, even minor (additional) image degradations lead to significant performance drops, which is reflected in the relatively larger IL scores for all methods on this dataset.
2) Generalization to unseen classifiers. The results discussed above were generated with the same gender classifier, ξ_g, that was also used for privacy enhancement on each dataset. To evaluate the generalization ability of the privacy models to unseen classification models, a ResNet-50 gender classifier, ξ_u_g, is trained on LFW in the next series of experiments and deployed on the remaining two datasets, i.e., MUCT and Adience. Thus, the model used for scoring gender recognition accuracy differs in topology and training data from the model utilized for privacy enhancement.

Fig. 8: Scalar performance indicators for the privacy vs. utility trade-off of the privacy models. Results are reported for three datasets in the form of mean PIC (higher is better), SR (higher is better) and IL (lower is better) scores with corresponding standard errors computed over the 4 test splits.

Overall, the results of these experiments suggest that all tested privacy models offer a certain level of robustness to unseen classifiers, but the relative drop in performance differs significantly from model to model. The FlowSAN models, which exploit several gender classifiers for (soft-biometric) privacy enhancement, appear to be more robust to changes in the classification model used and may be preferred over k-AAP and FGSM if no assumptions can be made regarding the target classifier used with the final application. Even more impressive robustness was observed for the PrivacyNet model, which was trained in an adversarial setting and found to be the most robust w.r.t. unseen classifiers in the vanilla evaluation scenario.

Feature Distribution Exploration
Next, we investigate the effect of privacy enhancement on the distribution of the features generated by the last fully connected layers of (i) the ResNet-50 face recognition model and (ii) the VGG16 gender classifier. Here, the same setup is used as with the baseline experiments discussed above. The goal of this series of experiments is to better understand what is happening at the representation level as a result of the privacy enhancement and to gain additional insight into the characteristics of the privacy models.

TABLE 2: Generalization ability of the privacy models to an unseen gender classifier ξ_u_g. The reported results are computed with a gender classifier that differs from the one used for privacy enhancement in terms of training data and model topology. Results are reported (in terms of µ ± σ computed over 4 test data splits) for gender suppression (SR), identity loss (IL) and the combined PIC score. The colored numbers show the relative change ∆ of the mean score compared to the baseline performance; the arrows and colors indicate whether the score increased (up, blue) or decreased (down, red).

To study the ResNet-50 (identity) features, images corresponding to the 10 largest classes of LFW are selected and t-distributed Stochastic Neighbor Embedding (t-SNE) is used to visualize the feature distributions (in 2D) before and after privacy enhancement. For the gender features, 700 images per gender are randomly sampled from LFW. The t-SNE plots in Fig. 9 show that both genders (top row) as well as subjects/identities (bottom row) are well separated with the original images. After privacy enhancement with k-AAP and FGSM, most of the separation between subjects is preserved. The gender distributions, on the other hand, are less clustered and now exhibit a multimodal distribution. Nevertheless, the overlap between the male and female data points is still limited, indicating that the two classes can still be distinguished using a suitable classification model. These observations support the results from the previous section, where only small IL scores were observed with these two techniques, while the ROC curves from the gender-recognition experiments showed good separation, but with inverted labels.
When looking at the FlowSAN models, a different behavior can be observed. Here, the identity features are not well separated, but many data points exhibit (reasonably) correct pair-wise similarities. The gender distributions, on the other hand, overlap significantly, with a higher overlap for the FlowSAN-5 model, which is expected given the objective of the privacy enhancement. PrivacyNet appears to combine the best of both worlds and ensures reasonable preservation of the identity clusters, while leading to considerable overlap in the feature distributions w.r.t. gender. The observed distributions also point to a shortcoming of the adversarial techniques k-AAP and FGSM, where the (gender) label swap could easily be identified through manual inspection of a few sample images. This is not the case with the FlowSAN and PrivacyNet models.

Qualitative assessment
The impact of the privacy models on the visual appearance of a couple of sample images from the LFW dataset is shown in Fig. 10. As can be seen, k-AAP and FGSM generate privacy-enhanced images that are almost identical to the original images, the FlowSAN models introduce bigger and visually noticeable changes, whereas PrivacyNet-processed images are somewhere in between. The bottom row in each of the two examples presents a visualization of the changes introduced by the privacy models. k-AAP and FGSM add a relatively uniformly distributed noise pattern to the input images, while the FlowSAN and PrivacyNet models introduce a structured pattern focused predominantly on the facial area and less so on the background. This observation can be attributed to the design of the privacy models: FlowSAN and PrivacyNet are designed specifically for facial images, while the other two models are applicable to arbitrary images and classification problems and, hence, do not explicitly target specific visual categories, i.e., faces. From a qualitative perspective, the adversarial models have an edge over the FlowSAN and PrivacyNet models, but as suggested in the previous sections, this edge comes at the expense of robustness to unseen classification models and stems from the simpler privacy mechanism that aims to induce an incorrect prediction with a chosen classifier.

Robustness to Recovery Attempts
The main contribution of this study is a rigorous evaluation of existing (soft-biometric) privacy models with respect to their performance beyond zero-effort recognition experiments. The next series of experiments, therefore, explores the robustness of the evaluated models against attribute recovery attempts facilitated by our PrivacyProber.

One-Stage and Two-Stage Attribute Recovery
Multiple versions of PrivacyProber are implemented for the robustness experiments using either one-stage or two-stage attribute recovery. The simpler one-stage implementations consist of a single (generative or domain-specific) transformation, whereas the more complex two-stage implementations use sequential combinations of generative and domain-specific operations. For the one-stage implementations, inpainting, image reconstruction, denoising and background removal are considered as stand-alone recovery options, whereas multiple different combinations of image transformations are incorporated into the two-stage implementations. A high-level overview of the considered PrivacyProber variants is presented in Table 3. Note that these combinations are not exhaustive with respect to all possible attribute recovery options discussed in Section 3.
However, they do provide a representative cross-section of the existing options for the experimental evaluation.
For the implementation of the recovery techniques, state-of-the-art backbone models are selected. Specifically, the GMCNN model [57] trained on CelebA-HQ [58] is utilized for the proposed inpainting scheme, the auto-encoder from [49] trained on a selection of real-world images of objects is selected for the image reconstruction procedure, the WDnCNN model [59] trained on the Waterloo Exploration Database [60], the Berkeley segmentation dataset [61] and part of ImageNet [62] is chosen for the denoising operation, and the DeepLabv3 face parser [63] trained on CelebAMask-HQ [64] is selected for the background removal process. The backbone techniques ensure state-of-the-art performance for each task and come with publicly available implementations and pretrained weights. It is also important to note that the data used to train the backbone techniques does not overlap (in terms of images or subjects) with any of the test datasets used.
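A two-stage PrivacyProber variant is simply a sequential composition of image-to-image recovery operations. The sketch below illustrates this with hypothetical stand-in functions; a real implementation would wrap the WDnCNN, GMCNN, and related backbones named above:

```python
def compose(*stages):
    """Build a PrivacyProber variant as a sequential composition of
    image-to-image recovery operations (e.g., denoise -> inpaint)."""
    def probe(image):
        for stage in stages:
            image = stage(image)
        return image
    return probe

# Hypothetical stand-ins that record which operation ran; real stages
# would return processed image tensors instead.
denoise = lambda img: img + ["denoise"]
inpaint = lambda img: img + ["inpaint"]

pp_di = compose(denoise, inpaint)  # two-stage variant in the spirit of PP-DI
```

The same `compose` helper covers both the one-stage variants (a single stage) and the two-stage variants (two stages applied in order).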
The impact of different PrivacyProbers on the visual appearance of a sample face image is presented in Fig. 11. Here, the numbers below the images correspond to probabilities that the subject in the image is male, and the color-coded frames indicate whether a gender classifier with a decision threshold at 0.5 correctly determines the subject's gender (green) or not (red). As can be seen, all tested privacy models significantly reduce the correct-gender-prediction probability compared to the original image, whereas the majority of PrivacyProbers successfully recover a considerable amount of gender information regardless of the privacy model used. The one-stage implementations based on inpainting (PP-I), image reconstruction (PP-A) and denoising (PP-D) also retain all of the visible semantic content. In terms of image quality, the inpainting scheme results in a slight loss of image contrast and the use of the auto-encoder-based implementation leads to somewhat blurrier outputs. The denoising-based implementation, on the other hand, additionally improves on the perceived quality of the facial images. Background subtraction is the only one-stage approach that removes part of the information and drastically changes the image content. Such images may be of limited use for certain applications, but can still facilitate automatic processing and analysis. The two-stage versions of PrivacyProber in general inherit the properties of the single-stage components and, depending on the operations used, may again retain all of the semantic content, e.g., PP-DI, PP-DA, or remove part of it due to background subtraction, e.g., PP-DB.

Robustness Analysis
To assess the robustness of the privacy models to attribute recovery attempts, privacy-enhanced images, I_pr, from all three experimental datasets are first subjected to the implemented PrivacyProbers and then analyzed through verification and gender-recognition experiments. For the evaluation, the same gender classifier that steered the privacy enhancement on each dataset is again used to score gender-recognition performance. Note that the privacy models exhibited the strongest performance against this classifier (see Figs. 7 and 8), which makes attribute recovery particularly challenging. For the verification experiments, the ResNet-50 face recognition model is utilized.

1) ARR analysis. Table 4 presents the attribute recovery robustness (ARR) scores (from Eq. (11)) generated from the attribute-recovered images. For each privacy model (and for each dataset), ARR scores for the PrivacyProber that resulted in the lowest robustness are colored red and the scores that correspond to the highest robustness are colored blue. Several interesting observations can be made from the presented results: (i) First, considerably more information can be recovered from the adversarial techniques, k-AAP and FGSM, on average, than from the FlowSAN and PrivacyNet models. For these techniques, at least one of the PrivacyProbers results in an ARR score in the 0.1 range, which suggests that a significant amount of the gender information can be restored. Among the FlowSAN models, FlowSAN-3 is less robust to attribute recovery attempts than FlowSAN-5, but still significantly more so than k-AAP or FGSM. As expected, the most robust among all models is FlowSAN-5, where the weakest ARR scores across all PrivacyProbers are between 0.4 (on LFW) and 0.55 (on Adience). PrivacyNet is somewhere in between FlowSAN-3 and FlowSAN-5, with the weakest ARR score on LFW being 0.399 and on Adience 0.562. (ii) Second, the robustness of the privacy models varies from dataset to dataset. While the results on LFW and MUCT are relatively consistent for most privacy models, the weakest ARR scores on Adience are higher than the weakest scores on LFW and MUCT. This observation implies that attribute recovery robustness is partly affected by the initial data characteristics and not only by the recovery capabilities of the tested PrivacyProbers. Because the gender recognition performance was already weaker on Adience than on the remaining two datasets, even small degradations of the initial images, I_or, cause further performance drops. This fact is then also reflected in the success of the attribute recovery attempts. An exception to these observations is PrivacyNet, which performs well on LFW and Adience, but exhibits very low robustness on the MUCT dataset with several variants of PrivacyProber. The reason for this behavior lies in the characteristics of the privacy model, which requires images aligned in a specific way. Since the faces in the MUCT dataset are aligned differently than those in LFW and Adience, the privacy enhancement with this data is less stable and, consequently, more susceptible to attribute-recovery attempts. (iii) Third, all evaluated PrivacyProbers are able to recover some level of gender information from the privacy-enhanced images, suggesting that the performance in zero-effort evaluation scenarios typically reported in the literature often overestimates the actual capabilities of the existing privacy models.
2) Attribute recovery vs. verification performance. The implemented PrivacyProbers aim to reconstruct attribute information and, as a result, alter the characteristics of the facial images. To analyze the impact of attribute recovery on the trade-off between identity preservation and attribute recovery robustness, we plot the difference in AUC scores between the original and the attribute-recovered images for both verification and gender recognition experiments in Fig. 12. Points indicating that all the identity and gender information contained in the original images was recovered are located at the origin of the plots. These points correspond to the least robust privacy models. Different from the scalar ARR scores analyzed above, the presented plots offer more insight into the behavior of the tested models, but also into the characteristics of the attribute recovery attempts.
As can be seen from Fig. 12, the information content can be restored to close to the same level as before the application of k-AAP and FGSM on LFW and MUCT, given the most effective PrivacyProbers. On these two datasets, FlowSAN-3-recovered images perform similarly to the original ones in terms of verification performance, but still offer a certain level of soft-biometric privacy. This result points to the robustness of the privacy model, but also shows that gender information can be recovered without affecting identity cues. FlowSAN-5 exhibits the greatest level of robustness to recovery attempts, but at the expense of the largest loss of identity information among all models. PrivacyNet preserves identity information very well on LFW and MUCT, but (across these two datasets) offers robustness to recovery attempts only on LFW. On MUCT, on the other hand, a significant amount of soft-biometric information can still be recovered. On Adience, similar observations as above can again be made for FGSM, that is, FGSM is again the least robust of all tested privacy models. k-AAP, FlowSAN-3, FlowSAN-5 and PrivacyNet exhibit comparably larger ∆AUC_g scores, but for most methods (except for PrivacyNet) these appear to be a consequence of a general loss of useful visual information, as indicated by the drop in the verification performance compared to the original images. Among the PrivacyProbers, the most effective (on average) for the robustness analysis, with respect to both gender and identity information, is PP-DI. The combination of image denoising and context-based inpainting appears to be highly effective in reconstructing attribute information and for evaluating the robustness of soft-biometric privacy models, though other types of attribute recovery may offer additional insights.

TABLE 5: PrivacyProber as an adversarial defense mechanism. While PrivacyProber is designed as an attribute recovery approach for soft-biometric privacy models, it also ensures competitive performance as an adversarial defense.

Attack model        | AUC after attack | AUC after defense: AE recovery [49] | PP-D recovery (ours) | PP-DI recovery (ours)
FGSM [37]           | 0.005 ± 0.000    | 0.983 ± 0.008                       | 0.957 ± 0.017        | 0.976 ± 0.012
Carlini-Wagner [36] | 0.218 ± 0.059    | 0.962 ± 0.017                       | 0.848 ± 0.062        | 0.906 ± 0.040
AdvDrop [65]        | 0.299 ± 0.040    | 0.600 ± 0.017                       | 0.659 ± 0.023        | 0.665 ± 0.022
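All of the scalar scores in Table 5 (and throughout the evaluation) are ROC AUC values. For reference, AUC can be computed directly from classifier scores via the rank-sum (Mann-Whitney U) statistic, as in this small numpy sketch:

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """ROC AUC via the rank-sum statistic: the probability that a randomly
    drawn positive sample scores higher than a randomly drawn negative one."""
    pos = np.asarray(scores_pos, dtype=float)
    neg = np.asarray(scores_neg, dtype=float)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))
```

An AUC of 1 corresponds to perfect separation, 0.5 to random scoring, and values near 0 (as for FGSM after the attack) to systematically inverted predictions.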

Attribute Recovery and Feature Distributions
Attribute recovery attempts change the visual characteristics of facial images, as illustrated in Fig. 10. Because these changes also affect the feature representations generated by the recognition models, we next analyze the feature distributions generated from the attribute-recovered images. The presented plots support the observations made in the previous section. Attribute recovery (with PP-D and PP-DI) contributes towards well-separated gender classes in the feature space for k-AAP and FGSM, with minimal impact on the separability of the identity classes. For FlowSAN-3 and FlowSAN-5, we see less gender separation due to the higher robustness of these privacy models, and again minimal impact on identity information. A similar observation can also be made for PrivacyNet, where the gender overlap is reduced compared to the original privacy-enhanced images, while the identity separation is not degraded.

PrivacyProber and Adversarial Attacks
PrivacyProber is designed to recover attribute information when this information is protected by soft-biometric privacy models. While such models often also incorporate adversarial perturbations, they are typically implemented with mechanisms that go beyond adversarial methods for two reasons. Firstly, they strive towards a dual goal, i.e., (i) to conceal information by fooling a classifier and (ii) to preserve the utility of the data (e.g., verification accuracy, image quality, etc.). Typical adversarial examples only try to achieve the first goal. Secondly, besides relying on adversarial noise, methods for soft-biometric privacy enhancement also often include an image synthesis step, whereas classical adversarial methods do not.
Nevertheless, because adversarial methods and soft-biometric privacy models share some commonalities, we explore in this section how PrivacyProber fares as a tool for adversarial defense. For the evaluation, we implement three adversarial attack methods, AdvDrop [65], Carlini-Wagner [36] and FGSM [37], and test them on test images from LFW with the two best-performing PrivacyProber variants, PP-D and PP-DI. Additionally, we compare PrivacyProber against a recent baseline adversarial defense mechanism, i.e., the auto-encoder (AE) defense from [49]. The adversarial attacks are designed to induce gender misclassifications, and performance is measured in terms of the AUC of the ROC curves generated in gender recognition experiments.
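To make the attack side concrete, the sketch below applies the Fast Gradient Sign Method to a toy logistic "gender" classifier; the model and parameter names are illustrative and much simpler than the deep classifiers attacked in the experiments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps=0.05):
    """One-step FGSM on a toy logistic classifier.

    x: input features, y: true label (0 or 1), (w, b): model parameters.
    Shifts x by eps along the sign of the cross-entropy loss gradient,
    which pushes the prediction away from the true label.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # d(cross-entropy)/dx for logistic regression
    return x + eps * np.sign(grad_x)
```

Against an image classifier the same principle applies per pixel, with the gradient obtained by backpropagation through the network.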
We observe from the results in Table 5 that PrivacyProber: (i) performs comparably to the state-of-the-art defense with the FGSM attack, (ii) is slightly behind, but still competitive with the Carlini-Wagner attack, and (iii) yields somewhat better performance with the most recent AdvDrop attack. The presented results suggest that despite the fact that PrivacyProber was primarily designed for robust evaluation of soft-biometric privacy models, it can also be used as a competitive adversarial defense.

TABLE 6: AUC scores (µ ± σ) for privacy-enhancement detection experiments. The learning-free (black-box) APEND approach is compared against the classification-based T-SVM techniques for adversarial attack detection from [46] as well as the super-resolution based PREM detector from [30].

Detecting (Soft-Biometric) Privacy Enhancement
In this section, we show that PrivacyProber can also be used to efficiently detect image tampering with soft-biometric privacy models. To this end, we implement the proposed APEND detector using three diverse PrivacyProber variants, i.e., PP-A, PP-DI and PP-B. We note that other combinations of PrivacyProber could be used for the experiments with different performance. However, our goal here is to provide a proof-of-concept for the detector and illustrate the benefits of attribute recovery, not to optimize performance indicators.

Quantitative Evaluation
We note again that the problem of detecting privacy enhancement in facial images has, to the best of our knowledge, not yet been studied widely in the open literature. Only the PREM approach from [30] aims to address the same task and is, therefore, also included in the evaluation. However, because soft-biometric privacy-enhancement techniques share some characteristics with adversarial attacks, we also select the recent state-of-the-art transformation-based detection technique (denoted T-SVM hereafter) from [46] as a baseline for our experiments and compare it to APEND. T-SVM combines features from the discrete wavelet (DWT) and discrete sine (DST) transforms with a support vector classifier (SVM) for tampering detection and requires training data to be able to learn how to discriminate between original and tampered images. We, therefore, consider the following setting for the experimental evaluation to ensure a fair comparison:
• APEND and PREM: These methods require no examples of tampered images or knowledge of the mechanism used for privacy enhancement and are, therefore, tested in a purely black-box scenario.
• T-SVM: Because T-SVM requires training data, two configurations are considered, both trained only on LFW. The second configuration, T-SVM (B), is trained for detecting k-AAP-based enhancement. The detection models from both configurations are then tested on all datasets and with all privacy models, thus simulating a black-box evaluation scenario.
Table 6 shows that the supervised approach, T-SVM, is the most competitive on the dataset and privacy model it was trained on, while the performance with unseen models (and in most cases also with unseen datasets) quickly deteriorates. The APEND and PREM techniques, on the other hand, are training-free and, therefore, generalize better to unseen models and across different datasets. While both PREM and APEND are quite competitive in most experiments, the aggregation of reconstruction evidence integrated in APEND leads to the best overall average performance of 0.940 in terms of AUC, compared to 0.926 for PREM. The added robustness through the aggregation process is especially evident in specific cases, such as, for example, with the k-AAP technique on the Adience dataset, where APEND outperforms PREM by more than 23% in terms of AUC. The presented results clearly show the added value of training-free tampering detection as well as of the aggregation-based approach to privacy-enhancement detection, which leads to highly encouraging results in the presented experiments.

Visual Analysis
Fig. 14 shows a few example face images for which the proposed APEND detector incorrectly flagged the presence of image tampering (i.e., privacy enhancement) at a decision threshold that ensures equal error rates (EER) in the (two-class) gender recognition experiments. In the top row, we show images where APEND failed to detect privacy-enhanced images. In most cases, these images contain strong image artifacts that make it challenging to properly recover attribute information using the PrivacyProber variant underlying the implementation of APEND. This leads to minute differences in the gender predictions between the privacy-enhanced and recovered images and eventually to misdetections. In the bottom row, we show example images that have been flagged by APEND as tampered, but in fact represent original images. As can be seen, such images are often of poor quality (due to blur, noise, etc.) and get improved by the recovery approaches in APEND. As a result, the predictions of the gender classifier change sufficiently to flag the images as tampered. While the performance of APEND is highly competitive (and close to ideal on most datasets), the presented examples suggest that there is room for further improvement, for example, by combining APEND with supervised detection techniques that should also perform well with poor-quality images.

CONCLUSION
In this paper, we investigated the reliability of soft-biometric privacy-enhancing techniques and explored their robustness to attribute recovery attempts. We introduced PrivacyProber, a framework for the recovery of suppressed soft-biometric information from facial images, and used it in a comprehensive experimental evaluation. Experimental results on the LFW, MUCT and Adience datasets showed that there are considerable differences between the tested privacy models, both in terms of the visual impact on the privacy-enhanced images and in terms of the level of privacy ensured. Additionally, we observed that, using our framework, it was possible to recover a considerable amount of suppressed (concealed) attribute information regardless of the privacy model used. However, the robustness of the tested synthesis-based techniques (i.e., FlowSANs and PrivacyNet) was observed to be considerably higher than that of the evaluated adversarial approaches (i.e., k-AAP and FGSM). These findings have considerable implications for future research in the area of biometric privacy enhancement, where more work is needed to improve the robustness of existing models.
As another contribution, we showed that the proposed attribute recovery framework can also be used to detect privacy enhancement (i.e., tampering) in facial images. We proposed the APEND detector and demonstrated that it can detect privacy enhancement with high accuracy, even when different privacy models are utilized and data with diverse characteristics is used. This fact points to another threat vector with respect to biometric privacy models, in that privacy-enhanced images can easily be identified and subjected to alternative means of processing that are less sensitive to the artifacts and perturbations introduced by the enhancement.
As part of our future work, we plan to extend our robustness analysis to other biometric privacy models, such as de-identification techniques, which are based on different assumptions and require conceptually different models to restore the obscured information. The need for such robust evaluation schemes was also identified as a key issue in recent privacy surveys, e.g., in [13].

Fig. 2: High-level illustration of PrivacyProber. Generative and domain-specific transformations are used (separately or in sequence) to recover attribute information concealed/suppressed by soft-biometric privacy models.

Fig. 3: Illustration of the inpainting-based attribute recovery procedure, χ_in, proposed for PrivacyProber. The privacy-enhanced image, I_pr, is masked N times with the binary masks B_i, shown in the top row. The masked regions are then inpainted based on the remaining context. Finally, the attribute-recovered image I_re is reconstructed from the inpainted regions only.
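The mask-and-inpaint loop described in the caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the `inpaint_mean` helper (a neighborhood-free context-mean fill) is a hypothetical stand-in for the learned generative inpainting model.

```python
import numpy as np

def inpaint_mean(image, mask):
    """Stand-in for a learned inpainter: fill the masked pixels with
    the mean of the unmasked context (the paper uses a generative
    inpainting model instead)."""
    filled = image.copy()
    filled[mask] = image[~mask].mean()
    return filled

def recover_by_inpainting(i_pr, masks):
    """Mask the privacy-enhanced image N times with binary masks B_i,
    inpaint each masked region from the remaining context, and
    reassemble the recovered image from the inpainted regions only."""
    i_re = i_pr.copy()
    for b in masks:                 # b: boolean mask, True = region to inpaint
        filled = inpaint_mean(i_pr, b)
        i_re[b] = filled[b]         # keep only the newly inpainted pixels
    return i_re

# toy example: 8x8 grayscale image, two complementary stripe masks
img = np.linspace(0.0, 1.0, 64).reshape(8, 8)
m1 = np.zeros((8, 8), dtype=bool)
m1[:, :4] = True
m2 = ~m1
rec = recover_by_inpainting(img, [m1, m2])
print(rec.shape)  # (8, 8)
```

With complementary masks, every pixel of the recovered image stems from an inpainting pass, which is the property the caption describes.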

Fig. 4: Illustration of the attribute recovery procedure using an auto-encoder (χ_a) for image reconstruction, proposed for PrivacyProber. The auto-encoder maps the privacy-enhanced image I_pr to the output image I_re, such that high-frequency components, e.g., adversarial noise, are removed.
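The effect of χ_a can be illustrated with a simple low-pass filter: a box filter here is a crude stand-in for the trained auto-encoder, chosen only to show how removing high-frequency content counteracts adversarial-style perturbations.

```python
import numpy as np

def lowpass_recover(i_pr, k=3):
    """Stand-in for the auto-encoder chi_a: suppress high-frequency
    content with a k x k box filter (the paper trains a convolutional
    auto-encoder for this mapping instead)."""
    pad = k // 2
    padded = np.pad(i_pr, pad, mode="edge")
    out = np.zeros_like(i_pr, dtype=float)
    h, w = i_pr.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = padded[y:y + k, x:x + k].mean()
    return out

# adversarial-style perturbation: clean image plus high-frequency noise
rng = np.random.default_rng(0)
clean = np.ones((16, 16)) * 0.5
noisy = clean + 0.1 * rng.choice([-1.0, 1.0], size=clean.shape)
rec = lowpass_recover(noisy)

# the recovered image is closer to the clean one than the noisy input
print(np.abs(rec - clean).mean() < np.abs(noisy - clean).mean())
```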

Fig. 5: Illustration of the domain-specific attribute recovery procedure, χ_br, proposed for PrivacyProber. The privacy-enhanced image, I_pr, is masked with a binary mask B that corresponds to the facial region. Because the background is partially affected by the privacy enhancement, focusing only on the facial area impacts the behavior of attribute classification models.

Fig. 6: High-level overview of the proposed evidence Aggregation approach for Privacy-Enhancement Detection (APEND). The main idea behind the (learning-free) detection approach is to compare the posterior distributions before and after processing with different versions of PrivacyProber (PP) and to exploit the (aggregated) differences in the generated distributions for privacy-enhancement detection.
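The aggregation idea can be sketched in a few lines. All names here are illustrative: the one-statistic `classifier` and the two toy `probers` stand in for the gender classifier and the PrivacyProber variants; only the compare-and-aggregate logic mirrors APEND.

```python
import numpy as np

def apend_score(i_pr, probers, classifier):
    """Aggregate the posterior shifts induced by different PrivacyProber
    variants into a single tampering score (illustrative sketch of the
    APEND idea, not the exact implementation)."""
    p_before = classifier(i_pr)
    shifts = [abs(classifier(pp(i_pr)) - p_before) for pp in probers]
    return float(np.mean(shifts))   # large shift -> likely privacy-enhanced

# toy stand-ins: a "gender classifier" reading a single image statistic
classifier = lambda img: float(img.mean())
# two hypothetical recovery transforms (rescaling, offsetting)
probers = [lambda img: img * 0.9, lambda img: img + 0.05]

original = np.full((4, 4), 0.5)
enhanced = original + 0.3           # crude stand-in for privacy enhancement

s_orig = apend_score(original, probers, classifier)
s_enh = apend_score(enhanced, probers, classifier)
print(s_orig, s_enh)                # the enhanced image scores higher
```

Thresholding the aggregated score then yields the tampered/original decision described in the caption.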

Fig. 9: t-SNE plots (in 2D) of gender and identity features. The gender plots in the top row are generated from 700 randomly sampled LFW images of each gender (f – female, m – male). The subject-conditioned distributions in the bottom row are generated based on 10 randomly selected images of the 10 largest classes from LFW (marked (a)–(j)). By comparing the distributions of the original and processed images, the impact of the privacy enhancement can be observed.

Fig. 10: Illustration of the visual effect of the privacy models on two sample images from LFW. The top row next to each original image shows the privacy-enhanced images (with gender privacy), whereas the bottom row depicts the difference between the original and the modified images. The difference was scaled for visualization purposes (by 10× for k-AAP, 40× for FGSM and 1× for the FlowSAN models) and then normalized to the valid image range. Note that k-AAP and FGSM result in imperceptible appearance changes, while the FlowSAN models introduce visible changes.
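The scaled-difference visualization amounts to amplifying the residual and renormalizing it to the displayable range; a minimal sketch (the perturbation below is synthetic, only the scale-and-normalize step follows the caption):

```python
import numpy as np

def diff_visualization(original, enhanced, scale):
    """Scale the residual between the original and privacy-enhanced
    image (e.g., 10x for k-AAP, 40x for FGSM) and normalize it back
    to the valid image range [0, 1] for display."""
    diff = (original - enhanced) * scale
    lo, hi = diff.min(), diff.max()
    return (diff - lo) / (hi - lo) if hi > lo else np.zeros_like(diff)

orig = np.linspace(0, 1, 16).reshape(4, 4)
# imperceptible +-0.01 perturbation as a stand-in for an adversarial model
enh = orig + 0.01 * np.sign(np.cos(np.arange(16)).reshape(4, 4))
vis = diff_visualization(orig, enh, scale=10.0)
print(vis.min(), vis.max())  # 0.0 1.0
```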

Fig. 11: Examples of PrivacyProber reconstructions (columns: FGSM, FlowSAN-3, FlowSAN-5, PrivacyNet), where the goal is to recover gender information from privacy-enhanced images. The number below each image corresponds to the probability output of a gender classifier: probabilities between 0 and 0.5 correspond to female and between 0.5 and 1 to male subjects. Note that the privacy-enhanced images, I_pr, generate incorrect gender probabilities (indicated by the red frames) or significantly reduce the correct-gender prediction probability. PrivacyProber, on the other hand, recovers the correct gender information in most cases, as indicated by the green frames. Depending on the version of PrivacyProber, images of different visual quality are generated.

Fig. 12: Impact of the application of different PrivacyProbers on the verification and gender-recognition performance. The x-axis shows the difference in verification performance observed between the original and privacy-enhanced images that were subjected to the proposed PrivacyProber, i.e., ∆AUC_v = AUC_vo − AUC_vr. The y-axis shows the analogous difference for gender recognition, i.e., ∆AUC_g = AUC_go − AUC_gr. Results are presented across all tested PrivacyProber variants and experimental datasets. The figure is best viewed in color.

Fig. 13: t-SNE plots (in 2D) for gender and identity features. Results are presented for the original face images (far left) and images processed with the best performing one-stage and two-stage PrivacyProbers, i.e., PP-D and PP-DI. The gender plots in the top row are generated from 700 randomly sampled LFW images of each gender, and the subject plots in the bottom row are generated based on 10 randomly selected images of the 10 largest LFW classes. Best viewed in color.

Fig. 13 compares the t-SNE-based distributions produced by the face and gender-recognition models. Here, the same setup (involving LFW) as in Section 4.3.2 is utilized to generate the feature vectors for the plots. To keep the analysis concise, only the best performing one- and two-stage PrivacyProbers are considered, i.e., PP-D and PP-DI.

• T-SVM: Because T-SVM is supervised and requires examples of tampered images for training, we design two experimental configurations that test the generalization capabilities of this detector. In the first configuration, T-SVM (A) is trained to detect FlowSAN-5 enhanced images only on data from LFW.
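T-SVM's feature pipeline (transform-domain statistics fed to an SVM) can be approximated as follows. This sketch is an assumption-laden simplification: a hand-rolled one-level Haar DWT replaces the full DWT/DST feature set from [46], and a nearest-centroid rule stands in for the SVM so the example stays dependency-free.

```python
import numpy as np

def haar_dwt_level1(img):
    """One-level 2D Haar DWT: returns (LL, LH, HL, HH) sub-bands."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row average
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def tamper_features(img):
    """Energy statistics of the detail sub-bands -- a rough proxy for
    the DWT/DST feature vector used by T-SVM."""
    _, lh, hl, hh = haar_dwt_level1(img)
    return np.array([np.abs(b).mean() for b in (lh, hl, hh)])

# toy data: smooth "original" images vs. noise-perturbed "tampered" ones
rng = np.random.default_rng(1)
orig = [np.outer(np.linspace(0, 1, 16), np.linspace(0, 1, 16)) for _ in range(8)]
tamp = [o + 0.05 * rng.standard_normal(o.shape) for o in orig]

# a nearest-centroid rule stands in for the SVM of the original method
c_orig = np.mean([tamper_features(o) for o in orig], axis=0)
c_tamp = np.mean([tamper_features(t) for t in tamp], axis=0)
probe = orig[0] + 0.05 * rng.standard_normal((16, 16))
f = tamper_features(probe)
label = "tampered" if np.linalg.norm(f - c_tamp) < np.linalg.norm(f - c_orig) else "original"
print(label)
```

The key observation carried over from the original method is that tampering leaves a footprint in the high-frequency sub-bands, which a trained classifier can pick up.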

Input: Attribute classifier ξ_a, set of N input images {I_or}^N, privacy model ψ
Output: Performance (robustness) estimate for ψ
Implement PrivacyProber, χ, from {χ_in, χ_d, χ_a, χ_br};
for each image in {I_or}^N do
    Apply privacy model: I_pr = ψ(I_or);
    Attempt attribute recovery: I_re = χ(I_pr);
    Use ξ_a for classification over I_or and I_re;
end
Calculate performance of ξ_a over {I_or}^N and {I_re}^N;
Estimate robustness of ψ based on the results (see Section 4.2 for the scoring methodology);

In the above equations, AUC_vo and AUC_vp are the AUC scores of the verification experiments (v) generated with the original (o) and privacy-enhanced (p) images. PIC is bounded to [−1, 1], with a PIC score of 1 implying an ideal trade-off, i.e., no loss in verification performance and perfect attribute suppression. (c) Robustness: While attribute suppression and utility preservation are standard aspects of privacy models, typically evaluated in zero-effort evaluation scenarios, the robustness of the models to attribute recovery attempts has so far seen limited attention in the literature. Here, we use the proposed PrivacyProber to recover information suppressed by the studied privacy models. Identity verification and gender classification experiments are then conducted after attribute recovery (r), and the generated ROC curves are analyzed to assess robustness. A scalar robustness measure, the attribute-recovery robustness (ARR), is derived from the ROC curves in this paper and reported in the experiments, i.e.:

ARR = g(AUC_gp) · |AUC_go − AUC_gr| / AUC_go
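The ARR computation can be sketched directly from the formula. Note that the exact weighting function g(·) is not reproduced in this excerpt, so the default below (rewarding gender AUCs near chance level on the privacy-enhanced images) is a placeholder assumption, and the AUC values are illustrative toy numbers rather than measured results.

```python
def arr(auc_go, auc_gr, auc_gp, g=lambda a: 1.0 - abs(2 * a - 1)):
    """Attribute-recovery robustness (ARR):
        ARR = g(AUC_gp) * |AUC_go - AUC_gr| / AUC_go
    where AUC_go, AUC_gr, AUC_gp are the gender-classification AUCs on
    the original, recovered, and privacy-enhanced images, respectively.
    The weighting g(.) is a placeholder here, not the paper's exact form."""
    return g(auc_gp) * abs(auc_go - auc_gr) / auc_go

# toy scenario: original gender AUC 0.95, enhanced AUC 0.55 (near chance),
# recovered AUC 0.90 -> most of the suppressed signal was restored
score = arr(auc_go=0.95, auc_gr=0.90, auc_gp=0.55)
print(round(score, 4))
```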

TABLE 1: Overview of the main dataset characteristics and image splits used in the experiments. To ensure balanced experimental data and avoid bias in the generated results, the image splits are (approximately) gender balanced. The subjects in the training and testing parts are disjoint. Testing is performed over four disjoint data splits to be able to report confidence scores on the reported results. Totals are given over all 4 test data splits; #AvgIm/Subj – average number of images per subject; m – male, f – female.

The testing part is used for the verification experiments. To ensure a consistent evaluation setup across all datasets and to be able to report confidence scores for the experiments, the test images are partitioned into 4 experimental splits. Gender recognition experiments are performed for every test image, while a fixed number of mated and non-mated comparisons is performed for the verification experiments. Details on the experimental setup are provided in Table 1.

Table 2 provides a summary of the PIC, SR and IL scores generated for this experiment. Here, the colored numbers show the relative change (marked as ∆) in the computed scores when compared to the baseline performance from Fig. 8. As expected, all privacy models degrade in performance, both in terms of the PIC as well as the SR scores, compared to the baseline experiments. The relative drop is quite severe for the adversarial techniques. The PIC scores drop by 17.8% on Adience and by 32.8% on MUCT for k-AAP, and by 23.4% on MUCT and 28.8% on Adience for FGSM.

Fig. 7: ROC curves displaying the privacy vs. utility trade-off ensured by the evaluated privacy models. Results for different datasets are shown in rows and for different privacy models in columns. All evaluated models are in general able to retain a significant portion of the verification performance (red curves), while resulting in different gender suppression rates (green curves). Note again that k-AAP and FGSM aim to induce misclassifications (i.e., invert classifier predictions for binary problems), whereas the FlowSAN models aim to produce random gender classification results. Best viewed in color. See Table 1 for the experimental setup.

TABLE 3: PrivacyProber variants implemented for the experimental evaluation. The models differ in terms of whether generative or domain-specific components are used and whether one or two processing steps are utilized for attribute recovery.