Simulation, Learning, and Application of Vision-Based Tactile Sensing at Large Scale

Large-scale robotic skin with tactile sensing ability is emerging with the potential for use in close-contact human–robot systems. Although recent developments in vision-based tactile sensing and related learning methods are promising, they have been mostly designed for small-scale use, such as by fingers and hands, in manipulation tasks. Moreover, learning perception for such tactile devices demands a huge tactile dataset, which complicates the data collection process. To address this, this study introduces a multiphysics simulation pipeline, called SimTacLS, which considers not only the mechanical properties of external physical contact but also the realistic rendering of tactile images in a simulation environment. The system utilizes the obtained simulation dataset, including virtual images and skin deformation, to train a tactile deep neural network to extract high-level tactile information. Moreover, we adopt a generative network to minimize sim2real inaccuracy, preserving the simulation-based tactile sensing performance. Last but not least, we showcase this sim2real sensing method for our large-scale tactile sensor (TacLink) by demonstrating its use in two trial cases, namely, whole-arm nonprehensile manipulation and intuitive motion guidance, using a custom-built tactile robot arm integrated with TacLink. This article opens new possibilities in the learning of transferable tactile-driven robotics tasks from virtual worlds to actual scenarios without compromising accuracy.


I. INTRODUCTION
The sense of touch not only provides a diverse range of information from interactions involving physical contact, such as interactive force, texture, and temperature, but is also considered to be a means of communication in human-human or human-machine interactions. Skin, the largest organ of the human body, covering the limbs and torso, possesses a tactile sensing system that has been inspiring the robotics community toward the creation of fully autonomous social and task-based machines with the sense of touch [1], [2]. For years, research on this topic, especially on large-scale skins that mimic human skin based on various transducing principles, has been intensively pursued [3], [4], [5]. Nonetheless, the design of such tactile sensors faces complexity in system integration and data processing, since increasing the scale requires a large number of embedded sensing elements. Recently, vision-based tactile (ViTac) sensors have emerged as an effective method for the implementation of tactile sensing with a simple design [6], [7], [8]. In detail, the deformation of soft artificial skins upon physical contact with an object is detected through the optical tracking of visual features, such as markers or reflective membranes, which is then translated into tactile information, including contact location, force, vibration, and object texture. ViTac sensors have been found useful in small-scale manipulation tasks using robotic hands/fingers [9], [10]; however, their potential uses in large-scale whole-arm applications have not been comprehensively investigated.
We previously demonstrated marker-based ViTac sensing (TacLink) with the potential to deliver rich contact information from tactile images based on image processing techniques [11]. In addition, we leveraged a supervised learning method with a high sampling rate for the same setup [12], [13]. The former method, through thorough model analysis and calibration, can yield high sensing performance; however, its complexity in modeling and processing means that it is not widely preferred. On the other hand, data-driven methods such as the latter need a huge amount of data to categorize visual representations, which requires a burdensome experimental data acquisition process [12]. This problem would be magnified in applications with large-scale skin and more complex contact scenarios. As a result, there is an emerging necessity for a pipeline that allows simulation-based learning and accommodates the physics of interactive contact in ViTac sensing systems. While visual effects have been reflected successfully in several simulators, such as Gazebo or Unity, interactions between the sensor skin and its external environment are often modeled as rigid contacts [14], [15].
In this article, we propose a novel simulation pipeline toward a framework for a large-scale marker-based ViTac sensor [see Fig. 1(a)] that employs the physics engine SOFA to describe complex physical interactions of deformable bodies based on the finite-element method (FEM). The geometry of the skin and the markers (.STL format) is fed separately to the Gazebo environment, in which a built-in sensor plug-in is exploited to reproduce imaging patterns (i.e., markers) as virtual versions of the tactile images captured by cameras. The virtual images and ground-truth tactile feedback (such as forces and displacements distributed at each element node) recorded from simulation are then utilized as input and output (label) data to train a deep neural network, named TacNet. Moreover, in order to extend the effectiveness of the TacNet model to real-world tactile images, we propose a real-to-simulation generative network (R2S-GN) that uses a generative adversarial network (GAN) to automatically learn how to transform real observations of tactile images into simulated ones for evaluation by the TacNet model [see Fig. 1(a)]. Such a platform is envisaged as a standardized, easy-to-apply procedure for a wide range of robotic devices at various scales to acquire tactile perception [see Fig. 1(b)].

A. Simulation Frameworks for Vision-Based Tactile Sensors
In the literature to date, the two dominant groups of ViTac sensors rely on 1) the intensity of reflected light and 2) the relative positions of visual markers to deduce tactile information, represented by the GelSight [16] and TacTip [17] sensors, respectively. Recently, advanced simulation tools have been emerging, which reduce the burden of acquiring real data for learning frameworks, along with sim2real transfer approaches that facilitate simulated tactile perception. For instance, to capture depth maps of in-contact objects in GelSight images, Gomes et al. [18] used the robotics simulator Gazebo, which provides a simulated depth camera. Depth-based simulated images were then retouched with a Gaussian filter and Phong's rendering model to approximately recreate the complex inter-reflection of light sources upon the deformation of a membrane. Another sim2real technique applied to GelSight-like sensors was reported in [19], where tactile images were accurately generated with the help of physics-based rendering. Wang et al. [14] established TACTO, a promising open-source simulation layout harmonizing the physics engine PyBullet and the rendering engine Pyrender, where rigid contacts were employed for depth-map-based RGB tactile image rendering at hundreds of frames per second. The feasibility of this simulator was validated with two popular versions of finger-sized ViTac sensors: the hemisphere-shaped OmniTact [20] and DIGIT [21], which provided tactile images equivalent to simulated interaction from dynamic simulation. However, a limitation was that cases involving a large deformation of the skin upon ViTac contact with its environment were not rendered properly. This issue is crucial in complex contact scenarios and reduces the accuracy of the sim2real transfer process.
Regarding marker-based ViTac sensors that are more comparable to our work, a bottleneck occurs in emulating the deformation of the soft skin by the movement of multiple markers upon external stimulation, which requires a comprehensive modeling of contact mechanics and materials. Ding et al. [22] built and emulated the elastic behavior of TacTip skin using Unity physics engine to estimate pin locations. However, the elastic properties of the skin were linearly approximated using a custom linear elastic model. Church et al. [15] proposed Tactile Gym that produces the virtual images of physical contacts through depth imprints using a rigid contact model. Tactile Gym was validated against finger-sized TacTip sensors of either hemispherical or rectangular shapes, in which the gap between real and virtual images was mitigated using a trained generative framework. Alternatively, commercial FEM-based simulators (e.g., Abaqus [23], [24]) offer a systematic way to accomplish this challenge by dividing the soft body into many subelements, which are then dynamically analyzed with hyperelastic material models. However, extreme computational costs and inflexibility of the commercial FEM simulators restrict the effective application of these methods in real-time scenarios.

B. Simulation-to-Real Transfer by Adversarial Learning
Deep-neural-network-based vision systems learned from virtual/synthetic images typically perform poorly when evaluated on real visual inputs [25]. Differences between simulation and real images are inevitable, which might involve unrealistic texture, color, and lighting conditions in the simulated results. Several previous studies introduced randomization for such visual aspects in simulation environments to narrow the sim2real gap, which has been successfully applied to vision-based robotic applications [26] and tactile sensing devices [18], [22]. Although these techniques enable a trained model to be more robust when transferred to a real dataset domain, they often require the manual input of visual features to randomize and need to be refined for specific tasks in unique environments. To address this, recently, a concept of pixel-level domain-adaptation-based sim2real transfer has been applied to the simulation framework for a small-scale marker-based tactile sensor [15]. Here, the researchers employed an auxiliary GAN [27] to translate real marker-based tactile images into depth-based simulation ones, on which deep learning agents were trained to perform specific tasks. Despite having shown successes in transferring a handful of high-level tasks with tactile sensory feedback, this method employed real-to-simulation (real2sim) translation for virtual tactile imprints (depth-based images), while real ones featured white markers that encoded high deformation of artificial skin. This may result in an inaccurate reproduction of the entire deformed state of the artificial skin at one local imprint. In addition, the accuracy of this real2sim translator on the entire real image domain has not been thoroughly evaluated.
In this study, we propose a simulation framework (SimTacLS) for ViTac sensors that addresses the aforementioned shortcomings. The main contributions are pictured in Fig. 1 and highlighted as follows: 1) the proposal of a simulation framework built from two kernels, SOFA and Gazebo, with a justified pipeline that allows a detailed investigation of marker-based ViTac sensors of diverse shapes and sizes; 2) the deployment of a supervised-learning-based multioutput regression model (TacNet), which takes virtual tactile images under various contact scenarios provided by the above framework as input to perform 3-D skin-shape reconstruction with high spatial-temporal resolution; 3) the introduction of a real-to-simulation conversion approach (R2S-GN) to mitigate TacNet's incompatibility with the real tactile images encountered in actual sensor operation; and 4) the demonstration of the application of this sim2real learning approach to a large-scale tactile sensor (TacLink) in tactile-based tasks using a custom-built robot arm with TacLink as a forearm link.

A. Hardware Design
This section describes the detailed mechanics of SimTacLS. We chose our previously developed large-scale ViTac sensor for the robot link (called TacLink [12], [13]), which uses the displacement of visual cue markers to deduce tactile sensory information, to demonstrate the proposed framework. The structure of this system is shown in Fig. 2(a). The geometrical specifications of the skin on a barrel-shaped body are as follows: 260 mm high (or long, depending on orientation); 3.5 mm thick; 36-mm-diameter cross section at each end; and 53.5-mm-diameter cross section at the widest point (the center). The body contains 256 white markers of diameter φ_marker = 5 mm distributed on the inner wall of the soft skin. The distances from the outermost row of markers to the adjacent row and to the small end of the skin are 15 and 17 mm, respectively.

Fig. 2. (a) Structure of TacLink (see [12] and [13] for more details on the geometrical design, constituent components, and fabrication process). (b) Cylinder markers attached on the tactile skin are decomposed into two parts: marker bases and bodies. Each tactile skin element is imported to SOFA as a topological map of (c) tetrahedron elements for mechanical models and (d) triangular cells for visual models. Notice that while the high quality of the skin mesh still remains in this mode, the meshes for markers in the visual model are refined significantly.
We have two reasons for using TacLink as the showcase in this article: 1) the barrel-shaped sensor can be scaled to other parts of a robot body, such as arms, legs, and chest; and 2) this shape allows a rare setup of cameras (opposing), which is considered challenging since large deformation of the skin may prevent cameras from capturing images of all the markers (occlusion). Therefore, the solution of this setup can be applied to many other designs of marker-based ViTac sensors [see Fig. 1

B. SOFA Module: Skin Reconstruction and Modeling Strategy
Within the SOFA environment, a mechanical model of TacLink skin comprises two separate models: bare skin and markers. These are then consolidated to each other in the simulation environment [see Fig. 2(b) and (c)]. Since SOFA allows the simulation of multiple meshes (with different objectives), the following discretization strategy is obeyed: 1) meshes for studying mechanical behaviors and visualization (visual models) of the bare skin must be sufficiently fine (skin size element is 12 mm); meanwhile, the contrary (marker size element is 1.5 mm) is applied for markers to reduce computational costs. Then, the positions of each of the degrees of freedom (DoFs) in visual models are inherited from the mechanical models through a mapping function ξ m before exporting (.STL files) to the Gazebo module.
1) Corotational FEM Approach: The softness of the TacLink skin derives from the inherently nonlinear properties of soft materials, which always pose a critical challenge to mechanical modeling. One can tackle this issue with hyperelastic material models available in off-the-shelf simulation platforms [24], [28]. However, tremendous effort is required to accurately identify all the necessary parameters through either experimental or numerical processes. Moreover, since we aimed to implement the proposed framework in real-time robotic applications, a computationally efficient method was essential. In this work, the connectivity among the vertices of nonoverlapping tetrahedron elements obeys a linear constitutive relationship (Hooke's law) described by two parameters: Young's modulus E and Poisson's ratio ν. E was experimentally identified as 0.1 N/mm² and ν was set at 0.49 [29]. To prevent unrealistic simulation results due to this linear assumption, especially with large deformations (not only large displacement but also rigid rotation), a corotational FEM formulation was leveraged (see [30] for details). This allows a realistic simulation that captures the geometric nonlinearity of a hyperelastic material (i.e., small stresses produce large strains) in a cost-effective manner.
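As a small worked example of the linear constitutive parameters quoted above, the following sketch converts (E, ν) into the Lamé parameters of isotropic linear elasticity using the textbook relations. The function name is hypothetical, and how SOFA parameterizes the material internally is not stated in the text; the point is only to show that ν = 0.49 makes the skin nearly incompressible.

```python
def lame_parameters(young_modulus, poisson_ratio):
    """Convert (E, nu) to the Lame parameters (lambda, mu) of isotropic linear elasticity."""
    if not -1.0 < poisson_ratio < 0.5:
        raise ValueError("Poisson's ratio must lie in (-1, 0.5)")
    lam = (young_modulus * poisson_ratio
           / ((1.0 + poisson_ratio) * (1.0 - 2.0 * poisson_ratio)))
    mu = young_modulus / (2.0 * (1.0 + poisson_ratio))  # shear modulus
    return lam, mu


E = 0.1    # Young's modulus in N/mm^2, as identified in the text
NU = 0.49  # Poisson's ratio, as set in the text

lam, mu = lame_parameters(E, NU)
bulk = lam + 2.0 * mu / 3.0  # bulk modulus

print(f"lambda = {lam:.4f} N/mm^2, mu = {mu:.4f} N/mm^2, bulk = {bulk:.4f} N/mm^2")
```

The ratio lambda/mu is 2ν/(1 − 2ν), about 49 here, so volumetric stiffness dominates shear stiffness by well over an order of magnitude.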
2) Dynamic Analysis: The generic dynamic equation for a deformable volume is

$$M(q)\,\ddot{q} = F_{\text{ext}}(t) - F_{\text{int}}(q, \dot{q}) + J^{T}\lambda \tag{1}$$

where $q \in \mathbb{R}^{n}$ is the 3-D position of element nodes (corresponding to N DoFs), $M(q)$ is the mass matrix, $F_{\text{ext}}(t)$ denotes the external forces (e.g., gravity) at each time step $t$, and $F_{\text{int}}(q, \dot{q})$ represents the internal forces, which depend on the system state. Equation (1) is integrated over a specific time interval $[t_1, t_2]$, thus $dt = t_2 - t_1$, using the backward Euler integration scheme [31]

$$M(\dot{q}_2 - \dot{q}_1) = dt\,\bigl(F_{\text{ext}}(t_2) - F_{\text{int}}(q_2, \dot{q}_2)\bigr) + dt\,J^{T}\lambda. \tag{2}$$

Substituting the linearization of the internal forces $F_{\text{int}}(q_2, \dot{q}_2)$, obtained by a first-order Taylor series expansion (see [32] for more details), and the two relations $dq = q_2 - q_1 = dt\,\dot{q}_2$ and $d\dot{q} = \dot{q}_2 - \dot{q}_1$ into (2) yields

$$\underbrace{\bigl(M + dt\,C + dt^{2}K\bigr)}_{A}\, d\dot{q} = -dt^{2}K\dot{q}_1 - dt\,\bigl(F_{\text{int}}(q_1, \dot{q}_1) - F_{\text{ext}}^{2}\bigr) + dt\,J^{T}\lambda \tag{3}$$

where $F_{\text{ext}}^{2}$ is the external force at the next time step, and $K = \partial F_{\text{int}} / \partial q$ and $C = \partial F_{\text{int}} / \partial \dot{q}$ are the stiffness and damping matrices, respectively. The only unknown factor is $J^{T}\lambda$, which represents the contribution of tactile interaction in the form of constraints. The Jacobian matrix $J(q) = \partial \xi / \partial q$ gathers the normal and tangential constraint (i.e., contact force) directions of $\lambda$, which is equivalent to the magnitude of the contact forces projected onto the mapped DoFs. Note that the above contact responses obey a combination of Signorini's frictionless contact law [33] and Coulomb's frictional law [34]. This is mathematically expressed by the following complementarity condition:

$$\begin{cases} \text{In contact:} & \Delta_n = 0 \Rightarrow \lambda_n > 0 \\ \text{No contact:} & \Delta_n > 0 \Rightarrow \lambda_n = 0 \end{cases} \tag{4}$$

where $\Delta_n$ and $\lambda_n$ represent the distance between the two contacting bodies and the contact force measured along the normal direction $n$, respectively. Once a contact is detected, the contact response $J^{T}\lambda$ is computed so as to satisfy the above condition. A more in-depth explanation of this particular procedure can be found in [32].
To solve the linear equation (3), the SOFA framework offers several approaches. We leveraged the sparse $LDL^{T}$ factorization method [33] to decompose matrix $A$ as $A = LDL^{T}$, where $D$ is a diagonal matrix and $L$ is a sparse lower triangular matrix. Although this approach is quite costly, the reliability of the simulated mechanical behavior of the soft body (i.e., tactile skin) is assured.
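To make the time-integration step more concrete, the sketch below applies a linearized backward Euler update to a single-DoF mass-spring-damper, where the system matrix reduces to the scalar A = m + dt·c + dt²·k. It is an illustration only: the function and variable names are hypothetical, the contact term J^T λ is omitted, and the scalar division stands in for the sparse factorization that a full FEM system would need.

```python
def backward_euler_step(q, v, dt, m, k, c, f_ext):
    """One unconstrained backward Euler step for m*a = f_ext - f_int,
    with a linear internal force f_int = k*q + c*v (no contact term)."""
    f_int = k * q + c * v
    a_sys = m + dt * c + dt * dt * k              # scalar counterpart of the system matrix
    rhs = -dt * dt * k * v - dt * (f_int - f_ext)
    dv = rhs / a_sys                              # replaces the matrix factorization
    v_new = v + dv
    q_new = q + dt * v_new                        # dq = dt * v_new
    return q_new, v_new


# A stiff, undamped spring released from q = 1 with a deliberately large time step.
q, v = 1.0, 0.0
for _ in range(100):
    q, v = backward_euler_step(q, v, dt=10.0, m=1.0, k=100.0, c=0.0, f_ext=0.0)

# A state already at static equilibrium (k*q == f_ext, v == 0) is left unchanged.
q_eq, v_eq = backward_euler_step(2.0, 0.0, dt=0.01, m=1.0, k=2.0, c=0.0, f_ext=4.0)
```

The first run stays bounded despite the large step, which reflects the unconditional stability (and numerical damping) of the implicit scheme and is one reason it suits stiff soft-body simulation.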

C. Virtual Tactile Image Acquisition
The generation and acquisition of virtual (simulated) tactile images is performed using a combination of the Gazebo simulator and the Robot Operating System (ROS). The TacLink sensor is modeled as a robotic link using the Unified Robot Description Format (URDF), in which the geometric relations between sensor parts, such as housings and cameras, are defined precisely as in the design of the real device. In the URDF description, the Gazebo sensor plug-in providing the Wide Angle Camera Sensor camera type is installed to enable virtual cameras to render images of the artificial skin (tactile images). To capture realistic images of skin deformation, the topological meshes of the sensor's soft skin and markers (.STL format) generated from the SOFA simulation are updated at each time step and encoded in the SDF format to communicate with the Gazebo environment.

D. TacNet-Based Skin Shape Reconstruction
TacNet was developed to reconstruct, from a pair of tactile images, the geometrical shape of soft skin deformed by external forces in order to deduce high-level tactile perception. This vision-based reconstruction problem can be formulated as a multioutput regression task: given a pair of marker-featured tactile images $I = \{I_1, I_2\}$, where $I_1$ and $I_2$ are RGB images of size 640 × 480 × 3 pixels, the network estimates the displacement vectors ($D_{\text{estimated}}$) of nodes $\mathcal{N}$ ($|\mathcal{N}| = 707$ nodes) of a surface mesh representing the soft sensor skin (see Section III-B)

$$D_{\text{estimated}} = \{\, D_i = X_i - X_{0,i} \mid i \in \mathcal{M} \,\} \tag{5}$$

where $X_i \in \mathbb{R}^{3}$ is the 3-D position vector of one active/free node $i \in \mathcal{M}$, where $\mathcal{M}$ is the set of free nodes ($|\mathcal{M}| = n = 585$ nodes), and $X_{0,i} \in \mathbb{R}^{3}$ are the coordinates of the respective node in the original or nondeformed state of the artificial skin. Thus, from the estimated displacement vectors $D_{\text{estimated}}$ and the original nodal positions $X_0$, the skin shape can be reconstructed as $X = X_0 + D_{\text{estimated}}$.

The TacNet architecture is adapted from the proven U-Net convolutional network [35]. Basically, TacNet consists of a contracting convolution path connected with a reverse upconvolution path via skip connections, followed by two fully connected (FC) layers (see Fig. 3). For the input layer, we concatenate the two tactile images $I$, downsampled to 256 × 256, to form a six-channel input visual signal. The output signal, activated by the last two FC layers, is defined by a single dense layer with 1755 neurons (3 × 585) to represent the estimated displacement vectors $D_{\text{estimated}}$, which means that we consider every three adjacent neurons as one displacement vector.
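The mapping from the flat network output to a reconstructed skin shape can be sketched as follows. This is an illustrative helper, not the authors' code: the function name is hypothetical, the placeholder rest positions are made up, and the only property taken from the text is that every three adjacent output values form one (x, y, z) displacement vector for one of the 585 free nodes.

```python
def reconstruct_skin(outputs, rest_positions):
    """Add per-node displacements (flat list, three values per node) to rest positions."""
    if len(outputs) != 3 * len(rest_positions):
        raise ValueError("expected three output values per free node")
    shape = []
    for i, (x, y, z) in enumerate(rest_positions):
        dx, dy, dz = outputs[3 * i:3 * i + 3]   # three adjacent neurons = one vector
        shape.append((x + dx, y + dy, z + dz))
    return shape


NUM_FREE_NODES = 585
# Placeholder rest positions along a line; a real mesh would come from the FEM model.
rest = [(0.0, 0.0, float(i)) for i in range(NUM_FREE_NODES)]
outputs = [0.0] * (3 * NUM_FREE_NODES)          # 1755 values, as in the output layer
outputs[3] = -1.5                               # x-displacement of node 1

skin = reconstruct_skin(outputs, rest)
```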
2) TacNet Training and Loss Function: TacNet is trained completely on a simulation dataset with the input data $I_{\text{sim}}$ (pairs of images obtained from the simulated TacLink cameras) and corresponding output labels $D_{\text{FEM}}$ (ground-truth displacement vectors) generated, respectively, in the Gazebo/ROS and SOFA environments (see Sections III-C and III-B). We apply the mean squared error (MSE) loss as an objective function to minimize the differences between the ground-truth and estimated displacement vectors ($D_{\text{FEM}}$, $D_{\text{estimated}}$) and to optimize the weights $\theta$ of TacNet $T_\theta$

$$\theta^{*} = \arg\min_{\theta}\, \mathcal{L}_{\text{MSE}}(D_{\text{FEM}}, D_{\text{estimated}}) \tag{6}$$

where $D_{\text{estimated}} = T_\theta(I_{\text{sim}})$ and $\mathcal{L}_{\text{MSE}}(\cdot)$ is the MSE loss, given by

$$\mathcal{L}_{\text{MSE}} = \frac{1}{3n} \sum_{i \in \mathcal{M}} \sum_{j \in \{x, y, z\}} \bigl(d^{\,j}_{i,\text{FEM}} - d^{\,j}_{i,\text{estimated}}\bigr)^{2} \tag{7}$$

where $d_i^{\,j}\ \forall j \in \{x, y, z\}$ are the components of the displacement vector $D_i$ at the respective skin node $i \in \mathcal{M}$ along the x, y, and z axes. The MSE loss in (7) computes the difference in every vector component (i.e., every output neuron) to encourage the learning of both the magnitude and the direction of the displacement vectors. For the optimization (6), we use an iterative stochastic gradient descent optimizer with an experimentally tuned learning rate of 0.015.
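A minimal sketch of a component-wise MSE over displacement vectors is given below, assuming the loss is averaged over all vector components; the function name and the toy vectors are hypothetical. It illustrates why a per-component loss is sensitive to direction: two vectors of equal magnitude but different direction still incur a nonzero penalty.

```python
def mse_loss(d_ref, d_est):
    """Mean squared error over every component of every displacement vector."""
    if len(d_ref) != len(d_est):
        raise ValueError("both inputs must contain the same number of vectors")
    total, count = 0.0, 0
    for v_ref, v_est in zip(d_ref, d_est):
        for a, b in zip(v_ref, v_est):
            total += (a - b) ** 2
            count += 1
    return total / count


# Same magnitude, different direction: the loss is still nonzero.
direction_error = mse_loss([(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)])

# Same direction, different magnitude along z at one of two nodes.
magnitude_error = mse_loss([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
                           [(0.0, 0.0, 0.0), (1.0, 1.0, 4.0)])
```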

E. Real-to-Simulation Generative Network
The main purpose of the R2S-GN is to transform real tactile images (I real ) into ones (transformed images I tf ) that resemble visual inputs of the simulation dataset (I sim ) before they are fed to TacNet, so the performance of TacNet-based 3-D shape reconstruction is maintained in real-world deployment. For this purpose, the R2S-GN is trained in an adversarial manner, where it plays a role as a generator in a traditional GAN, competing against a discriminator in order to achieve its best in the transformation task.
1) Network Architectures: We exploited an adapted version of the U-Net convolutional network and the PatchGAN model, as described in [36], for the architecture of the R2S-GN generator (G φ ) and discriminator (D ψ ), respectively. G φ takes as input the downsampled real images (I real , 256 × 256 × 3) on an encoder path and outputs the transformed counterparts (I tf ) on a reverse decoder path. Meanwhile, the discriminator (D ψ ) receives a 256 × 256 × 3 pixel input image and classifies whether the input image is real or fake. Details of the network parameters for the G φ and D ψ architectures can be found in [36].
2) R2S-GN Loss Function: We propose a hybrid loss function L R2S-GN that is used to train the R2S-GN generative network (G φ ). This loss function comprises three terms: the conditional GAN (cGAN) adversarial objective, the ℓ1 distance, and the structural similarity (SSIM) loss.

a) Image appearance loss: Inspired by Zhao et al. [37], we introduce an appearance loss that combines the ℓ1 distance with the SSIM metric (for image quality quantification) [38]. This loss function, which is evaluated on a pixel-by-pixel basis, is defined to match the appearance of the fake transformed images I tf with the real simulation ones I sim , as well as to ensure SSIM between them. This is vital for generating images with the same geometric perspective as the simulation ones, in order to maintain the skills of the simulation-trained TacNet. For a given batch of one training sample, this loss is given by (8), where we apply an 11 × 11 Gaussian kernel for the computation of the SSIM metric.
b) Adversarial loss: In addition to the appearance loss, we adopt the cGAN objective [36] for an adversarial loss term. For a given real tactile image I real , the adversarial loss for the R2S-GN G φ is expressed by (9), where D ψ , besides observing the transformed version of the tactile image I tf = G φ (I real ), is conditioned on the input of G φ , namely I real . This conditional discriminator has been shown to improve the performance of numerous image translation tasks [36]; thus, we applied this concept in our real2sim network. Intuitively, the R2S-GN G φ tries to minimize this objective function by generating a transformed image that can fool the adversarial discriminator D ψ into predicting it as a real simulation image. As a result, the overall loss objective for the R2S-GN G φ is the combination of the image appearance loss and the cGAN adversarial loss, given by (10), where we set the hyperparameters α = 100, β = 200, and γ = 1, which are tuned experimentally. Finally, for training the adversarial discriminator D ψ , we use the cGAN objective as described in [36]. For one training sample, the discriminator loss is given by (11). Its second term characterizes the adversarial training behavior, where the discriminator tries to maximize the R2S-GN's adversarial objective [see (9)] while the R2S-GN attempts to minimize it. The overall loss [see (11)] indicates that the discriminator does its best at discriminating the transformed images I tf from the simulation ones I sim , which, in turn, pushes the R2S-GN to generate I tf that more closely matches the appearance of I sim .

Fig. 4. Training scheme for the R2S-GN model, mostly following the procedure described in [27], but modified to include the R2S-GN loss (see Section III-E3 for details).
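For orientation, one way the loss terms described in this subsection could be written is sketched below. This is a hedged reading rather than a restatement of (8)–(11): it assumes the standard pix2pix form of the cGAN objective [36], assumes that the structural term is 1 − SSIM, and assumes that α weights the ℓ1 term and β the SSIM term; none of these choices is confirmed by the text.

```latex
\begin{align}
% Appearance terms (assumed forms)
\mathcal{L}_{L1} &= \lVert I_{sim} - I_{tf} \rVert_{1}, \qquad
\mathcal{L}_{SSIM} = 1 - \mathrm{SSIM}(I_{sim}, I_{tf}) \\
% Generator adversarial term, conditioned on the real input image
\mathcal{L}_{adv}(G_\phi) &= \log\bigl(1 - D_\psi(I_{real}, G_\phi(I_{real}))\bigr) \\
% Hybrid generator objective (assumed pairing of weights and terms)
\mathcal{L}_{R2S\text{-}GN} &= \alpha\,\mathcal{L}_{L1} + \beta\,\mathcal{L}_{SSIM} + \gamma\,\mathcal{L}_{adv} \\
% Discriminator loss: the second term is the negated generator objective
\mathcal{L}_{D}(D_\psi) &= -\log D_\psi(I_{real}, I_{sim}) - \log\bigl(1 - D_\psi(I_{real}, I_{tf})\bigr)
\end{align}
```

Under this reading, minimizing the discriminator loss maximizes the generator's adversarial term, which matches the min-max behavior described in the text.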
3) R2S-GN Training: We follow the typical procedure of adversarial GAN training [27] for optimizing the weights of the R2S-GN G φ (see Fig. 4). Specifically, for discriminator training, we set its output label to a positive class (real ) given that the input is a simulation image, and to negative class (fake) provided that the input is a transformed one. As for the R2S-GN, in addition to computation of the L img loss, the output label of D ψ is set to the real positive class in order to promote the adversarial L adv loss [36]. For the learning process, we used the Adam optimizer with linear learning rate scheduling [39], initialized at 0.0002 and set to decay at the 100th iteration out of a total of 200 training steps.
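The learning-rate schedule described above can be sketched as a plain function. The sketch assumes that the rate is held constant until step 100 and then decays linearly to zero at step 200; the text gives only the initial value, the decay start, and the total number of steps, so the end value and the function name are assumptions.

```python
def linear_decay_lr(step, base_lr=2e-4, decay_start=100, total_steps=200):
    """Constant learning rate until decay_start, then linear decay to zero at total_steps."""
    if step < decay_start:
        return base_lr
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - decay_start)


schedule = [linear_decay_lr(step) for step in range(201)]
```

In a deep learning framework, the same rule would typically be passed to a scheduler that multiplies the optimizer's base rate by a step-dependent factor.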

IV. LARGE-SCALE TACTILE PERCEPTION METHODOLOGY
Large-scale ViTac sensors are suitable to offer multipoint physical interactions, which embrace new possibilities for tactile interfaces and make them unique compared to their small-sized counterparts (e.g., tactile fingertips). Among information that can be extracted from the multipoint stimuli, the identification of contact locations on an artificial skin body has found practical use in robotics tasks, such as providing feedback signals for a collision handling framework [40]. Therefore, in addition to a simple method for contact event detection (see Section IV-A), we developed an algorithm to identify multiple contact locations on a large-scale skin (see Section IV-B), which are reasoned from the TacNet model.

A. Touch Sensing
The detection of touch/contact events is considered to be fundamental in safety-critical robotic systems [40]. Here, we present a method to extract contact detection signals based on the recognition of skin deformation.
The contact detection problem can be formulated as a binary classification task: given the displacement vectors $D_{\text{estimated}}$ estimated by TacNet (5), we assign a contact detection signal, which is 0 for data without contact and 1 for data with contact detected. Thus, the contact detection signal can be derived as

$$\text{contact} = \begin{cases} 1, & \text{if } \exists\, i \in \mathcal{M} : \lVert D_{\text{estimated},i} \rVert > \epsilon_c \\ 0, & \text{otherwise.} \end{cases} \tag{12}$$

In other words, for each contact detection, we set a contact detection threshold $\epsilon_c$ on the estimated displacement magnitude of the free skin nodes $D_{\text{estimated},i}\ \forall i \in \mathcal{M}$, where $\epsilon_c$ depends mainly on the accuracy of the TacNet estimation, which influences detection sensitivity and accuracy. The detection threshold is determined such that contact detection performance reaches a good compromise between precision and recall, which are the metrics of a general binary classifier. We use the simulation dataset to determine the detection threshold, which is expected to transfer well to the distribution of real data. The results of this contact classifier are discussed in more detail in Section V-D.
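The thresholding rule and the precision/recall trade-off used to pick the threshold can be sketched as below. The function names, the example threshold values, and the use of a strict inequality are illustrative assumptions; the actual threshold used by the authors is not given in this section.

```python
import math


def detect_contact(displacements, threshold):
    """Return 1 if any nodal displacement magnitude exceeds the threshold, else 0."""
    return int(any(math.sqrt(dx * dx + dy * dy + dz * dz) > threshold
                   for dx, dy, dz in displacements))


def precision_recall(predictions, labels):
    """Precision and recall of binary predictions against binary ground-truth labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


no_touch = detect_contact([(0.0, 0.0, 0.1)], threshold=0.5)
touch = detect_contact([(0.0, 0.0, 0.1), (3.0, 4.0, 0.0)], threshold=4.9)
scores = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
```

Sweeping the threshold over a labeled simulation set and evaluating `precision_recall` at each value is one straightforward way to find the compromise described above.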

B. Multipoint Contact Localization
This section presents an algorithm that can identify the contact positions at multiple points on the sensing link. This detection method assumes that any contacts occurring between the sensor skin and external objects are point contacts, which is considered to be a reasonable assumption in practical applications [40]. In general, the algorithm applies the concept of graph-theory-based connected-component labeling [41] to extract contact regions, a procedure we name contact region labeling (CRL), from which the contact positions are identified. Here, we model the mesh of the artificial skin as an undirected graph, $G = (V, E)$, whose vertices represent the mesh nodes ($|V| = |\mathcal{N}| = N$) and contain information on the displacement vectors estimated by TacNet ($D_{\text{estimated}} \in \mathbb{R}^{3N}$). Besides this, every graph node contains a fixed radial vector pointing toward the central axis of the skin, which is used to determine the inward-deflected nodes. Thus,

$$N_i = \begin{bmatrix} 0 & 0 & x^{z}_{0,i} \end{bmatrix}^{T} - X_{0,i} \tag{13}$$

where $x^{z}_{0}$ is the z-component of the nodal positions $X_0$ in the undeformed state.
To run CRL for the extraction of distinct contact regions, we need to determine which nodes of the skin are likely to be experiencing contact. Accordingly, we define an $N$-tuple of binary nodal contact signals $s = (s_1, \ldots, s_N) \in \mathbb{Z}_2^{N}$, where each element $s_i \in \{0, 1\}$ is such that $s_i = 1$ indicates that the corresponding node $i \in \mathcal{N}$ is in contact and lies in one contact region; otherwise, $s_i = 0$ signifies that the given node remains intact. Specifically, the nodal contact signal for each node $i \in \mathcal{N}$ is derived as

$$s_i = \begin{cases} 1, & \text{if } \lVert D_{\text{estimated},i} \rVert > \epsilon_d \text{ and } d_{\text{sim},i} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

$$d_{\text{sim},i} = \frac{D_{\text{estimated},i} \cdot N_i}{\lVert D_{\text{estimated},i} \rVert\, \lVert N_i \rVert}. \tag{15}$$

In other words, a node is considered to lie in a contact region if its nodal displacement exceeds a constant threshold $\epsilon_d$ and its displacement vector points toward the central axis of the skin. Under contacts mostly due to pushing/pressing actions, the latter condition helps to restrict contact regions to those that contain nodes deflecting inward, as opposed to those regions that bulge out. This is measured by the directional similarity term $d_{\text{sim}} \in [-1, 1]$ (15), which is, in fact, $\cos\varphi_i$ (where $\varphi_i$ is the angle between the two vectors $D_{\text{estimated},i}$ and $N_i$).

Given the skin graph $G$ and the nodal contact signals $s$, we perform the CRL procedure to extract possibly multiple distinct contact regions (see Algorithm 1: CRL function). This procedure employs depth-first search (DFS) to traverse the vertices $V$ of graph $G$ that contain the corresponding nodal information of $s$. On the search path, it selectively assigns a contact region label $l \in \{1, \ldots, L\}$ ($L$ is the number of contact regions) to every node that holds the signal $s_i = 1$, such that a cluster of contacted nodes (or a contact region), separated from others by undeformed nodes ($s_i = 0$), shares the same region label $l$. As a result, we obtain a set of labels $y = (y_1, \ldots, y_N) \in \mathbb{Z}_{L+1}^{N}$ whose element $y_i \in \{0, 1, \ldots, L\}$ corresponds to the region label of node $i \in \mathcal{N}$, and $y_i = 0$ marks nodes inside the undeformed region.
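The nodal contact test described above can be sketched as follows, assuming that the skin's central axis coincides with the z-axis and that "pointing toward the central axis" means a strictly positive cosine similarity. The function names, the sample node positions, and the threshold value are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def inward_radial_vector(rest_position):
    """Vector from a node to the point on the z-axis at the same height."""
    x, y, _z = rest_position
    return (-x, -y, 0.0)


def directional_similarity(d, n):
    """Cosine of the angle between d and n; zero if either vector is degenerate."""
    denom = norm(d) * norm(n)
    if denom == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(d, n)) / denom


def nodal_contact_signals(displacements, rest_positions, threshold):
    """Binary signal per node: displaced beyond the threshold AND deflected inward."""
    signals = []
    for d, x0 in zip(displacements, rest_positions):
        inward = directional_similarity(d, inward_radial_vector(x0)) > 0.0
        signals.append(1 if norm(d) > threshold and inward else 0)
    return signals


rest = [(10.0, 0.0, 5.0)] * 3
disp = [(-2.0, 0.0, 0.0),   # pushed inward beyond the threshold
        (2.0, 0.0, 0.0),    # bulging outward
        (-0.5, 0.0, 0.0)]   # inward but below the threshold
signals = nodal_contact_signals(disp, rest, threshold=1.0)
```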
From $y$, we can extract the contact regions, which are represented by node indices. Thus, for a given contact region $R_l$ that has the region label $l$, we have

$$R_l = \{\, i \in \mathcal{N} \mid y_i = l \,\}. \tag{16}$$

Finally, for every single contact region $R_l$, we search for the node $i_l^{*} \in R_l$ that maximizes the displacement magnitude and consider it as the location where a contact occurs

$$i_l^{*} = \arg\max_{i \in R_l} \lVert D_{\text{estimated},i} \rVert. \tag{17}$$

From that, the contact positions $X_c = (x^{c}_1, \ldots, x^{c}_L) \in \mathbb{R}^{3L}$ can be identified from the extracted contact regions as the positions of the nodes $i_l^{*}$ (18). In addition, the corresponding contact depths $\hat{d}(x^{c}_l)$ can be derived as

$$\hat{d}(x^{c}_l) = \lVert D_{\text{estimated},\,i_l^{*}} \rVert. \tag{19}$$

The step-by-step algorithm for multipoint contact sensing is described in Algorithm 1, whose complexity, $O(|V| + |E|)$, depends mainly on the size of the skin graph. In addition, the spatial resolution is defined by the fineness of the constructed skin mesh, which poses a tradeoff between resolution and computational cost: the greater the resolution, the longer the computation time. Moreover, the assumption of point contact can be relaxed if contacts induce a concave deformation of the skin surface, whereby the detector yields an approximate contact position at the node that was displaced the deepest; however, detection accuracy would fall as the contact area grows. Finally, in cases where the regions of multiple contacts overlap, such as when two discrete contact points are sufficiently close, the detector might merge the different regions into a single large contact area. This sensing behavior, mostly affected by the distance between two contact points and the selection of the threshold $\epsilon_d$, is discussed along with localization accuracy in Section V-E.
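The region labeling and localization steps can be sketched with an iterative depth-first search over an adjacency list. This is an illustrative implementation of connected-component labeling restricted to contacted nodes, not a transcription of Algorithm 1; the function names and the six-node chain used as a toy skin graph are hypothetical, and each contact is reported as a node index with its depth.

```python
import math


def contact_region_labeling(adjacency, signals):
    """Label connected clusters of nodes with signal 1 (label 0 = no contact)."""
    labels = [0] * len(signals)
    num_regions = 0
    for start, s in enumerate(signals):
        if s != 1 or labels[start] != 0:
            continue
        num_regions += 1
        labels[start] = num_regions
        stack = [start]                       # iterative depth-first search
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if signals[neighbor] == 1 and labels[neighbor] == 0:
                    labels[neighbor] = num_regions
                    stack.append(neighbor)
    return labels, num_regions


def localize_contacts(labels, displacements):
    """For each region, return (node index, depth) of the most displaced node."""
    best = {}
    for i, (label, d) in enumerate(zip(labels, displacements)):
        if label == 0:
            continue
        depth = math.sqrt(sum(c * c for c in d))
        if label not in best or depth > best[label][1]:
            best[label] = (i, depth)
    return [best[label] for label in sorted(best)]


# Toy skin graph: a chain of six nodes with two separated contact clusters.
adjacency = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
signals = [1, 1, 0, 1, 1, 0]
displacements = [(-1.0, 0.0, 0.0), (-3.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                 (-4.0, 0.0, 0.0), (-2.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

labels, num_regions = contact_region_labeling(adjacency, signals)
contacts = localize_contacts(labels, displacements)
```

Each node and edge is visited a bounded number of times, so the labeling runs in time linear in the size of the graph. Setting `signals[2] = 1` in the toy example would merge the two clusters into one region, which mirrors the overlapping-contact case mentioned above.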

V. PERFORMANCE EVALUATION
To evaluate the performance of the proposed SimTacLS framework, we conducted a series of experiments on tactile perception. We used a desktop PC (AMD Ryzen Threadripper 3970X processor) with GPU acceleration (NVIDIA RTX 8000) for model training and inference. Demonstrations are available in the supplementary video associated with this article.

A. Data Collection
In this article, we collected several datasets from simulation and actual models, both for training and for assessing the feasibility of SimTacLS. The details are listed in Table I. The contact locations were set among the free nodes $\mathcal{M}$; thus, the total number of sampled points for the single-contact dataset was 585. To test whether the method could achieve tactile perception in complex scenarios with limited prior experience, 500 contact groups, each consisting of two arbitrary points among $\mathcal{M}$, were made to produce the double-contact dataset. An experimental setup with an identical reference coordinate system, as illustrated in Figs. 2 and 5, was constructed to collect real tactile images. The setup included three motorized linear stages (Suruga Seiki Co., Ltd., Japan), a rotating motor (Dynamixel XH430-W350-R, ROBOTIS, Inc., USA), and a stepping motor controller (DS102, Suruga Seiki Co., Ltd., Japan), fixed on a testbed (see Fig. 5). The X-axis stage (PG750-L05AG-UA) drives a spherical-head indentor (12 mm diameter) designed to push one node at a time to the desired contact depth on the skin. The contact locations on the skin outer surface were reached by horizontal movement of the indentor and rotation of the TacLink sensor, facilitated by a Z-axis linear carrier (KZS18300) and a Z-axis rotating motor, respectively. Meanwhile, the Y-axis stage was preadjusted to ensure that the nominal axis of the indentor intersects the Z-axis of the reference coordinate system (i.e., the centerline of the TacLink sensor).

B. Image Transformation With R2S-GN Loss
The performance of the R2S-GN was evaluated by the similarity between pairs of transformed and simulation (baseline) images in terms of spatial image structure. We measured the SSIM index and the complement of the per-pixel root mean square error ($\overline{\text{pixRMSE}} = 1 - \text{pixRMSE}$) of the simulation-transformed image pairs. In addition, we compared the R2S-GN model learned with the R2S-GN training objective $\mathcal{L}_{\text{R2S-GN}}$ [see (10)] against one trained using solely the adversarial loss $\mathcal{L}_{\text{adv}}$ and another trained with $\mathcal{L}_{\text{adv-L1}} := \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{L1}}$ (where $\mathcal{L}_{\text{L1}}$ is the $\ell_1$-distance loss). The R2S-GN was trained with a total of 18 640 pairs of single-contact actual-simulation images, and 4780 image pairs capturing both single (4660 pairs) and double contacts (120 pairs) were reserved for evaluation. Fig. 6(a) shows the SSIM between the test simulation images and the real images transformed by the three variants of the R2S-GN (trained, respectively, with the three aforementioned losses), and how this resemblance varies with increasing contact (touch) depth. In fact, the $\mathcal{L}_{\text{R2S-GN}}$-based R2S-GN generates images that are more similar to the simulation baselines, providing an average SSIM of 0.96 and an average $\overline{\text{pixRMSE}}$ of 0.95 at 20 mm contact depth, compared, respectively, with 0.91 and 0.90 for the $\mathcal{L}_{\text{adv}}$-based transformation model. Over the observed range of contact depth $d_c \in [1, 20]$ mm, the former model shows a slight drop in both SSIM and $\overline{\text{pixRMSE}}$ (around 3.5%), whereas the latter shows a more significant (7%) drop in SSIM. Moreover, Fig. 6(b) displays representative test samples of single- and double-contact tactile images at a contact depth of 15 mm. It shows that the R2S-GN generalizes well to unseen tactile images even though the model was never trained on double-contact images, which once again confirms the effectiveness of the proposed R2S-GN loss.
In the next subsection, we further evaluate the effectiveness of the R2S-GN variants in addressing the sim2real problem.
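The two similarity scores can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: pixel intensities are assumed to be normalized to [0, 1] so that the RMSE complement stays in [0, 1], and the SSIM shown is a simplified single-window (global) version of the standard formula, not the sliding-window implementation that image libraries typically provide and that the reported values may have used. Function names are hypothetical.

```python
import numpy as np


def pix_rmse_complement(a, b):
    """1 - per-pixel RMSE for two images with values ASSUMED in [0, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.sqrt(np.mean((a - b) ** 2))


def global_ssim(a, b, data_range=1.0):
    """SSIM computed over the whole image as a single window (simplified)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    # Standard stabilizing constants of the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    num = (2.0 * mu_a * mu_b + c1) * (2.0 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den
```

Both scores equal 1 for identical images and decrease as the transformed image departs from its simulation baseline, which is why they can be averaged per contact depth as in Fig. 6(a).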

C. Sim2Real Transferability of Contact Depth
The performance of the TacNet-based shape reconstruction was verified by evaluating the measurement error of the local contact depth [see (19)] on both simulation and real datasets to demonstrate sim2real effectiveness. For evaluation, a U-Net-based TacNet with 2048 neurons in each of the last two FC layers was employed, since fivefold cross validation showed it to outperform other model backbones, such as VGG and ResNet, in terms of inference accuracy, speed, and memory usage (see Fig. 7). The TacNet model was trained entirely on the simulation dataset, including both single- and double-contact images (28 055 pairs of virtual tactile images), of which 20% of the data of each contact type were withheld as a test fold for validation. For sim2real evaluation, we experimented on a subset of real double-contact images and the full set of real single-contact images corresponding to the simulation test fold (see Table I).
The experimental results showed that measurement errors increased with the true contact depth ($d_c$) for simulation inputs and for real inputs translated by the $\mathcal{L}_{\text{R2S-GN}}$-based model, whereas raw real images, which did not pass through the R2S-GN model, produced significant errors, with estimated values remaining unchanged [see Fig. 8(a) and (b)]. The absolute errors at $d_c = 20$ mm were below 2 and 4 mm, corresponding to full-scale errors of approximately 10% and 20% (with FS = 20 mm) for simulation and translated inputs, respectively. Fig. 8(c) shows that the $\mathcal{L}_{\text{R2S-GN}}$-based R2S-GN model was superior to the two other variants trained with $\mathcal{L}_{\text{adv}}$ and $\mathcal{L}_{\text{adv-L1}}$, reducing the full-scale error by around 25% and 10%, respectively, at $d_c = 20$ mm. In addition, we visualize the skin shape reconstruction for two representative scenarios of single- and double-point contact at depth $d_c = 15$ mm (see Fig. 9). A highly similar sensing pattern between simulation and real (via the $\mathcal{L}_{\text{R2S-GN}}$-based R2S-GN) samples was observed in the case of single contact, with an absolute error of around 1.5 mm. In the case of double contact, the mean absolute errors at the two contact patches were 1.31 ± 0.65 mm and 2.92 ± 0.50 mm for the virtual and translated real inputs, respectively. The greater sim2real discrepancy in the double-contact case arose because the R2S-GN had not been trained on double-touch data, which probably results in greater dissimilarity in image structure, especially at large contact depths.
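As a quick check of the quoted percentages, the full-scale error is the absolute depth error normalized by the 20 mm full-scale depth. The symbol $e_{\mathrm{FS}}$ is introduced here for illustration only and does not appear in the original equations:

```latex
e_{\mathrm{FS}} = \frac{\lvert \hat{d} - d_c \rvert}{\mathrm{FS}} \times 100\%,
\qquad
\frac{2\ \mathrm{mm}}{20\ \mathrm{mm}} = 10\%,
\qquad
\frac{4\ \mathrm{mm}}{20\ \mathrm{mm}} = 20\%.
```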
It is worth noting that the sensing performance (specifically, the recognition of contact depth) depends on the region of the skin. We conducted an experiment in which contact was made at ten locations along a longitude line of the skin. At each contact location, two contact depths of 5 and 10 mm were applied. The measured data (contact depth inferred from real images passing through the $\mathcal{L}_{\text{R2S-GN}}$-based R2S-GN) and the ground truths are shown in Fig. 10, revealing that the sensitivity decreases in the equatorial area of the skin. The same issue was found in previous research [11], in which a detection error of around 7% of full scale was obtained with careful calibration of the cameras. Therefore, the accuracy can be improved by thorough calibration in which calibration parameters are identified separately for the respective contact regions. With that approach, however, each fabricated sensor needs its own camera calibration, even though the design is similar. In this research, our proposed method does not require any calibration, thus accommodating variations in the fabrication process. One can consider using our method with calibrated parameters to increase the accuracy of the sensing operation. Last but not least, while the contact depth accuracy depends considerably on the region of the skin, this problem is not seen in contact localization. In fact, as presented in Section V-E, the variation of localization errors is not significant when evaluated over a wide range of contact regions.
Overall, in the context of sim2real transfer for large-scale ViTac sensing, the obtained results, to our knowledge, set a benchmark for further development, and the reconstruction errors are within an acceptable range compared with previous work [11], [12].

D. Sim2Real Transferability of Contact Detection
This section examines whether the performance of contact detection learned from virtual data, specifically from the outputs inferred by TacNet, can be transferred to the real data domain with the help of the R2S-GN model. We first determine a suitable contact detection threshold $\epsilon_c^*$ that maximizes the detection capability on the virtual image dataset. The selection was based on an analysis of the precision-recall tradeoff [39] of the touch classifier over a finite range of the decision threshold ($\epsilon_c$). Based on the precision-recall plot [see Fig. 11(a)], it is reasonable to use a contact threshold value of 0.6 mm for sim2real evaluation, which maximizes contact sensing performance with 100% recall and precision.
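A threshold sweep of this kind can be sketched as follows. The details are assumptions for illustration: each image is summarized by a single scalar score (for instance, the largest estimated nodal displacement in mm), an image is declared "in contact" when that score exceeds the candidate threshold, and candidates are ranked by F1 score as one simple way to summarize the precision-recall tradeoff. The paper's own selection was made by inspecting the precision-recall plot.

```python
import numpy as np


def precision_recall(scores, is_contact, eps_c):
    """Precision and recall of the rule `score > eps_c` against ground truth."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(is_contact, dtype=bool)
    pred = scores > eps_c
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def select_threshold(scores, is_contact, candidates):
    """Return (eps_c, precision, recall) with the highest F1 among candidates."""
    best = None
    best_f1 = -1.0
    for eps_c in candidates:
        p, r = precision_recall(scores, is_contact, eps_c)
        f1 = 2.0 * p * r / (p + r) if (p + r) else 0.0
        if f1 > best_f1:  # ties keep the earliest (smallest listed) candidate
            best_f1 = f1
            best = (eps_c, p, r)
    return best
```

With well-separated scores for contact and non-contact images, any threshold lying in the gap yields 100% precision and recall, which matches the situation described for the simulated dataset.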
The accuracy of contact detection evaluated using the test simulation dataset and the corresponding real images is presented in Fig. 11(b). All the raw real images capturing the nondeformed skin are mistakenly classified as contact events (95% precision). However, the result [see Fig. 11(b)] reveals that this sim2real problem can be addressed by the intervention of the R2S-GN. In fact, passing real images through the R2S-GN model allows successful transfer of the threshold learned in simulation to the domain of real images, in which the best precision and recall values are retained (i.e., 100%).

E. Sim2Real Transferability of Double-Contact Localization
This subsection examines the accuracy and transferability of contact localization in the scenario of double contact. Three contact groups (Groups I, II, and III), distinguished by the vertical distance between the two random contact points [180, 140, and 100 mm, respectively, as shown in Fig. 12(a)], were tested. For each group, based on our proposed localization method (see Section V-E), we determined from SOFA simulation data ($D_{\text{FEM}}$) the range of contact depth over which the two separate contact areas were recognized with respect to the threshold $\epsilon_d$ [see (14)]. Group I showed the largest detection range, followed by Group II, and the detection range widened as the threshold increased [see Fig. 12(a)]. Furthermore, Fig. 12(b) compares these results with the estimated displacement data ($D_{\text{estimated}}$) inferred from virtual tactile images (second column) and real tactile images (third column) at $\epsilon_d = 9$ mm, which yields the highest two-point detection accuracy. Except for Group I, whose result is still considered acceptable in both cases, the detection range of Group II was significantly reduced, while contacts in Group III (where the two points are relatively close) were detected as one large contact area. The error came from the fact that our method utilized the displacement of nodes located not only at the actual contact sites but also in the regions surrounding them, i.e., the contact regions. This causes a narrow two-point detectable range of contact depth when $\epsilon_d$ is small [see Fig. 12(a)]. Another reason is that occlusion occurs more easily when two points are close together. If two points share the same height or are offset from the vertical direction, such situations are anticipated to be less critical than the tested cases because every horizontal cross section of the TacLink skin is parallel to the image planes, so a clear view of the contact areas (less occlusion) can be expected.
Fig. 13(a) shows the averaged localization errors between estimated and actual in-touch positions using simulation and transformed real tactile images for Groups I and II, while Fig. 13(b) visualizes the localization task in action at $d_c = 15$ mm. Overall, the obtained results reveal the feasibility of sim2real transfer for multipoint contact localization. However, some sim2real gaps remain because TacNet was trained with a limited double-touch dataset, and the R2S-GN had no prior knowledge of double contacts and was not powerful enough to handle such complex interactions.

VI. APPLICATIONS OF LARGE-SCALE TACTILE SENSING
This section describes two trial cases, including nonprehensile manipulation and haptic interface, for TacLink and its transferred tactile information of multipoint contact depth and contact location. In order to perform the tasks, a large-scale TacLink sensor utilized as a robot link (forearm) was attached to the elbow joint of a three-DoF custom-built robot arm [see Fig. 14(a)]. Note that spatial data defined in this section are all referenced to the fixed space frame {s} of the robot base, excluding those specified by left superscripted indices.

A. Nonprehensile Manipulation by Whole-Arm Pushing
This section showcases how a three-DoF robot arm can push an object toward a goal with the help of an attached TacLink sensing device providing contact localization. For simplicity, we restricted the pushing task to the $\hat{y}_s\hat{z}_s$ plane of frame {s}. To perform the task, we employed simple proportional controllers that take as feedback the 3-D position of the pushed object $x_{\text{object}} \in \mathbb{R}^3$, determined from the contact made with TacLink, to compute the desired spatial velocity ${}^{c}V_d \in \mathbb{R}^6$ expressed in the contact frame {c} and guide the object toward the preset goal. The {c} frame, represented by the rotation matrix $R_c$, was defined to have its origin at the contact location $x^c$ [retrieved from (18)], with the $\hat{y}_c$ and $\hat{z}_c$ axes pointing along the outward normal of the contact plane and along the z-axis of the TacLink frame (see Fig. 5), respectively, while the $\hat{x}_c$-axis completes the frame by the right-hand rule. Given the object position ($x_{\text{object}} \equiv x^c$) and the goal location $x_{\text{goal}} \in \mathbb{R}^3$, the pushing task was controlled such that the pushing direction $n_{\text{push}}$ was perpendicular to the contact plane ($\hat{y}_c \equiv n_{\text{push}}$). Therefore, the required angular velocity $\omega_d \in \mathbb{R}^3$ to achieve the desired pushing direction is set proportional, with gain $k_\omega > 0$, to the exponential coordinates $\hat{\omega}_d\theta \in \mathbb{R}^3$ of the rotation matrix $\tilde{R} := \mathrm{Rot}(\hat{\omega}_d, \theta) = R_{\text{push}} R_c^T \in SO(3)$, where $R_{\text{push}} := [\hat{x}_s, n_{\text{push}}, \hat{x}_s \times n_{\text{push}}]$, which rotates the contact frame {c} toward the pushing direction. In addition, since the contacted object is deliberately pushed along $n_{\text{push}}$, the commanded linear velocity is directed along $n_{\text{push}}$. On top of that, respecting a typical safe human-robot interaction scenario, we imposed a condition on the proposed pushing control to halt the robot motion in the event that an unplanned (external) contact occurred, i.e., a contact other than with the target pushed object.
Hence, assuming that there always exists one contact with the target object, the robot motion is halted whenever an additional contact region is detected. Finally, the desired twist ${}^{c}V_d$ was mapped to the commanded joint velocity $\dot{\theta} \in \mathbb{R}^3$ through the Jacobian ${}^{c}J \in \mathbb{R}^{6\times3}$ at the contact point. The results of the described contact-based pushing experiment are shown in Fig. 14, wherein the goal position was set as $x_{\text{goal}} = [-0.01, -0.17, 0.73]^T$, and the proportional gains were experimentally tuned as $k_v = 0.12$ and $k_\omega = 0.35$. During the pushing trial, the primary contact with the object (i.e., a water-filled bottle) maintained a relatively stable contact intensity of around 14 mm, except when an unplanned contact occurred [see Fig. 14(b)]. Once a human suddenly touched the TacLink, triggering a secondary (external) contact phase, all robot motions halted, so that the goal error (defined as $x_{\text{goal}} - x_{\text{object}}$) and the time-dependent object position ($x_{\text{object}}$) remained unchanged [see Fig. 14(b) and (c)]. Over time, the pushed object gradually reached the preset goal, with the entire process taking around 60 s. However, a small settling error remained along the z-direction, resulting in a goal error of roughly 0.05 m. This might be addressed by incorporating an integral term into the controller.
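The structure of such a proportional pushing law can be sketched in Python. This is a sketch under several assumptions and not the authors' controller: the pushing direction is taken as the unit vector from the object to the goal projected onto the $\hat{y}_s\hat{z}_s$ plane, the linear command is the position error scaled by the gain, both commands are expressed in the space frame rather than the contact frame, and the halt rule triggers when more than one contact region is reported. Only the gains 0.12 and 0.35 are taken from the text; the function names are hypothetical.

```python
import numpy as np


def so3_log(R):
    """Exponential coordinates (unit axis times angle) of a rotation matrix.

    Valid away from a rotation angle of pi, where sin(theta) vanishes.
    """
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-9:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return axis * theta


def push_command(x_object, x_goal, R_c, n_regions, k_v=0.12, k_w=0.35):
    """Return (linear velocity, angular velocity) for one control step."""
    if n_regions > 1:
        # Unplanned extra contact: halt all motion.
        return np.zeros(3), np.zeros(3)
    err = np.asarray(x_goal, dtype=float) - np.asarray(x_object, dtype=float)
    err[0] = 0.0  # task restricted to the y-z plane of the space frame
    dist = np.linalg.norm(err)
    if dist < 1e-9:
        return np.zeros(3), np.zeros(3)
    n_push = err / dist  # ASSUMED definition of the pushing direction
    x_s = np.array([1.0, 0.0, 0.0])
    R_push = np.column_stack([x_s, n_push, np.cross(x_s, n_push)])
    R_err = R_push @ np.asarray(R_c, dtype=float).T
    omega = k_w * so3_log(R_err)  # rotate {c} toward the pushing direction
    v = k_v * err                 # push along n_push, scaled by the error
    return v, omega
```

Because the law is purely proportional, the commanded velocity shrinks as the error shrinks, which is consistent with a residual settling error when friction or other unmodeled effects are present.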

B. Haptic Interface for Motion Guidance
This section highlights the use of TacLink as a haptic interface for intuitively guiding the motion of the robot arm [see Fig. 15(a)], where we mapped tactile actions, including single/multipoint push and stroke, into a desired robot twist ${}^{b}V_d \in \mathbb{R}^6$. For single-point push actions at contact location $x^c$ [see (18)], the estimated contact depth $d(x^c)$ [see (19)] and the normal direction $n(x^c) := N_{i^*}$ [see (13)] are encoded into the spatial velocity on the $\hat{x}_b\hat{y}_b$ plane of the end-effector frame {b}, with a constant $k_{v_d}$ that appropriately scales the resulting linear velocity. In addition, we employed distinguishable two-point contact as an interface for instructing rotational motion, wherein a virtual pivot point ${}^{b}r_c$ was placed at the center of TacLink, about which the rotational motion around the axes of the {b} frame is defined. Since, for simplicity, the push direction was restricted to the normal of a contact plane, we neglected the rotation/twist around the z-axis ($w_z = 0$), as well as the linear motion along it ($v_z = 0$). However, a linear velocity along the z-direction could be induced by detecting the stroke action (SA). For robust stroke detection, we introduced a fixed time window $T_w = W \cdot \Delta t$ (where $W$ is the window size), during which possible sliding motion on the skin surface is evaluated at a fixed interval $\Delta t$. At each time step $\Delta t$ in the time window $T_w$, we measured the distance traveled by the contact position, and the action is classified using a classification ratio $\eta$, experimentally set to 0.3. When a stroke occurs, from $t \geq T_w$, the linear velocity along the z-axis is encoded from $x_{t+\Delta t}^{c,z}$ and $x_t^{c,z}$, the z-coordinates of $x_{t+\Delta t}^c$ and $x_t^c$, respectively; in addition, $k_{\omega_d}$ is a constant scale of the angular velocity. Finally, the resultant desired twist ${}^{b}V_d = [v_x, v_y, v_z, w_x, w_y, 0]^T$ was mapped to the commanded joint velocity $\dot{\theta}$ through the Jacobian ${}^{b}J$ at the end-effector.
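A windowed push/stroke classifier of this general kind can be sketched as below. The decision rule is an assumption made for illustration, since the exact criterion is not reproduced here: a step counts as "moving" when the contact position travels more than a hypothetical `min_step` distance between consecutive samples, and the window is labeled a stroke when the fraction of moving steps reaches the ratio `eta`. The class name, `min_step`, and its unit are hypothetical; `window`, `dt`, and `eta` mirror $W$, $\Delta t$, and $\eta$.

```python
from collections import deque

import numpy as np


class StrokeDetector:
    """Classify a contact as 'push' or 'stroke' over a sliding time window."""

    def __init__(self, window=8, dt=0.03, eta=0.3, min_step=1.0):
        self.window = window      # number of intervals W in the window
        self.dt = dt              # sampling interval in seconds
        self.eta = eta            # classification ratio
        self.min_step = min_step  # hypothetical per-step motion threshold
        self.positions = deque(maxlen=window + 1)

    def update(self, x_c):
        """Add one contact position; return (label, z-rate of the last step)."""
        self.positions.append(np.asarray(x_c, dtype=float))
        if len(self.positions) < self.window + 1:
            return "pending", 0.0  # window not yet filled: W * dt of delay
        pts = np.stack(self.positions)
        steps = np.linalg.norm(np.diff(pts, axis=0), axis=1)
        moving_ratio = float(np.mean(steps > self.min_step))
        if moving_ratio >= self.eta:
            # Rate of change of the z-coordinate over the last interval.
            rate_z = (pts[-1, 2] - pts[-2, 2]) / self.dt
            return "stroke", float(rate_z)
        return "push", 0.0
```

With the defaults, the first decision is only available after eight intervals of 0.03 s, i.e., 0.24 s, which illustrates why a fixed window introduces a response delay. The returned z-rate would still need to be multiplied by a gain to obtain a velocity command.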
We validated the proposed interface scheme in several experiments, including various contact actions and scenarios. The control parameters are summarized in Table II. In the first showcase [see Fig. 15(c)], a stroke performed by a human digit yielded relatively sharp changes in the contact depth profile, while stable intensity was observed with a push action. The linear velocity along the z-axis, resulting from the SA, was linearly scaled with the rate of change of the contact position over the duration $\Delta t = 0.03$ s, while the robot motion along the other two axes was triggered by the push action, with the speed and direction depending on the contact depth and location, respectively [see Fig. 15(b)]. Note that since the push/stroke classification requires a window of size $W = 8$ to execute, the robot's response is delayed by at least 0.24 s, i.e., the time window $T_w$. In addition, the delay might increase due to misclassification between the single- and two-point contact scenarios. While this problem could be addressed by more advanced classification algorithms (e.g., machine learning techniques), the delay may also have the positive effect of making users feel safer in human-robot interactive tasks.
Furthermore, in the two-point contact condition [see Fig. 15(d)] with its specific pairs of contact positions [see Fig. 15(b)], the robot rotated around either $\hat{x}_b$ or $\hat{y}_b$. This showcase of a haptic interface for motion guidance based on our large-scale tactile device is expected to provide initial hints for more action-based human-robot interaction strategies. The demonstration of robot motion guidance and other applications of TacLink to safety control can be found in the video.

VII. DISCUSSION

A. Skin Geometry Affects Scalability and Extensibility
The showcase of SimTacLS in this article is based on the large-scale TacLink sensor, whose shape and size are unique and differ considerably from those of typical ViTac sensors reported in the literature. Importantly, the working principle of SimTacLS does not depend on skin size or shape. Since this article mainly focuses on the applicability of SimTacLS to large-scale sensors, an in-depth discussion of this claim is essential.
Theoretically, skin geometry poses no restriction on the SOFA simulation tool as long as the material properties and boundary conditions of the skin are accurately given. Similarly, the Gazebo module does not depend on skin geometry but on lighting conditions and on the extrinsic and intrinsic parameters of the camera system. We posit that if the skin geometry allows visual cues to remain visible in tactile images at each tactile interaction phase, then SimTacLS is applicable. For example, owing to the nature of the TacLink skin, where markers are vertically aligned [see Fig. 2(a)], markers in the middle region are very susceptible to occlusion, especially during the contact phase. An adverse effect of this phenomenon on sensing performance was confirmed in [11] for a cylindrical skin and was expected to be more detrimental for a barrel-shaped body. However, existing ViTac skins on bodies with flat or curved designs possess markers that are horizontally distributed all over the image plane. As a result, the effect of missing visual cues is minimized. This discussion strengthens our hypothesis that SimTacLS is promising for ViTac systems of various sizes and shapes.
On the other hand, regarding markers, it is obvious that their morphology and density determine the resolution and accuracy of the obtained tactile information. Nonetheless, it is still unclear whether marker distribution or even marker design affects the overall performance of TacNet. Since SimTacLS is designed as a step toward a unified platform for acquiring data and training online, it is worth investigating the morphological design of markers and their related meshing in SimTacLS to accelerate the process without compromising sensor accuracy. While the data acquisition process of the SimTacLS platform has so far been implemented offline, the computation should be accelerated as far as technology allows, for example by multithreading or GPU-based computing, to meet the potential requirements of real-time applications.
Regarding the bandwidth of the proposed sensing system, it depends on the required processing time and the mechanical properties of the soft skin. Since the trained networks are compact in size, real-time processing is not a problem, as demonstrated in our previous analytical method [11]; therefore, the sampling rate depends on the frame rate of the camera. In this research, we could implement a 120-Hz sampling rate with the 120-frames/s cameras on TacLink. Therefore, the bandwidth of the sensor is determined mostly by the mechanical properties of the soft skin. For the TacLink, the stiffness of the skin can also be varied by the inner pressure value; thus, it is expected that the bandwidth of the sensor could be adjusted online, given pretraining in SimTacLS.

B. Open Problems
1) Force Detection: In this article, SimTacLS was exploited to assess skin deformation resulting from contact occurring over the whole body because such information can reveal features of tactile perception (contact location, size of contact area, vibration, etc.) at large scale. Human mechanoreceptors cannot convey in detail how much force is acting on the skin. In addition, large-scale sensing is usually aimed at human-robot interactions rather than task-based ones, where force information is deemed redundant. On top of that, with a view to a simulation framework for interactive robotic systems, TacNet was designed to be easily adaptable to physical attributes other than nodal displacements (skin deformation), especially contact forces. In the future, for the physical formulation of interactive control problems, we aim to replace the current output signals of TacNet with nodal forces (which can be extracted from SOFA-based simulation), from which multicontact forces and locations on a large-scale skin can be effectively inferred.
In fact, contact force information [λ in (1)] modeled from the SOFA kernel could be targeted to train TacNet models in which the same proposed sensing methods can be applied to extract high-level perception. Note that this process only requires the additional collection of contact forces and the pretraining of a TacNet model, but without any further change in the proposed pipeline.
2) Two-Point Touch Discrimination: As shown in Section V-E, TacLink could best detect two contact points (aligned vertically) separated by a distance of 140 mm, which is an acceptable sensing behavior for a whole-arm ViTac device with a very soft skin. Regarding human touch acuity, large body parts, such as the arms or torso, have a low spatial acuity of around 45 mm, while the two-point touch threshold of fingers is about 2-3 mm [42]. In fact, the two-point touch threshold of the present sensing device may be adjusted by varying the skin morphology, such as by changing the skin material (e.g., stiffening) or increasing the air pressure in the enclosed skin. We may expect a shorter detectable two-point distance as the skin becomes stiffer, which reduces the number of deflected nodes under the same acting force [expressed through (3)]. In addition to a mechanical solution for adaptable sensing behavior, it is possible to enhance two-point spatial acuity by using contact forces, which are represented by the Lagrange multipliers λ [see (1)], as the input to the CRL algorithm instead of nodal displacements. Thereby, with the contact forces inferred by TacNet (trained on the force labels obtained from the SOFA kernel), contact regions would be narrowed down to contain only the nodes in physical contact with the external environment.
3) Applications: In this article, we showcased the use of the TacLink device in task-based interactions, including object pushing, motion guidance, and contact detection/reaction, as highlighted in Section VI; we argue that these tasks are infeasible with existing small-scale tactile sensors. These preliminary demonstrations are also expected to lay the groundwork for more sophisticated tasks based on tactile sensing over a large area, such as haptic exploration/manipulation in cluttered environments [43] and robot learning by demonstration. Last but not least, the provided tactile information could be integrated with proven high-level controls for other robot systems, such as mobile robots, beyond the robot manipulator presented in this article.

C. Novelty
1) Large-Scale Tactile Sensing Problems: Even though previous works, such as TACTO [14] and Tactile Gym [15], could simulate ViTac sensors on bodies of different shapes (e.g., hemispherical, cylindrical, or flat), the feasibility of applying these simulators to large-scale sensors remains questionable. First, the operation of GelSight-like sensors strongly depends on the gel layer and the reflective lighting environment, which makes it challenging to set up suitable lighting conditions over a large area. Second, since external objects often come into contact with the large-scale TacLink in a direction perpendicular to the optical axis of a camera, there is a high possibility that two different contact locations, individually or in combination, yield imperceptible depth-based images. Hence, virtual depth-based images might provide insufficient and inaccurate tactile information in the context of large-scale ViTac soft sensors. Because of these possible problems as tactile devices are scaled up, we argue that a more realistic simulation platform with high-fidelity soft-body interaction and realistic marker rendering, such as SimTacLS, is essential and worth investigating for accurate tactile sensing at a large scale.
2) Task Transferring Schemes: As reported in [15], Tactile Gym trained tactile-driven tasks, such as edge/surface following and object rolling, with depth imprint images as input through reinforcement learning frameworks, which coupled tactile sensing with task performance. Thus, training a main task in this coupled manner may require setting up a new environment or retraining from scratch whenever a new task is defined (because training losses are defined on a per-task basis). In contrast, SimTacLS decouples the sensing problem from the desired end-task performance; thus, transferred tactile information, rather than tactile images, is used as tactile feedback for control tasks. By doing this, we could focus on integrating the transferred tactile information into potential or novel tactile-driven tasks.

VIII. CONCLUSION
In this article, we presented a pipeline named SimTacLS for the simulation and training of a ViTac sensor over a large area, taking into account the compliant contact mechanics of the skin, together with actual showcases in robotics. The pipeline offers rich tactile information, particularly skin deformation, for learning sensing skills for a large-scale TacLink device. We demonstrated that a tactile neural network (TacNet), learned from the obtained simulation dataset, could provide high-level tactile perception (i.e., contact detection and localization) with the potential to benefit robotics tasks. In comparison with other tactile sensors (of different sensing principles), our system offers large-area sensing with a simple setup and the least influence on the mechanical properties of the soft skin (no embedded sensing elements). Meanwhile, the proposed system requires a large amount of data for training, which may increase the implementation time and cost. On top of that, the pipeline opens possibilities for transferable learning of robotics tasks in virtual environments and leaves room for scaling to a broader range of ViTac devices of diverse shapes and sizes. In the future, the application of the proposed system will be elaborated further toward a holistic approach for implementing robotic scenarios based on large-area tactile sensing.