Review of Bioinspired Vision-Tactile Fusion Perception (VTFP): From Humans to Humanoids

Humanoid robots are designed and expected to resemble humans in structure and behavior, and they show increasing application potential in various fields. As for their biological counterparts, environmental perception is a fundamental ability. In particular, visual and tactile perception are the two main sensory modes that humanoids use to understand and interact with the environment. Vision-Tactile Fusion Perception (VTFP) has shown multiple possibilities for better sensing and understanding in challenging conditions, raising new research interest and questions. The overlap between visual and tactile perception in humanoids is continually growing. This work reviews the current state of the art of VTFP. It starts with the physiological basis of the biological vision and tactile systems as well as the VTFP mechanisms that serve as inspirations for humanoid perception. Then, the bioinspired visual-tactile fusion systems for humanoids are reviewed as the emphasis. After a survey of the vision and tactile sensors of robots, seven currently publicly available VTFP datasets are introduced. They are the data sources for several studies on neural network-inspired fusion algorithms. Furthermore, the applications of VTFP on humanoids are summarized. Finally, the challenges and future work are discussed. This review aims to provide a reference for further exploitation of VTFP and its applications on humanoids.


I. INTRODUCTION
HUMANOIDS are robots that simulate the human structure. Tremendous demand in application areas, for example, elderly care, direct-contact control during exceptional situations such as COVID-19, and human-robot interaction [1], is accelerating their development. Compared with traditional robots, humanoids should have at least three indispensable elements: (1) sensing for environment perception [2], (2) thinking for decision-making [3], and (3) execution for environment interaction [4]. Environmental perception is the most fundamental element of humanoids, just as it is for human beings. Human beings apply five senses (sight, touch, hearing, smell, and taste) to respond to environmental stimuli and to collect perception information. These senses have also been used for robots, especially vision and touch. Robot vision is a fast-advancing field that enables robots to obtain visual information (including size, shape, color, and brightness) for various tasks, such as Visual Simultaneous Localization and Mapping (VSLAM) [5], [6], visual servo grasping [7], and visual navigation [8], [9]. Robot tactile perception is indispensable for tasks such as stable grasping [10], item classification [11], [12], and contact force control [13], [14], using interaction information such as contact texture, object weight, material compliance and interface temperature. Situations with poor or unstable illumination, or the recognition of large objects, require multidimensional information about the environment from both visual and tactile perception [15], i.e., Vision-Tactile Fusion Perception (VTFP). The overlap between visual perception, tactile perception, and robotics is continually growing, and recent advances are summarized in Fig. 1. The hierarchical functional and structural block diagram of the VTFP system is shown in Fig. 2, including the following:
(1) Sensors. The cores of sensors are sensitive elements that respond to stimuli and convert them into electrical signals.
(2) Information fusion methods. Vision-Tactile Fusion (VTF) applies various fusion algorithms to extract multidimensional visual and tactile information. Meanwhile, it is notable that datasets are important for fusion algorithms that are inspired by neural networks.
(3) Action. Actions offer vivid demonstrations of the VTFP results. On the other hand, actions influence the information collection for active perceptions.
Extensive works reviewing robot vision perception [31], [32] or tactile perception [33], [34] have been published. However, a review of robot VTFP was absent until very recently, when Shuo Gao and his colleagues [35] published a review on this topic, introducing the working mechanisms of tactile and visual sensing and their application in intelligent humanoids and discussing current challenges and future trends. While some topics addressed in this text overlap with the prior survey, the focus of this paper is different: it concentrates on additional topics, such as biological sensing systems, biologically inspired or mimicked vision and tactile sensors, and datasets and neural network-inspired fusion algorithms. Biological systems are the inspiration sources of various robotic engineering studies [36]. This paper surveys the state of the art of VTFP in humanoids with reference to their natural counterparts. We limit our review of visual sensors to cutting-edge biologically inspired sensors and contrast tactile sensors with biological sensors. Humans use their neural system and the brain to fuse multimodal sensing information for decision-making [37], [38], [39], [40]. Artificial intelligence algorithms are studied to learn such functions via neural networks [41], [42]. Thus, the sensing datasets and the fusing algorithms are important for VTFP. Efficient VTFP algorithms would undoubtedly advance the studies and applications of robot cognition, collaboration, and interactions.
[Figure caption; the opening is missing from the source: ... [16], [17], [18], [19], [20]. (Middle) A range of applications based on visual-tactile fusion perception, including object recognition, human-robot interaction, delicate manipulation and stable grasping, from left to right and top to bottom [21], [22], [23], [24], [25], [26], [27], [28], [29]. (Bottom) A range of visual sensors including cameras based on biological principles, as well as depth cameras, from left to right [21], [30]. Red boxes indicate the work that utilizes visual and tactile sensors for visual-tactile fusion dataset acquisition.]
As humanoids mimic the nature of humans, a survey considering their biological prototypes would be necessary for a systematic review study of the VTFP. On the other hand, revisiting their biological inspirations would certainly be beneficial for the development of vision-tactile fusion systems, which are progressing slowly.
In the Web of Science, Google Scholar, and IEEE Digital Library databases, a collection of 534 publications was found by keyword searches for VTFP. The abstracts of these publications were read to exclude irrelevant works. Duplicates were also excluded, because some works were indexed by more than one database and some appeared as both conference and journal publications. Survey papers and book chapters that focused on reviewing relevant work were further excluded. After this screening, 86 publications were read in detail. During the reading, 21 publications referenced by some of these 86 publications were also found to fit the topic of this review. Therefore, 107 publications in total were carefully studied in this work. Fig. 3 shows an overview of the publication filtering process. The number of papers published in the field of VTFP has generally increased every year. A total of 67.3% of these works (72 out of 107) are reported in the field of robotics, while 32.7% (35 out of 107) are in the biology area. VTFP is thus of interest to both robotic and biological scientists.
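The counts in the filtering process can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable and function names are illustrative.

```python
# Counts taken from the text; names are illustrative only.
read_in_detail = 86          # publications left after screening
found_via_references = 21    # additional publications found while reading
total = read_in_detail + found_via_references

robotics = 72
biology = 35

def share(part, whole):
    """Percentage of `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

print(total, share(robotics, total), share(biology, total))
```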
The paper is organized as follows: Section II introduces the physiological basis of the biological vision and tactile systems, followed by the biological VTFP mechanism. Section III surveys robot vision and tactile sensors with respect to biological systems, as well as VTFP datasets and neural network-inspired VTFP algorithms. Challenges and future works appear in Section IV. Section V gives a summary of this work.

II. BIOLOGICAL VISUAL AND TACTILE SENSING SYSTEMS
Vision and touch are the two main sensing modalities that humans utilize to collect environmental information. Studies on the biological sensing mechanisms of humans have inspired those of humanoids.

A. Visual and Tactile Perception
Vision plays an important role in human perception, providing more than 80% of the total amount of information that humans receive from the environment [43]. The human vision system collects environmental information through the eyes. This biological vision system is mainly composed of the retina [44], optic nerve [45], lateral geniculate nucleus (LGN) [46], visual cortex [47] and middle temporal region [48]. One of the main functions of the retina is the conversion of light signals into nerve signals [49]. The LGN [50] is located on the diencephalon and metathalamus and has brightness and color information processing abilities. The visual cortex [51] is a koniocortex located in the occipital lobe at the back of the brain. Through its ventral and dorsal streams, it is responsible for object recognition and motion control, respectively.
For the human body, tactile perception is the response of various tactile receptors in the epidermis and dermis to mechanical stimulation. Human skin mainly includes four kinds of mechanical stimulation receptors with different structural features and morphologies [52]: Meissner's corpuscles, Pacinian corpuscles, Merkel cells, and Ruffini corpuscles [53], [54]. Their functions are summarized in Table I. RA and SA represent rapidly and slowly adapting receptors, respectively. Type 1 and type 2 indicate small and large sensing areas of the receptors, respectively. The tactile receptors in skin tissue encode information about the object that is touching the skin and then transmit it to the brain. In other words, tactile perception is the feeling produced by the human cerebral cortex when the skin is stimulated by the external environment.

B. Vision-Tactile Fusion Mechanism
The human brain autonomously integrates information from a variety of senses to accurately judge and estimate the properties of the surrounding environment. The fusion of visual and tactile information is conducive to the perception and interaction of humans with the environment. Thus, complicated tasks can be completed more efficiently. For example, visual and tactile attention mechanisms are spatially dependent [55]. The detection time of a target by tactile perception can be reduced with the aid of visual information [56]. For texture detection, the combination of eye observation and finger touch works better than a single modal perception [57].
Human tactile perception plays an auxiliary role in the regulation of the visual cortex [37]. Visual modal information can improve the spatial resolution of tactile perception [39]. Visual motion information influences the final position perception of tactile stimuli [38]. The fusion of visual and tactile information can improve the human perception capabilities of the external environment [40]. These views are supported by the studies discussed below.
Macaluso et al. [37] reported that left-hemifield visual stimulation activates the right posterior part of the lingual gyrus and vice versa. Their bimodal stimulation experiment showed that right-side tactile stimuli enhanced the activation evoked by right-side visual stimuli while inhibiting the activation evoked by left-side visual stimuli. The experiments also showed that tactile sensation could regulate the visual cortex through back-projection from the association region in the parietal lobe. This back-projection mechanism might play an important role in the cross-modal association of spatial attention.
Kennett et al. [39] conducted a verification experiment to observe the direct influence of seeing the body on passive touch. When participants kept their gaze direction unchanged, the tactile spatial resolution was better when the arm was visible than otherwise. The tactile performance was further improved when the line-of-sight region of the arm was broadened. For human beings, the eyes use binocular disparity and perspective projection to estimate the shape of an object, while the hands judge the shape of an object through touch and proprioceptive cues. Hills et al. [40] demonstrated that the fusion of visual and tactile information could improve the estimation accuracy of object shapes. The fusion of different information from a single perception modality (for example, texture gradients and disparity from vision) weakened the overall information. However, this was not the case when the information came from different modalities, namely vision and touch.
Studies have been carried out on the multimodal sensory interactions that occur in the primary sensory cortices. Lunghi and Alais [58] induced competition between monocular inputs in the primary visual cortex by presenting incompatible visual signals (orthogonal gratings) to each eye, which caused ambiguous perceptual responses. During this binocular competition, a tactile signal matching one of the visual alternatives was presented. The tactile input affected visual signals outside visual awareness: their experimental results showed that, when there was a tactile signal input, the invisible stimulus suppressed by binocular competition returned to awareness sooner. Verhaar et al. [59] conducted a visual-tactile stimulus localization experiment among different age groups. Their results showed that responses were biased toward the location of the visual stimulus in all age groups. These findings suggested that, from a very early age, the human brain infers the possibility that tactile and visual cues have the same cause and uses this possibility as a weighting factor in visual orientation. Yang and Lu [60] conducted a judgment experiment on features such as object size by fusing visual and tactile information. They used functional magnetic resonance imaging (fMRI) to perform visual and tactile matching tests on volunteers and observed brain activities at the same time. Their study showed that visual and tactile sensations could be either compatible or incompatible. Saito et al. [61] used fMRI to study the neural representation of visual and tactile cross-modal matching of shape information in test subjects in order to explore the location of information fusion across sensory modalities. They conducted four experiments: tactile-tactile with eyes closed, tactile-tactile with visual input, visual-visual with tactile input, and tactile-visual.
The results showed that shape information from different sensory modalities might be fused in the posterior intraparietal sulcus during the visual and tactile matching tasks. Thurlings et al. [62] used event-related potential (ERP)-based brain-computer interfaces (BCIs) to observe the differences in the brain's responses for appraising and ignoring visual, tactile, and visual-tactile bimodal stimuli. They suggested that bimodal stimulus was more likely to lead to the enhancement of ERP components than visual or tactile stimulus alone, thus improving the performance of BCIs.

III. BIOINSPIRED VISUAL-TACTILE FUSION SYSTEMS FOR HUMANOIDS
As in their biological counterparts, visual and tactile perception play an important role in the sensing of humanoids. This section covers the sensors, datasets, and algorithms for information fusion.

A. Sensors and Systems
Sensors and systems convert physical stimuli into electric signals. Various types of sensors have been developed. Here, we focus particularly on biologically inspired ones.
1) Vision Sensors: Kramer of ETH Zurich [74] and Zaghloul and Boahen of the University of Pennsylvania [75] proposed the concept of the dynamic vision sensor (DVS) in 2002. Lichtsteiner et al. [76] from the Institute of Neuroinformatics in Zurich proposed the first improved DVS. Its working mechanism was similar to that of the human retina. This type of visual sensor was event-driven rather than clock-driven: it responded to events that occurred within the visual range, achieving more uniform event outputs and effectively improving the dynamic range. Lichtsteiner et al. [76] developed the first commercial DVS128 with a sampling frequency of 10^6 Hz and a spatial resolution of 128 × 128 for target recognition and tracking. IBM's brain-inspired chip TrueNorth [77] used a DVS128 vision sensor for gesture recognition tasks [78]. In 2017, Samsung [79] developed a DVS-G2 vision sensor with a spatial resolution of 640 × 480 and a data rate of 300 Meps (million events per second) for unmanned aerial vehicles and autonomous vehicles.
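The event-driven principle can be illustrated with a small simulation. The sketch below is not the circuit of any sensor cited above: it assumes a commonly used simplified model in which a pixel emits an ON or OFF event whenever its log intensity has changed by more than a contrast threshold since its last event. The function name `dvs_events`, the threshold value and the frame-based input are assumptions made for illustration; a real DVS works asynchronously per pixel rather than on frames.

```python
import numpy as np

def dvs_events(frames, timestamps, threshold=0.2, eps=1e-6):
    """Emit (t, y, x, polarity) events when the log intensity of a pixel
    changes by at least `threshold` since that pixel's last event."""
    frames = np.asarray(frames, dtype=float)
    ref = np.log(frames[0] + eps)            # per-pixel reference log intensity
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_i = np.log(frame + eps)
        diff = log_i - ref
        fired = np.abs(diff) >= threshold
        for y, x in zip(*np.nonzero(fired)):
            polarity = 1 if diff[y, x] > 0 else -1   # ON (+1) or OFF (-1) event
            events.append((t, int(y), int(x), polarity))
        ref[fired] = log_i[fired]            # update the reference only where events fired
    return events

# A static 2 x 2 scene in which one pixel brightens at t = 1.
frames = np.ones((3, 2, 2))
frames[1:, 0, 1] = 2.0
print(dvs_events(frames, timestamps=[0, 1, 2]))
```

Pixels that do not change produce no output at all, which is the source of the low data redundancy of this sensor family.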
The asynchronous time-based image sensor (ATIS) [80] added a light intensity measurement mechanism to the basic structure of the DVS to realize image reconstruction. Its light intensity measurement circuit started to work upon an event generated by the DVS circuit. Posch et al. [81] developed a commercial ATIS in 2011. This ATIS had a sampling frequency of 10^6 Hz and a spatial resolution of 304 × 240. Prophesee and Intel [82] further developed a self-driving car based on ATIS.
Brandli et al. [83], [84] developed a dynamic and active pixel vision sensor (DAVIS) in 2014. It was designed by adding the active pixel vision sensor to the DVS for texture imaging. Therefore, it had all the advantages of DVS and Active Pixel Sensors (APSs) at the pixel level. Moeys et al. [84] developed a DAVIS346 sensor with a sampling frequency of 10^6 Hz and a spatial resolution of 346 × 260 in 2018. At present, DAVIS is a mainstream bioinspired vision sensor used in many commercial products and academic research, mainly including the DAVIS240, DAVIS346 and color DAVIS346 models [83].
Dong et al. [85] from Peking University developed the first Vidar vision sensor in 2017. It output 476.3 MB of data per second with a sampling frequency of 4 × 10^4 Hz and a spatial resolution of 400 × 250. Vidar [85], [86] consisted of an integrator circuit, a comparator circuit, and a photoelectric conversion circuit, corresponding to the bipolar cells, ganglion cells, and photoreceptors of the retina in the biological vision system. Since Vidar used an integral visual sampling model to convert the light intensity signals into pulse signals, it could reconstruct details better than differential sensors such as DVS, ATIS, and DAVIS. However, Vidar generated pulse outputs regardless of the visual scene, causing redundancy in the amount of sampled data. The details of the above bioinspired vision sensors are summarized in Table II.
2) Tactile Sensors: Tactile perception has been a focus of robotic studies due to its physical contact sensing capabilities. Current robot tactile sensors generally fall into two categories, flexible tactile sensors and modular hard tactile sensors, which focus particularly on material flexibility and multi-stimuli sensing capability, respectively. Similar to biological skin, the soft and flexible characteristics of flexible tactile sensors enable them to attach compliantly to various robot surfaces while hardly affecting the robots' movements. They also offer a soft interface between the robot and the environment, protecting robots from abrupt collisions. Modular hard tactile sensors offer signal stability and easy access by integrating many types of sensors, mimicking the presence of many mechanical stimulation receptors in human tactile sensing skin.
The human skin structure has been an inspiration for a vast number of tactile skin designs. Multiple layers are often adopted to achieve the desired sensing capability, performance and application. In [110], Nassar et al. built a 6 × 6 artificial paper skin through the superposition of three layers of sensor networks with pressure, temperature, humidity, proximity, pH, and flow sensing abilities. In [105], Lee et al. designed a 10 × 10 stretchable cross-reactive sensor matrix. This skin showed high sensitivities and fast responses to diverse stimuli, such as strain, pressure, flexion, and temperature. In [111], Lei et al. proposed a multifunctional and mechanically compliant artificial intelligence skin by adding stimuli-responsive hydrogels to a capacitive circuit. This skin had high pressure sensitivity and a stable capacitance temperature response. It thus could perceive gentle finger touches and bending motion. In [112], Li et al. proposed four tactile sensors composed of multilayer microstructures inspired by the human skin. The robot hand integrated with this skin could independently perceive the environment temperature and object temperature to realize accurate object recognition. In [113], Zhang et al. designed a multifunctional tactile sensor by integrating a hair sensor and a skin sensor through Co-based ferromagnetic microwire arrays. This sensor was inspired by the structure of human hairy skin and could adjust autonomously in response to external stimuli. Inspired by the epidermal and outer microstructures of the human fingerprint, Cao et al. [114] integrated materials such as polyethylene, single-walled carbon nanotubes and polydimethylsiloxane to construct a flexible tactile sensor. Chen et al. [115] also built a novel electronic skin system inspired by the tactile properties of the human fingertip.
It consisted of a subcutaneous fat-inspired fabric-based porous supercapacitor, a fingerprint-inspired triboelectric generator, and an epidermal-dermal inspired hybrid porous microstructure pressure sensor. This sensor had high sensitivity and could detect pressure, sliding speed and direction simultaneously. In [116], Lee et al. designed a flexible electronic skin with very high piezoresistive sensitivity at low power. This skin was inspired by the hierarchical and gradient mechanical structure of the biological skin system, enabling acoustic detection and subtle tactile manipulation of objects.
As the number of sensors increases, data processing becomes a challenge for large-scale tactile skins. The sensory receptors in human skin encode tactile information as a time interval between voltage spikes of action potentials. Bioinspired data processing studies have been conducted on artificial receptors. Chun et al. [117] introduced a self-powered mechanoreceptor, which integrated a piezoelectric film and an artificial ion channel with high sensitivity and a broadband stimulus detection function. Such mechanoreceptors could simultaneously realize fast adaptive (FA) and slow adaptive (SA) pulses similar to the human skin. Tee et al. [118] proposed a tactile sensor integrated with a pressure-sensitive foil and a printed ring oscillator. This sensor could convert pressure into a digital signal with a sensing range comparable to that of human skin. Furthermore, Lee et al. [119] introduced a human neuromimetic architecture to an electronic skin, inspired by asynchronous coding. This skin achieved fine spatiotemporal feature addressing for the fast tactile perception of an array of more than 10,000 sensors. Li et al. [120] proposed an artificial mechanoreceptor with tactile signal coding capability. It was composed of a polypyrrole-based resistive pressure sensor and a volatile NbOx memristor to simulate the tactile perception of human skin. Chun et al. [121] proposed an artificial neural tactile skin system using particle-based polymer composite sensors and signal conversion systems. This skin could simulate the human tactile recognition process. It was similar to the SA and FA mechanoreceptors in human skin and could be used for texture prediction. Zhu et al. [122] proposed a pressure sensing device that could retain relevant information after the external pressure was removed, imitating the tactile memory of human skin. In [123], Kim et al. proposed a bioinspired wearable electronic device. It consisted of a stretchable capacitive pressure sensor, a resistive random-access memory, and a quantum dot light-emitting diode, corresponding to the artificial mechanoreceptor, artificial synapse, and epidermal photonic actuator of a biological system.
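The idea of encoding pressure as spike timing, with distinct SA-like and FA-like responses, can be sketched with a simple integrate-and-fire model. This is an illustrative toy model and not the circuit of any device cited above: the SA unit is assumed to integrate the pressure itself, the FA unit is assumed to integrate the rate of change of pressure, and all names, gains and thresholds are hypothetical.

```python
import numpy as np

def spike_times(drive, dt, threshold=1.0):
    """Integrate-and-fire: accumulate `drive` and emit a spike time at each threshold crossing."""
    v, times = 0.0, []
    for k, d in enumerate(drive):
        v += d * dt
        if v >= threshold:
            times.append(k * dt)
            v = 0.0  # reset after each spike
    return times

def encode_sa_fa(pressure, dt=0.001, sa_gain=50.0, fa_gain=5.0):
    """SA-like unit follows the sustained pressure; FA-like unit follows its changes."""
    pressure = np.asarray(pressure, dtype=float)
    change_rate = np.abs(np.diff(pressure, prepend=pressure[0])) / dt
    return {
        "SA": spike_times(sa_gain * pressure, dt),
        "FA": spike_times(fa_gain * change_rate, dt),
    }

# A pressure step applied at t = 0.1 s and held until t = 1.0 s.
step = np.concatenate([np.zeros(100), np.ones(900)])
spikes = encode_sa_fa(step)
print(len(spikes["SA"]), len(spikes["FA"]))
```

In this model a stronger sustained pressure shortens the interval between SA spikes, while the FA unit responds only at the onset of contact, which mirrors the interval-based coding described in the paragraph above.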
The modular configuration is an acceptable solution to cover the entire irregular surface of robots, similar to the skin of the human body. The early electronic skins for robots [124], [125] were large-area sensor arrays with data processing capabilities covering the large surface of a robot. Someya et al. [126] proposed a flexible, stretchable and bendable sensor array with pressure and temperature sensors for the tactile perception of robots. Asfour et al. [127] applied modular force sensors to cover the shoulders and arms of the ARMAR-III robot. Maiolino and his colleagues [128], [129] utilized the RoboSkin with 200 force sensors to cover the surface of a Nao robot. In [24], Mukai et al. successfully established a modular tactile sensing system on the RI-MAN robot, enabling human-robot interaction, such as lifting a dummy. In [130], Iwata and Sugano developed the TWENDY-ONE robot, with an electronic skin of tactile sensors distributed on its arms, palms and body. In [22], Cheng et al. proposed a modular robot skin system, which provided human-like skin cells to cover the robot's surface and could effectively process environmental perception data and trigger corresponding actions.
Event-based signaling was also adopted by the modular tactile skins, like the biological mechanoreceptors. Bergner et al. [131] developed an event generation algorithm for multimodal skin cells and introduced the implementation of event-based signaling for the robotic skin. In [132], Bergner et al. also proposed a multimodal event-driven electronic skin system for robots, which was a large-scale modular tactile sensor system. It enabled robots to achieve efficient tactile perception. Therefore, the skin could be fully integrated with a robot without additional external power or data processing.
There is also a special type of tactile sensor using optical or visual means to achieve tactile perception. Their force sensing is mediated by the deformation of soft materials, which is similar to the human skin's deformation under force. Adelson and his colleagues from MIT proposed GelSight [16], [133], which obtained the contact surface information through a piece of transparent rubber with a metal coating on one side and then reconstructed the 3D image of an object. In [134], Facebook proposed a tactile sensor called DIGIT, which was inexpensive, compact and high in resolution. It was a miniaturized design based on GelSight and was mountable on multi-fingered hands. Duong and Ho [135] from the Japan Advanced Institute of Science and Technology proposed TacLINK with a similar sensing mechanism. They installed two coaxial cameras, one at each end of a robot arm, to form a stereo camera, which enabled the 3D position calculation of all markers in the global coordinate system. They also constructed IoTouch [18] using fish-eye cameras to track the white markers on the inner wall of the skin. Winstone et al. [19] from the University of Bristol introduced TACTIP. It replicated the papillae of human skin through visual observation of the biomimetic subdermal structure [136]. The function of the internal camera was similar to that of the mechanoreceptors in human skin, which could be activated by the movement of the papillae pins.

B. Datasets
With the advent of the era of artificial intelligence and big data, an increasing number of studies depend heavily on datasets. Publicly available datasets are favored by many researchers since they facilitate the evaluation and comparison of theoretical research. The visual-tactile data acquisition process is shown in Fig. 4. This paper reviews the seven most used public visual-tactile joint datasets: BiGS [137], ViTac [138], PHAC-2 [139], Multimodal Grasp Dataset [140], TUM Haptic Texture Database [141], GelFabric [142] and ObjectFolder 2.0 [143]. The summaries of these datasets are shown in Table III.
1) BiGS: Chebotar et al. [137] from the University of Southern California, USA, established a grasp stability dataset based on the Vicon system and the BioTac tactile sensor provided by SynTouch LLC. The dataset contains 1,000 records of grasping experiments on three types of objects: balls, boxes, and cylinders. Successful and failed grasps account for 54% and 46% of the labels, respectively. Bednarek et al. [144] conducted grasp classification experiments on the BiGS dataset to compare the performance of four multimodal fusion algorithms: late fusion, MoE, intermediate fusion and LMF. Rouhafzay et al. [145] retrained the convolutional neural network on the successful cases of the BiGS dataset and proposed a hybrid framework based on MobileNetV2. Results showed that their deep convolutional neural network pretrained on visual images could be effectively transferred to the tactile dataset for classification tasks.
2) ViTac: Luo et al. [138] at MIT built the ViTac Cloth Dataset. It contains visual and tactile images of 100 pieces of everyday clothing. One thousand fabric images and a total of 96,536 fabric tactile images were collected by a Canon T2i SLR camera and a GelSight tactile sensor, respectively. The dataset was established to first fuse and share the visual and tactile characteristics of different fabrics and then to improve the accuracy of fabric texture recognition tasks. Rouhafzay et al. [145] selected 12 kinds of tactile data from the ViTac dataset to retrain and fine-tune their pretrained deep convolutional neural network to ensure the quality of transfer learning. Lee et al. [146] proposed a cross-modal sensory data generation framework using a conditional generative adversarial network to generate pseudovisual data from tactile data or pseudotactile data from visual data.
3) PHAC-2: Researchers [139] from the University of Pennsylvania and the University of California, Berkeley jointly established this haptic adjective dataset. The dataset contains visual images and tactile signals of 53 common household items. The tactile signals of each object were collected by a pair of BioTac sensors mounted on a PR2 gripper. The visual images were captured by a camera from eight different directions. Each object has 24 tactile adjective tags (for example, soft or rough). Chu et al. [139] developed several machine learning algorithms for human-robot interaction studies on this dataset to understand the meaning of tactile adjectives from the perspective of a robot. Similarly, Bednarek et al. [144] performed the tactile adjective label classification task based on the dataset to compare the performance of four multimodal fusion algorithms of late fusion, MoE, intermediate fusion and LMF.
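Two of the fusion strategies named in the comparisons above, late fusion and intermediate fusion, differ in where the modalities are combined. The sketch below is a minimal NumPy illustration under simple assumptions (linear classifiers, probability averaging for late fusion, feature concatenation for intermediate fusion); it does not reproduce the architectures evaluated in [144], and all function names and shapes are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(visual_logits, tactile_logits):
    """Each modality is classified on its own; the class probabilities are then averaged."""
    return 0.5 * (softmax(visual_logits) + softmax(tactile_logits))

def intermediate_fusion(visual_feat, tactile_feat, weights, bias):
    """Modality features are concatenated and classified by one shared linear layer."""
    joint = np.concatenate([visual_feat, tactile_feat], axis=-1)
    return softmax(joint @ weights + bias)

rng = np.random.default_rng(0)
n_samples, n_classes = 4, 3
p_late = late_fusion(rng.normal(size=(n_samples, n_classes)),
                     rng.normal(size=(n_samples, n_classes)))
p_mid = intermediate_fusion(rng.normal(size=(n_samples, 5)),   # visual features
                            rng.normal(size=(n_samples, 6)),   # tactile features
                            rng.normal(size=(11, n_classes)),  # untrained weights
                            np.zeros(n_classes))
print(p_late.shape, p_mid.shape)
```

Late fusion keeps the two unimodal classifiers independent, whereas intermediate fusion lets a single classifier exploit correlations between visual and tactile features.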

4) Multimodal Grasp Dataset:
Robot dexterous hand manipulation has always been a research hotspot in the field of artificial intelligence. In order to further study the stable grasping method of robots, Intel Labs China and Tsinghua University constructed a multimodal grasping dataset [140] of 10 different objects based on the Eagle Shoal robot hand. The dataset consists of 2,550 groups of valid data. The visual data of the object were collected by the RealSense depth camera, while the tactile data were collected by a 16-channel tactile sensor. The short-time Fourier transform [147] was used to evaluate the quality of the dataset, and slip detection experiments were conducted on it using the long short-term memory (LSTM) network [148] and traditional classifiers.
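The short-time Fourier transform gives a time-frequency view of a tactile channel. The following is a minimal NumPy sketch of the transform itself, assuming a Hann window and fixed frame and hop lengths; it is not the evaluation procedure used for the dataset, and the function name, parameters and synthetic test signal are illustrative.

```python
import numpy as np

def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: Hann-windowed frames followed by a real FFT.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Synthetic stand-in for one tactile channel: a sinusoid with 8 cycles per frame.
n = np.arange(256)
spectrogram = stft_magnitude(np.sin(2 * np.pi * 8 * n / 64))
print(spectrogram.shape, spectrogram.argmax(axis=1))
```

Each row of the result is the spectrum of one time window, so changes in the frequency content of a tactile signal over the course of a grasp appear as changes between rows.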

5) TUM Haptic Texture Database:
Strese et al. [141] from the Technical University of Munich established the haptic texture database. The TUM dataset collects texture images and tactile acceleration trajectories of surface materials from foam, fiber, rubber, stone, wood, net, light, textile, paper and fabric. Each material sample has 10 texture images and 10 tactile acceleration trajectories [149]. Each category contains roughly 5 to 17 samples. Each training and test set includes 108 surface material texture samples. Zheng et al. [150] used this dataset to compare their proposed framework with seven state-of-the-art frameworks, namely CCA [151], KCCA, Cluster-CCA [152], WMCA, DCCA, DCCAE and DAML [153], in order to verify their visual-tactile cross-modal retrieval framework based on discriminant adversarial learning. Zheng et al. [154] also proposed a cross-modal learning algorithm for material perception based on a deep extreme learning machine (ELM) on this dataset. Deep ELM was an algorithm that could efficiently learn high-level features from the input raw data as well as low-level features.
6) GelFabric: Yuan et al. from MIT [142] established another fabric perception dataset called GelFabric. It contains 119 kinds of fabrics, such as polyester, satin, knit, curtain cloth, terry cloth, burlap, oilcloth, and other functional fabrics. Ten color images and 10 tactile images for each fabric were collected by the Canon T2i SLR and the GelSight tactile sensor. The size of the visual and tactile images was manually adjusted to 224 × 224. Zhang et al. [155] proposed and verified a local visual-tactile fusion algorithm for robot object recognition on this dataset.
7) ObjectFolder 2.0: Gao et al. [143] from Stanford University and CMU built a multisensory dataset, ObjectFolder 2.0. It was augmented based on ObjectFolder 1.0. ObjectFolder 2.0 consists of visual, tactile, and auditory data for a large number of common household objects. It contains 1,000 implicitly represented objects, each of which contains a complete multisensory profile of the real object. Gao et al. [143] virtualized each object by encoding its intrinsic properties (texture, material type and 3D shape) with an Object File implicit neural representation. Furthermore, they conducted experiments with this dataset on the tasks of object scale estimation, contact localization and shape reconstruction. Results demonstrated that the use of this dataset could effectively reduce the differences between simulation and reality.

C. Algorithms
Many studies [157], [158], [159], [160], [161] have shown that when humans perceive physical information from the external environment, the brain shares and merges the information collected by different sensory organs. Many researchers, both domestic [15], [21], [28], [140], [150] and international [16], [138], [142], [162], [163], have also studied the fusion of visual and tactile information and the associated fusion algorithms. Visual and tactile fusion algorithms can be roughly divided into two categories according to their data fusion strategy: indirect and direct fusion methods. The former is a generalized fusion of visual and tactile information that builds on previous unimodal perception results; the information of the two modalities exists independently and only plays a mutually complementary role. The latter fuses the information of the two modalities at the data level, especially with neural network-inspired algorithms.
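The difference between the two categories can be sketched in a few lines. In this minimal illustration, `indirect_fusion` lets vision decide first and consults touch only when vision is unsure, while `direct_fusion` concatenates the features of both modalities and classifies the joint vector. All names, the confidence threshold, the score dictionaries and the nearest-centroid classifier (used here as a simple stand-in for a neural network) are assumptions for illustration and do not reproduce any of the cited methods.

```python
def indirect_fusion(vision_scores, tactile_scores, threshold=0.7):
    """Decide from vision alone; use touch as a supplement only if vision is unsure."""
    best = max(vision_scores, key=vision_scores.get)
    if vision_scores[best] >= threshold:
        return best  # the unimodal visual decision is kept as-is
    # Otherwise combine the two independent unimodal results (product rule).
    combined = {label: vision_scores[label] * tactile_scores[label]
                for label in vision_scores}
    return max(combined, key=combined.get)


def direct_fusion(vision_feat, tactile_feat, centroids):
    """Concatenate both feature vectors and classify the joint vector."""
    joint = list(vision_feat) + list(tactile_feat)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Nearest-centroid rule in the joint feature space (hypothetical classifier).
    return min(centroids, key=lambda label: sq_dist(joint, centroids[label]))


# Vision is unsure, so the tactile scores change the outcome.
unsure = indirect_fusion({"cotton": 0.55, "wool": 0.45}, {"cotton": 0.2, "wool": 0.8})
# Vision is confident, so touch is never consulted.
confident = indirect_fusion({"cotton": 0.9, "wool": 0.1}, {"cotton": 0.2, "wool": 0.8})

# Hypothetical class centroids over [2 visual dims] + [2 tactile dims].
centroids = {"cotton": [0.2, 0.1, 0.9, 0.3], "wool": [0.8, 0.9, 0.1, 0.7]}
direct = direct_fusion([0.25, 0.2], [0.8, 0.35], centroids)
```

In the indirect variant each modality keeps its own model and only the decisions interact, whereas in the direct variant a single model sees both modalities at once.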

1) Indirect Fusion Methods:
Indirect fusion first uses one modality, visual or tactile, to make a preliminary decision and then introduces the other modality as supplementary information, thereby improving the performance. Yamada et al. [164] proposed a visual and tactile fusion algorithm that first described the visible part of a 3D object globally through visual data and then refined the detailed features through the local deformable mechanism of the tactile perception model. Ilonen et al. [165] proposed an optimal estimation algorithm for visual and tactile fusion based on the constraint of object symmetry. The visual model was captured in the form of a three-dimensional point cloud. The visual and tactile data were fused by the Iterated Extended Kalman Filter (IEKF). Prats et al. [166] proposed a vision-tactile-force fusion algorithm based on virtual visual servoing. This algorithm used visual servoing to estimate the initial pose of the object and then used the tactile sensor to feed back the estimation error. This fusion algorithm could provide accurate position information for the robot to complete a sliding-door pushing task. Yuan et al. [162] proposed an active tactile perception algorithm to identify clothing properties. They used a convolutional neural network, VGG16, to select the location to be explored. Then they used another convolutional neural network, VGG19, to identify clothing properties from the tactile data.
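The idea of refining a visual estimate with tactile measurements through Kalman filtering can be illustrated with a deliberately simplified sketch. It shows only the scalar, linear measurement update; the IEKF used in [165] additionally linearizes a nonlinear measurement model and iterates the update, which is not reproduced here. The function name `kalman_update` and all numerical values (units of cm and cm²) are assumptions.

```python
def kalman_update(mean, var, measurement, meas_var):
    """One scalar Kalman measurement update; returns the new mean and variance."""
    gain = var / (var + meas_var)            # how much to trust the new measurement
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var             # the variance never increases
    return new_mean, new_var


# Hypothetical prior from vision: surface position 10.0 cm with variance 4.0 cm^2.
first_mean, first_var = kalman_update(10.0, 4.0, 12.0, 1.0)

# Further hypothetical tactile contacts refine the estimate step by step.
mean, var = first_mean, first_var
variances = [first_var]
for contact in [11.8, 12.1]:
    mean, var = kalman_update(mean, var, contact, 1.0)
    variances.append(var)
```

Each tactile contact pulls the estimate toward the measured position and shrinks the variance, which matches the indirect scheme of a preliminary visual decision followed by a tactile correction.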
2) Direct Fusion Methods:
Neural network-based methods have been increasingly applied to visual-tactile perception fusion, promoting the development of direct fusion methods. Liu et al. [15] proposed a joint group kernel sparse coding (JGKSC) fusion algorithm that addresses the weak pairing problem of visual and tactile modal data for object recognition. In a comparison with the kNN classification algorithm, it achieved an accuracy of up to 90%. Luo et al. [138] proposed a fabric texture recognition algorithm for visual and tactile images based on Deep Maximum Covariance Analysis (DMCA). This algorithm used deep neural networks to learn from the visual and tactile modal data, obtaining an accuracy of up to 90%. Li et al. [163] proposed a fusion method based on a deep neural network to detect the slip of grasped objects. They used a convolutional neural network (CNN) [167], [168] pretrained on ImageNet to extract the features of visual and tactile images, and applied an LSTM network to compare the feature sequences of the two modalities and make the corresponding decisions. Cui et al. [169] proposed a 3D convolution-based visual-tactile fusion deep neural network (C3D-VTFN) framework to evaluate the grasping state of various deformable objects with an accuracy of 99.97%. Zhang et al. [170] proposed a fusion clustering algorithm based on a deep autoencoder-like nonnegative matrix factorization framework. It used deep matrix factorization under the constraints of the autoencoder-like structure to learn from the fused visual and tactile data. Takahashi and Tan [171] proposed a deep visual-tactile learning algorithm based on an encoder-decoder network and latent variables.
The learning and prediction capabilities of the algorithms are also of interest. Cui et al. [172] proposed a visual-tactile fusion learning algorithm based on the self-attention mechanism (VTFSA) to predict whether a robot can perform a stable grasping task. Calandra et al. [173] proposed a visual and tactile fusion algorithm based on a deep multimodal convolutional neural network to adjust the robot's grasping actions. The model was an end-to-end network that could learn regrasping strategies from the raw visual and tactile data. Lee et al. [174] proposed a multimodal representation model based on self-supervised learning to provide rich feedback information for robots performing complicated manipulation tasks in unstructured environments. Yang et al. [28] proposed a visual-tactile multimodal fusion model for grasp stability prediction. Before grasping, RGB images collected by the camera were input into a pretrained convolutional neural network. During grasping, the data collected by the tactile sensor were input into an LSTM network. The grasping success rate was up to 94%, which was much higher than that of the visual-only algorithms (84%). Dong et al. [175] proposed a lifelong visual-tactile learning (LVTL) framework, which constructed a modality-invariant space based on sparse constraints to capture the internal mapping differences between the visual and tactile modalities. Experimental results showed that LVTL performed better than other algorithms, such as ELLA [176], lslMTMV [177], rLM²L [178] and L²HMT [179].
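To make the notion of self-attention over two modalities more concrete, the sketch below applies generic scaled dot-product self-attention to one visual and one tactile feature token. It is not the VTFSA architecture of [172], whose details are not described here; the identity query/key/value projections, the token values and the final averaging step are all simplifying assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights.append(softmax(scores))
    # Each output token is an attention-weighted mix of all input tokens.
    outputs = [[sum(w * tok[i] for w, tok in zip(row, tokens)) for i in range(dim)]
               for row in weights]
    return outputs, weights


# Hypothetical feature tokens, e.g. produced by separate unimodal encoders.
visual_token = [1.0, 0.0, 0.5, 0.2]
tactile_token = [0.1, 0.9, 0.4, 0.7]
fused_tokens, attention = self_attention([visual_token, tactile_token])

# Average the attended tokens into one joint feature for a downstream classifier.
grasp_feature = [(a + b) / 2 for a, b in zip(*fused_tokens)]
```

In a trained model the projections would be learned, so the attention weights would reflect how informative each modality is for the grasp outcome rather than raw feature similarity.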

D. Applications
The applications of visual and tactile fusion perception on robots fall roughly into two categories: algorithms for environment perception and algorithms helping robots perform complex tasks. Detailed information related to the applications and algorithms is shown in Table IV.
1) Algorithms for Environment Perception:
In the human perceptual system, the information collected by the vision system and the tactile system can complement each other for fused perception. The same holds for robots.
Tactile information facilitates the 3D reconstruction of objects. Björkman et al. [180] used a depth camera to capture objects from a fixed direction and construct an initial, incomplete 3D model. Then, they used Gaussian process regression to estimate the uncertainty at each position. Finally, they applied tactile perception to the areas with the highest uncertainty to complete the 3D reconstruction. Allen [181] proposed a method for the reconstruction of irregular objects, which first determined the shape, size and position of an example hole by vision and then modelled the information with tactile sensors. The works in [164], [165] also proposed fusion algorithms for 3D object reconstruction.
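The uncertainty-driven touch selection described for [180] can be sketched with a one-dimensional Gaussian process. The code below computes the posterior variance at candidate locations given the locations already observed by the camera and selects the candidate with the highest variance as the next touch. The RBF kernel, its length scale, the noise level and the 1D parameterization of the surface are assumptions for illustration, not the settings of the cited work.

```python
import math


def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two scalar locations."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def posterior_variance(x_star, observed, noise=1e-4, length=1.0):
    """GP posterior variance at x_star; it depends only on where data were observed."""
    gram = [[rbf(a, b, length) + (noise if i == j else 0.0)
             for j, b in enumerate(observed)] for i, a in enumerate(observed)]
    k_star = [rbf(x_star, a, length) for a in observed]
    alpha = solve(gram, k_star)
    return rbf(x_star, x_star, length) - sum(k * a for k, a in zip(k_star, alpha))


def next_touch(candidates, observed):
    """Pick the candidate location where the model is least certain."""
    return max(candidates, key=lambda x: posterior_variance(x, observed))


seen_by_camera = [0.0, 1.0, 2.0]   # hypothetical surface locations visible to the camera
candidates = [0.5, 1.5, 5.0]       # hypothetical reachable touch locations
touch_at = next_touch(candidates, seen_by_camera)
```

After the touch, the new location would be appended to the observed set and the selection repeated, so that tactile exploration concentrates on the regions the camera could not see.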
VTFP improves object recognition accuracy compared with single-modality perception. Studies by Heller [57] showed that the accuracy of the texture recognition task based on visual-only or tactile-only information did not exceed 70%, while it increased by approximately 12% with visual-tactile fusion. The methods proposed in [138], [162] also obtained higher object recognition scores based on visual-tactile information fusion.
Delicate manipulation is another application of VTFP in robotics. It combines the object and force recognition capabilities of the fusion algorithms with the motion execution capability of the robot. Cui et al. [169] presented a stable grasping adjustment strategy for deformable objects, achieving an accuracy of up to 99.97%. Moreover, VTFP can aid object pose identification for grasping when the object is occluded. Lee et al. [174] used the one-dimensional force signal from the tactile sensor and the RGB image to train a CNN and to evaluate the alignment state of different wedges and grooves, obtaining an average success rate of 78.7%.
2) Algorithms Helping Robots Perform Complex Tasks:
Many researchers have recently applied visual-tactile fusion methods to a series of complicated robot tasks. Agravante et al. applied a fusion algorithm [182], [183] to aid human-robot collaboration, allowing humans and robots to cooperate in the task of moving a table while preventing the objects on it from falling. The robot used its visual sensors to obtain the pose of the table and of the objects on it, and its tactile sensors to obtain the human's action intention. Dong et al. [175] applied VTFP to the stability control of square objects and spheres, which could be applied in daily life and working scenarios. Prats et al. [184] developed a librarian robot. It used a CCD camera to read the label of a required book and could then be guided to remove the book without disturbing the surrounding books through a combination of visual and tactile sensor feedback.
Kudoh et al. [185] developed a painting robot by using a visual-tactile fusion control method to realize the control of the pen by robot fingers, including the tilt angle of the pen tip and the friction between the pen tip and the drawing paper. This robot successfully depicted the two-dimensional contours of a man and an apple.

IV. CHALLENGES AND FUTURE WORKS
As discussed above, the application of VTFP has improved the environment perception capability and complex task performance of robots. However, it also faces several challenges, ranging from sensors to algorithms and applications.

A. Sensors and Systems
Currently, the types of sensors used for VTFP are limited. Among the publications that specified visual sensor types, more than half applied traditional CCD and CMOS sensors. Only one study [186] used a DVS neuromorphic camera to improve the accuracy of external information judgments. Similarly, commercially available resistive tactile sensors are the most widely used tactile sensors, accounting for more than one-third of the publications. Increasing sensor diversity, especially with bioinspired sensors, may bring new possibilities to related research owing to their special characteristics.

B. Datasets
The current datasets were collected mainly from fabrics, household items and geometric objects, and they are few in number and small in size. The datasets can be enriched by increasing the number of objects or through artificial intelligence methods such as the Generative Adversarial Network (GAN), which has been widely studied and used in vision-based research. Most of the visual and tactile data were collected separately. The studies that collected visual and tactile data simultaneously only used sensors on robot hands or end-effectors. This is quite different from their biological counterparts, which use eyes and tactile sensors for real-time fusion.

C. Applications of VTFP
Human tactile perception covers three-dimensional forces, stretch, temperature and vibration. Various tactile sensors are spread all over the human body. Therefore, humans can perceive the environment through the fusion of tactile and visual information from the whole body. In contrast, the VTFP of robots relies largely on pressure sensors (accounting for more than half of the literature). Moreover, the number of tactile sensors used for robotic VTFP is small. More than 50% of the studies used fewer than 10 tactile sensors, which were mainly installed on the grippers. Therefore, the current applications of robotic VTFP are mostly in relatively simple tasks, such as delicate manipulation and object recognition. Nevertheless, the number of tactile sensors on robots is increasing. For example, the number of tactile sensors covering the surface of the H1 robot proposed by TUM has reached 1260 [22].

D. Multiperception Fusion
Robot perception in complex environments for complicated tasks may require the fusion of multimodal sensing, such as visual, tactile, auditory, olfactory and gustatory sensing. When sensors and application scenarios differ, the choice of fusion strategy is a challenge. Moreover, machine learning-based fusion algorithms suffer from poor transfer capabilities.

V. CONCLUSION
This paper first reviews the physiological basis of biological vision and tactile systems and the biological vision-tactile fusion mechanism. After that, the relevant principles of typical bioinspired vision and tactile sensors are surveyed. Several vision-tactile fusion algorithms and publicly available datasets are reported. Compared with vision-only or tactile-only methods, the algorithms based on visual and tactile fusion show better performance. In addition, this paper classifies and summarizes the applications of VTFP to robots. The challenges and future works of VTFP and its applications to robots are discussed at the end of this review. This paper provides a systematic review of VTFP, including the biological mechanisms and inspirations, robot sensors, fusion algorithms and datasets, as well as its applications to robots. Hopefully, this survey will be of use to practitioners designing VTFP systems and to researchers working on humanoid robotics.