High-Speed Tactile Braille Reading via Biomimetic Sliding Interactions

Most braille-reading robotic sensors employ a discrete letter-by-letter reading strategy, despite the higher potential speeds of a biomimetic sliding approach. We propose a complete pipeline for continuous braille reading: frames are dynamically collected with a vision-based tactile sensor; an autoencoder removes motion-blurring artefacts; a lightweight YOLO v8 model classifies the braille characters; and a data-driven consolidation stage minimizes errors in the predicted string. We demonstrate a state-of-the-art speed of 315 words per minute at 87.5% accuracy, more than twice the speed of human braille reading. Whilst demonstrated on braille, this biomimetic sliding approach can be further employed for richer dynamic spatial and temporal detection of surface textures, and we consider the challenges which must be addressed in its development.

A particularly promising approach is that of vision-based tactile sensors [9]. By positioning a small camera behind a deformable surface, features can be detected at high resolutions and speeds, all within an environment where the lighting can be carefully controlled [10]. Recent setups can be scaled to fingertip size, facilitating their use for in-hand manipulation tasks [11], [12].
However, these sensors are typically employed to return static images from discrete areas of interest [13], [14]. As well as being slower than a biomimetic approach in which the fingertip is instead slid along a surface (Fig. 1), these discrete images do not include information about how the surfaces dynamically interact, which is a vital consideration during tactile interactions [15], [16]. As cutting-edge e-skins begin to incorporate additional biomimetic functionalities, human-like dynamic interactions such as sliding become increasingly essential in robotic designs [17], [18].
A few existing works have used vision-based tactile sensors to measure stimuli dynamically, such as slip detection and friction coefficients [19], [20]. Others have used event-based approaches for high-speed visuo-tactile measurements [21]. Cao et al. [15] and Shimonomura et al. [22] increase measurement speeds by introducing roller-based sensor designs for continuous data collection. However, roller designs are not suited to general-purpose platforms, such as the fingertips of a robotic hand, which must also perform fine in-hand manipulations.
At sufficiently high sliding speeds, a vision-based tactile sensor's output will be non-negligibly affected by camera artefacts such as motion blur and rolling shutter effects [23]. Though traditional computer vision approaches can begin to address these artefacts [24], [25], more challenging scenarios require multiple cameras [26], [27] or interpolation between high-speed frames [28]. Alternatively, synthetic blurring can be employed, though care must be taken to select and train with kernels which sufficiently match the real-world effects [28], [29], [30].
In addition, the selection of the soft material placed in front of the camera can have a significant effect on the returned data [31], requiring the matching of material composition to sensor functionality for optimized results [9], [32]. At high speeds, the friction accompanying sliding interactions can cause increased wear on the sensor's surface. Protective surfaces, such as tapes, can be used to maximize longevity but may also introduce diffusive artefacts, making the dynamic frames more difficult to interpret [14].
In this work, we focus on the high-resolution and highly dynamic task of braille reading, aiming to achieve high speeds via a biomimetic sliding approach [33]. To maximize functionality, we use a pre-existing DIGIT sensor [12], designed for in-hand manipulation. We address the motion artefacts which emerge at sliding speeds challenging its 60 fps frame rate, with an approach which could straightforwardly be applied to higher frame rates and sliding speeds. Our approach significantly outperforms the speeds previously achieved in the literature (which are discrete or slowly sliding at 15.5 wpm [13], [34], [35]) and by humans (120 wpm [36]). Our work is a vital step in the transition from discrete to biomimetic sliding interactions during tactile sensing, facilitating faster and more information-rich tactile sensor data. Biomimetic sliding interactions are associated with motion-blurring effects, which we remove using an autoencoder trained on a set of real static images which are synthetically blurred. From these quickly-collected frames, a classifier detects the most prominent braille characters, before a consolidation step combines the redundant information from multiple frames to minimize the error of the predicted string. With this pipeline, our sliding fingertip can read braille sentences with 87.5% accuracy at 315 words per minute (wpm) whilst covered with a protective tape surface.

II. MATERIALS & METHODS
Our dynamic braille reading approach consists of 4 key stages, displayed in Fig. 2(a): data collection, in which a commercially available vision-based tactile sensor is moved at speed across a refreshable braille display and its output frames recorded; autoencoder deblurring, which removes motion artefacts from each frame; classification, which predicts the braille characters present; and consolidation, which combines multiple predictions into a final output string. The following sections describe each stage in more detail; the sketch below gives a bird's-eye view of how the stages chain together.
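As a roadmap, here is a minimal sketch of the four-stage pipeline. The helper functions (deblur, classify_frame, consolidate) are hypothetical names mirroring the stages sketched in the later sections; names and signatures are illustrative assumptions, not the exact implementation.

```python
# Bird's-eye sketch of the four-stage pipeline. `deblur`, `classify_frame`
# and `consolidate` are hypothetical helpers sketched in later sections.
def read_braille_row(frames, confusion, row_len=20):
    raw = ""
    for frame in frames:                  # 1. dynamically collected frames
        sharp = deblur(frame)             # 2. autoencoder removes motion blur
        letter = classify_frame(sharp)    # 3. YOLO v8 picks the best character
        if letter is not None:
            raw += letter
    return consolidate(raw, row_len, confusion)  # 4. consolidation stage
```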

A. Dynamic Data Collection
To record data dynamically, a commercially available DIGIT sensor [12] is attached to the end of a Universal Robots UR3e arm (Fig. 2(b)), which provides precise control of the sensor's position and speed. For our experiments, we constrain the motion of the arm, and therefore the sensor, to sliding across a single row of braille characters. Standard-size braille characters are provided by the Orbit Reader 20, a commercially available refreshable braille display. This display is programmed to show a custom text file which is saved and read from an SD card. If the sensor is left to interact with this display over multiple high-speed runs, the reflective silicone paint on its surface is eroded away. To increase the durability during dynamic interactions, we wrap the soft material with 3M Micropore medical tape. This also provides the benefits of reducing the friction coefficient of the sensor's surface, allowing for smoother sliding, and reducing the amount of light leakage through rips in the paint. We examine the effect of this tape on the returned frames in Section III-A. A calibrated strain gauge is also mounted between the sensor and robot using 3D printed adapters, allowing the normal force of the sensor on the display to be recorded.
As the speed of sliding increases, two main motion artefacts appear in the frames: motion blur, the streaking of moving objects in an image due to a long exposure; and the rolling shutter effect, an apparent stretching or compressing of moving objects due to frames being captured by scanning across a scene, giving an output which is dependent on the direction of motion [37]. The higher speeds of Fig. 2(c)'s example frames show obvious blurring, whilst the rolling shutter effect can be seen by comparing the upper and lower rows, which are taken by sliding the sensor in opposite directions. In the upper row, we slide right to left: both the dot radii and the distances between dots are compressed, introducing braille dots (such as that shown in the white dotted circle) that cannot be seen in the static image. In contrast, the lower row slides left to right: the dots and distances are elongated, so some parts of the static image are not seen in the moving frames (blue circle).
In this work, we address both artefact types in order to extract rich dynamic information from textured surfaces: an autoencoder is trained to correct motion blur, whilst a consolidation stage aims to minimize the errors introduced by rolling shutter artefacts.

B. Deblurring Autoencoder
In order to convert the raw blurred frames into classifiable braille images, we build and train a deep deblurring autoencoder using synthetically blurred static images. Its architecture is shown in Fig. 3: firstly, a deep network consisting of 3 downsampling encoding layers, using 2D convolutions and ReLU non-linearities, learns an efficient compression of resized motion-blurred images. The latent space representation of these images is produced via a 2D max pooling layer. To decode this representation, 5 decoding layers, using 2D transposed upsampling convolutions and ReLU activations, are trained to produce a sharp RGB image.
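To make the architecture concrete, the following is a minimal PyTorch sketch of an encoder-decoder of this shape. The paper specifies only the layer counts, pooling, and activation types; the channel widths and kernel sizes below are illustrative assumptions.

```python
# Minimal sketch of the deblurring autoencoder: 3 downsampling conv+ReLU
# encoding layers, a max-pooled latent space, and 5 decoding layers.
# Channel widths and kernel sizes are assumptions.
import torch
import torch.nn as nn

class DeblurAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 3 downsampling stages (128 -> 64 -> 32 -> 16), then pooling.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # latent representation at 8 x 8
        )
        # Decoder: 5 layers upsampling back to a sharp 128 x 128 RGB frame.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),  # sharp RGB output
        )

    def forward(self, x):  # x: (N, 3, 128, 128) blurred frame
        return self.decoder(self.encoder(x))
```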
Synthetic blurry training images are generated according to the standard blur model

Y = f ∗ X + n, (1)

where Y, X and n represent the blurry image, the latent sharp image and additive noise respectively, and f represents a PSF (Point Spread Function) kernel applied by convolution. The blur kernels are constructed by filling a 50 × 50 kernel with a centred ellipse with small varying thickness, length and direction of major axis (Fig. 4(b)). Due to the single-axis motion of sliding in our application, linear kernels provide sufficient blurring without compromising accuracy [38].
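A sketch of this synthetic blurring step is shown below, using OpenCV's ellipse rasterization for the kernel. The exact thickness, length, and angle ranges, and the noise level, are illustrative assumptions.

```python
# Sketch of synthetic blur generation: a 50 x 50 PSF kernel containing a
# centred ellipse of randomised thickness, length and orientation, applied
# to a sharp image as Y = f * X + n. Parameter ranges are assumptions.
import cv2
import numpy as np

def random_blur_kernel(size=50, rng=np.random.default_rng()):
    kernel = np.zeros((size, size), dtype=np.float32)
    length = rng.integers(2, size // 2)   # semi-major axis (assumed range)
    thickness = rng.integers(1, 4)        # semi-minor axis (assumed range)
    angle = rng.uniform(0, 180)           # orientation of the major axis
    cv2.ellipse(kernel, (size // 2, size // 2),
                (int(length), int(thickness)), angle, 0, 360,
                color=1.0, thickness=-1)
    return kernel / kernel.sum()          # normalise to preserve brightness

def synthesise_blurry(sharp, noise_std=2.0, rng=np.random.default_rng()):
    f = random_blur_kernel(rng=rng)
    blurry = cv2.filter2D(sharp.astype(np.float32), -1, f)  # Y = f * X
    blurry += rng.normal(0, noise_std, sharp.shape)         # + n
    return np.clip(blurry, 0, 255).astype(np.uint8)
```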
Using the 1000 sharp images previously collected, we generate a new synthetic dataset consisting of 21000 sharp/blurry image pairs, which are then resized to 128 × 128 for input into the autoencoder, with a 0.8/0.2 train/validation split. For each original sharp image, we also create a pair with no added blurring, resulting in an identical input and output. This is done to improve the autoencoder performance on minimally blurred images, which can be seen at the start of sliding motions, where the robot arm is accelerating from rest. Using synthetic blur avoids the hardware constraints associated with physical techniques, allowing for the generation of much larger datasets. However, the Sim2Real gap must be carefully considered, as the 'true' blur kernel is not known and many assumptions about the blurring method are made. To minimize this gap, we provide the autoencoder with a large variety of artificial data. By varying the blur kernels that the autoencoder model encounters during training, we ensure robust performance for different directions and speeds of sliding, as well as its application to real-life motion blur in frames collected during sliding.
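A minimal sketch of the dataset assembly is given below, matching the counts reported above (1000 sharp images × (20 blurred + 1 identity pair) = 21000 pairs). The file layout is an assumption, and synthesise_blurry is taken from the previous sketch.

```python
# Sketch of synthetic dataset assembly: 20 blurred pairs plus 1 identity
# pair per sharp image, resized to 128 x 128 and split 0.8/0.2.
# File layout is an assumption; synthesise_blurry is defined above.
import glob
import random
import cv2

def resize(im):
    return cv2.resize(im, (128, 128))

pairs = []
for path in glob.glob("sharp_images/*.png"):      # hypothetical path
    sharp = cv2.imread(path)
    pairs.append((resize(sharp), resize(sharp)))  # identity (no-blur) pair
    for _ in range(20):
        pairs.append((resize(synthesise_blurry(sharp)), resize(sharp)))

random.shuffle(pairs)
split = int(0.8 * len(pairs))
train_pairs, val_pairs = pairs[:split], pairs[split:]
```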

C. Braille Character Classification
Having deblurred the frames, we next look to classify any braille characters that are present. To achieve this, a lightweight object detection neural network, with a YOLO v8 architecture, is trained with 450 labelled real sharp images. These images are taken from the autoencoder's 1000 training images (Section II-B), and are manually labelled with bounding boxes and class labels. When training, the images are augmented with rotations, added noise and varying blurs. This gives a total of 1170 examples, which are split into a 0.8/0.2 ratio for training. This augmentation step attempts to increase the robustness of the classification model, which is needed to better classify the outputs from the autoencoder. We further increase robustness by only allowing a maximum of 1 classification per frame, choosing the character associated with the highest confidence. During sliding, this approach deals with the case of multiple characters appearing in the same frame.
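A minimal sketch of the one-detection-per-frame rule, using the Ultralytics YOLOv8 inference API; the fine-tuned weights file name is hypothetical.

```python
# Sketch of single-detection classification per frame with Ultralytics
# YOLOv8; the weights path is a hypothetical fine-tuned checkpoint.
from ultralytics import YOLO

model = YOLO("braille_yolov8n.pt")  # hypothetical fine-tuned weights

def classify_frame(deblurred):
    """Return the single highest-confidence braille character, or None."""
    result = model(deblurred, verbose=False)[0]
    if len(result.boxes) == 0:
        return None
    best = result.boxes.conf.argmax()  # keep at most 1 detection per frame
    return result.names[int(result.boxes.cls[best])]
```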

D. Character Consolidation
Fig. 2(a)'s final step is consolidation: combining multiple frames into a single prediction. When detecting and classifying braille symbols during sliding, we expect the same character to be seen in multiple frames. We also expect there to be some incorrect detections of characters, mostly when a character is entering or leaving the frame but is only partially visible. To address this, we first model a single row of raw predicted text, P_r, as

P_r = (T_1 | E_T1)^R (T_2 | E_T2)^R ... (T_N | E_TN)^R, (2)

where R and T represent a repetition factor (dependent on sliding speed) and the latent true letter respectively, and E_T represents incorrect classifications for that true letter; that is, each of the row's true letters appears approximately R times, with some occurrences replaced by erroneous classifications. We propose a data-based error correction algorithm to reduce the effect of R and E_T, using a normalized confusion matrix. The confusion matrix is calculated in Section III-B, after which our consolidation method is described in Section III-C. We measure the final accuracy score by comparing the error-corrected predicted text to the ground truth. We look at each output letter, and the letters immediately adjacent, when comparing; this avoids any offsets skewing the accuracy measurement. The speed of the system is calculated by measuring the time taken for the sensor to complete 1 row of braille, which in our braille display contains 20 braille characters. We convert this characters per second (cps) speed into words per minute (wpm) by assuming an average of 5 characters per word in the English language [39].
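The two evaluation measures above can be sketched directly; this is a minimal illustration of the cps-to-wpm conversion (assuming 5 characters per word) and an adjacency-tolerant accuracy, not the authors' exact scoring code.

```python
# Sketch of the evaluation metrics: cps -> wpm conversion (5 characters
# per word), and an accuracy that also accepts matches in immediately
# adjacent positions so that small offsets do not skew the score.
def cps_to_wpm(chars, seconds, chars_per_word=5):
    return (chars / seconds) / chars_per_word * 60

def adjacency_accuracy(predicted, truth):
    correct = 0
    for i, letter in enumerate(predicted):
        window = truth[max(0, i - 1): i + 2]  # aligned letter and neighbours
        if letter in window:
            correct += 1
    return correct / max(len(predicted), len(truth))

# Example: one 20-character row read in 1.14 s is roughly 211 wpm.
print(cps_to_wpm(20, 1.14))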

III. RESULTS

A. Preliminary Characterization
Before collecting the process's training data, we first examine the normal force required to produce a clear image with the sensor: a range of realistic forces is applied to the braille display, and the average pixel intensity at the centre of a braille dot is recorded (Fig. 5(a)). After a threshold force of approximately 0.1 N, the camera begins to see the braille dot. As the force is increased in regular intervals, we see that there is a substantial linear region between average pixel intensity and the force (stress) applied to the contact material. For reliable results, we operate within this region in subsequent tests, using a force of approximately 2 N.
We next confirm that the application of the medical tape (Section II-A) does not prevent braille from being read. To do so, we record the sensor's response as it is moved in an up-and-down motion onto a braille character. Fig. 5(b) and (c) plot the standard deviation of each pixel during 5 cycles for the 'taped' and 'untaped' cases, respectively. Without tape, the characters are clearly visible (and a rip in the paint can also be seen), and there is little change near the braille dots. Conversely, the diffusive characteristic of the tape can be seen by the greater change of pixel intensities in Fig. 5(b). This is also apparent in Fig. 5(d) and (e), which plot the intensities of two pixels in both cases: one directly on top of a braille dot (d), and one in a region further from the character (e). Around braille dots, the peak-to-peak amplitude is lower with tape on, but there is a higher average intensity (due to the diffusive effect). Outside the braille character, this remains true, but there is a greater difference in average intensities. Adding the tape reduces the effect of the rip in the paint (Fig. 5(c)), demonstrating its ability to increase the sensor's robustness. Given this durability advantage, and the clear images still visible with the taped sensor, a taped surface is used for all subsequent tests.

B. Autoencoder and Classifier
The performance of the autoencoder and classifier is evaluated both qualitatively and quantitatively. In Fig. 6(a), we present example blurry inputs, and the deblurred and classified outputs, at a reading speed of 356 wpm. The deblurred frames are noticeably sharper, making it easier to see the underlying character. We show that the YOLO v8 braille classifier, which is trained on sharp images, works well in classifying the autoencoder output compared to the ground truth. We also show examples of incorrect outputs due to failures of both the autoencoder and the classifier: the autoencoder can fail to reconstruct the correct symbol (due to very high speed or similar characters), or the classifier, which is trained on a fairly small dataset, can incorrectly classify an adequately deblurred image. This is why Section II-D's consolidation steps help to improve the final prediction.
In Fig. 6(b), we quantify the performance of the braille reading system by generating a confusion matrix, showing the likelihood of each true letter being detected as each predicted letter. We create this matrix by reading 3 rows of each letter at 356 wpm, and then measuring the proportion of each letter that is classified over all returned frames. The clear diagonal line, together with very few incorrect classifications, suggests that the system is highly accurate and robust. The highest incorrect classification, between 'L' and 'P', is most likely due to the limitations of the classifier network, because this error is not symmetrical about the diagonal: it is much more likely for a 'P' to be classed as an 'L' than vice versa, which suggests class imbalance in the training data of the classifier, rather than the autoencoder mixing them up. In contrast, confusions which are symmetrical about the diagonal point to the autoencoder as the point of failure. An example of this is the letters 'K' and 'X', which look very similar (as shown in Fig. 6(b) and (c)), especially if motion blur is present; the confusion for these letters is symmetrical about the diagonal, suggesting the autoencoder is more likely to be at fault. Fig. 6(c) shows the difficulty in distinguishing between similar characters, even for the human eye. The characters can also occupy different regions of the frame, and therefore different lighting conditions, which is a further reason our lightweight classification model can fail. Gathering more data for training would decrease these errors in the confusion matrix, as well as address the class imbalance. More data could also improve the autoencoder, but it is likely to be limited by its simple architecture, which could be further tailored to this application.
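The matrix construction described above can be sketched as follows; read_rows is a hypothetical helper returning the per-frame classifications for one row of a single repeated character.

```python
# Sketch of confusion-matrix construction: 3 rows of each letter are read
# at 356 wpm, and the proportion of frames classified as each letter is
# recorded. `read_rows` is a hypothetical data source.
import numpy as np
import string

letters = string.ascii_lowercase
confusion = np.zeros((26, 26))

for i, true_letter in enumerate(letters):
    for row in range(3):
        for predicted in read_rows(true_letter, row):  # hypothetical helper
            confusion[i, letters.index(predicted)] += 1

confusion /= confusion.sum(axis=1, keepdims=True)  # normalise each row
```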

C. Entire System
A normalized version of Fig. 6(b)'s confusion matrix is used in the final consolidation stage: given a prediction, the matrix returns the probability that it is correct. Since each character is seen in multiple frames, we calculate a score for each letter in a text block. The letter with the highest score is assigned to that block, along with the score. The process is described in Algorithm 1, in which the floor and ceiling terms stem from the fact that the total number of frames that are captured and classified, and therefore the length of the predicted text, is dependent on sliding speed. As the length of the raw text can take any value, the ratio between the raw text length and the length of the ground truth may not be an integer. We therefore define 2 discrete block sizes (values of R), as the floor and ceiling of this length ratio. Text blocks of these sizes are generated for each letter in the raw text.
To extract the final error- and length-corrected string, we choose the combination of blocks that has the highest confidence (Algorithm 2).
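Below is a compact Python sketch of Algorithms 1 and 2 together, assuming the normalized confusion matrix is indexed [true letter][predicted letter]. Exhaustive recursion with memoisation is used for clarity; this is an illustration of the technique, not the authors' exact implementation.

```python
# Sketch of the consolidation stage: the raw predicted string is split
# into blocks of size floor(n) or ceiling(n), each block is assigned the
# letter maximising the summed confusion-matrix probability (Alg. 1), and
# the block combination with the highest total confidence wins (Alg. 2).
import math
import string
from functools import lru_cache

letters = string.ascii_lowercase

def consolidate(raw, truth_len, confusion):
    n = len(raw) / truth_len
    r1, r2 = math.floor(n), math.ceil(n)

    def block_letter(block):
        # Algorithm 1: score each candidate true letter by how likely it
        # is to produce the predictions observed in this block.
        scores = {c: sum(confusion[letters.index(c)][letters.index(p)]
                         for p in block) for c in letters}
        best = max(scores, key=scores.get)
        return best, scores[best]

    @lru_cache(maxsize=None)
    def best_split(start):
        # Algorithm 2: choose the block combination of maximum confidence.
        if start == len(raw):
            return 0.0, ""
        best_conf, best_text = float("-inf"), None
        for size in {r1, r2}:
            if 0 < size <= len(raw) - start:
                letter, score = block_letter(raw[start:start + size])
                conf, rest = best_split(start + size)
                if rest is not None and score + conf > best_conf:
                    best_conf, best_text = score + conf, letter + rest
        return best_conf, best_text

    return best_split(0)[1]
```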
The high-speed ability of this complete dynamic braille reading system is demonstrated by reading 3 different texts. They are chosen as they contain all the letters of the alphabet as well as a large variety of letter combinations; both the type of characters and their arrangement are important when reading texts. Table I evaluates the system at a variety of reading speeds for the pangram 'thequickbrownfoxjumpsoverthelazydog.' The first string for each speed is the raw predicted output text; the repetitions and noise seen in these strings match the model we assign in (2). The second string shows the error- and length-corrected string that we use for measuring the accuracy of the system. It is clear that our error correction algorithm is robust for the different speeds at which this text is read. The consolidation stage is particularly affected by the arrangement of characters, as it looks at the whole sentence, rather than individual frames. To further validate our algorithm, we therefore test it on more texts using generated sentences, all at the same speed. We use (2)'s model with the confusion matrix to generate pre-corrected sentences. In Table II, we show the accuracy of our generation technique by comparing the generated text to the real text, at approximately the same reading speed (356 wpm). For Text 2 in the table, we read 'Asimov's Laws of Robotics' [40] with the system, a longer text than the pangram, and see that our generated text has a similar score after the same decoding method is applied. This provides further evidence that the generated texts closely resemble the true output from the system.

TABLE II COMPARISON OF REAL TEXT AND GENERATED TEXT
With this method, we read a much longer text, Carroll's 'Alice's Adventures in Wonderland' [41], which gives a better representation of the performance of the consolidation stage, due to its length and breadth of letter combinations. We plot the average accuracy (of 5 runs) for variations in the confusion matrix used to generate the 'raw' text in Fig. 7. For each run, we add a random amount of error to each value in the pre-normalized confusion matrix, which in turn adds more error to the generated text. We choose an error range comparable to the median of the confusion matrix, since small changes in speed add small errors to the confusion matrix. In this plot, we see the expected decrease in accuracy as error is added, but we note that the accuracy remains high even for errors up to 2%. Therefore, our consolidation stage (using the original confusion matrix in Fig. 6) is robust to small errors, both for different texts and speeds.
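A minimal sketch of this perturbation step is given below; the uniform error model and its range are assumptions chosen to mirror the description above.

```python
# Sketch of the robustness sweep of Fig. 7: random error, on a scale
# comparable to the matrix median, is added to the pre-normalised
# confusion matrix before re-normalising and regenerating 'raw' text.
import numpy as np

def perturbed_confusion(counts, error_scale, rng=np.random.default_rng()):
    noisy = counts + rng.uniform(0.0, error_scale, counts.shape)
    return noisy / noisy.sum(axis=1, keepdims=True)  # re-normalise rows
```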
In Fig. 8, we plot the average accuracy of braille reading at different speeds, each averaged over 5 trials. The advantages of sliding are immediately clear: sliding allows for much faster speeds of reading, whereas discrete braille reading has a maximum of 14 wpm, limited by the mechanical speed of the UR3e arm when performing discrete movements. Though the autoencoder's accuracy appears to increase across the first two measurements, this falls within the error bars of each measurement, and we assume this to be an artefact of experimental variations.
Furthermore, use of the autoencoder to deblur the raw blurry images greatly increases the accuracy of the predicted text in continuous reading. Surprisingly, applying the classifier directly to the raw images still gives high accuracy at speeds near 100 wpm, revealing the effectiveness of the classifier and the data augmentation during training. When the deblurring autoencoder is applied, we achieve much faster reading speeds for the same accuracy. The advantage of using the autoencoder is most apparent at higher speed ranges, where the effect of motion blur is much more pronounced.

IV. CONCLUSION
In this work, we present a complete pipeline for dynamic braille reading using vision-based tactile sensors, demonstrating the effectiveness of data-driven artefact removal in increasing characterization performance at all sliding speeds. We present a deblurring autoencoder, trained on artificially blurred images, to increase the clarity of blurred frames captured during sliding. Our lightweight braille detection classifier, based on a YOLO v8 architecture, is trained on augmented real images and used to classify the deblurred images. We demonstrate the improved accuracy of the combined stages, as well as a substantial increase over discrete reading.
With our data-driven approach, we achieve highly accurate and fast braille reading for a large range of speeds, including an accuracy of 87.5% at 315 wpm. This is significantly faster than previous research, and the approach can be scaled with more data and more complex model architectures to achieve better performance at even higher speeds.
Beyond braille reading, these results demonstrate the high potential of dynamic tactile interactions for high-speed data collection. More universal implementations would require the development of generalized deblurring and classification stages, tuning of autoencoder parameters, as well as material optimizations to balance the effects of durability, resolution, stiffness, and viscoelasticity. Given the order-of-magnitude increase in braille reading speed which our system achieves, such directions are a very promising route for future work.

Fig. 1. (a) Biomimetic braille reading: bridging the gap between human and robotic tactile sensing. (b) Reading each letter individually requires minimal processing, but is slow and returns no dynamic information. (c) We instead explore a biomimetic sliding approach, and propose a complete pipeline to return the predicted text.

Fig. 2. (a) Process overview: converting blurry braille frames into a predicted output string. (b) Experimental setup: a DIGIT sensor covered with medical tape is moved across a refreshable braille display. (c) Example frames captured by the sensor at different speeds and directions, with the medical tape covering.

Fig. 3. Autoencoder architecture used to deblur the frames captured while sliding across the braille surface. The captured frames are resized and input as 3-channel RGB images, outputting a deblurred frame.

Fig. 5. (a) Linear relationship between the average pixel intensity and applied force. (b) & (c) Standard deviation of up-down frame sequence for the taped and untaped cases, respectively. (d) & (e) Plots of the change in average pixel intensity during dynamic testing for the taped and untaped cases.

Fig. 6. Results of the autoencoder and classifier system at a speed of 356 wpm. (a) Example images for each step of braille reading, as well as points of failure. (b) Confusion matrix showing classification rate. (c) Raw visualisations of the 2 most common errors identified by the confusion matrix.

Fig. 7. Average score of generated text, reading 'Alice's Adventures in Wonderland' [41], while varying the error of the confusion matrix.

Fig. 8. Accuracy of predicted text against speed of reading. Discrete reading speed is shown with a red cross. The blue line shows accuracy when using the deblurring autoencoder + classifier. The dashed green line shows accuracy for classification directly on the blurred image.

Algorithm 1: Assign Each Block a Letter and Score.
Input: p ← Raw predicted text, t ← Ground truth text
Output: X ← dictionary of block:(letter, score)
n ← length(p)/length(t)
R1 ← floor(n)
R2 ← ceiling(n)
for all possible blocks of size R1 or R2 do
    for each letter in block do
        score ← probability of being true letter
    end for
    Choose letter with maximum score
    X[block] ← (letter, score)
end for

Algorithm 2: Maximise Confidence of Block Combinations.
Input: X ← dictionary of block:(letter, score)
Output: y ← error corrected text
for all possible combinations of blocks in X do
    confidence ← total score of blocks in combination
end for
y ← combination with maximum confidence