Deep Learning-Based Hypothesis Generation Model and Its Application on Virtual Chinese Calligraphy-Writing Robot



I. INTRODUCTION
In recent years, artificial intelligence (AI) has dramatically affected human life in many areas, such as security, domotics, automatic systems, face recognition, object recognition, and market analysis, to name a few. Most of these research studies concern artificial narrow intelligence (ANI). However, devices, machines, and robots that need to adapt to a changeable environment require deep thinking and complex perception to handle uncertainties and make correct decisions. As a result, artificial general intelligence (AGI) is becoming an important topic of investigation for many researchers. AGI is a kind of strong AI which attempts to model human cognition and the human mind. One of the key elements of AGI's kernel is the cognition system. Traditionally, cognitive psychology includes several parts, e.g., reasoning, memory, and perception. Among them, the hypothesis generation model [1] is an important research topic in reasoning, as it describes how a human makes decisions by generating possible states based on historical experiences to solve a problem. In a hypothesis generation structure, the decision maker requires the actual state of the world in order to rectify the behavior if the current state is wrong. To the knowledge of the authors, most research investigating the hypothesis generation model is probability-based. That is, the posterior distribution is calculated to make new inferences based on historical experiences [2]. However, computation in the human brain is neuron-based rather than an explicit calculation of probabilities. As an attempt to solve this problem, in this paper we propose a novel neuron-based hypothesis generation model, called hypothesis generation net, to model human cognition, including how to make decisions and how to take actions.
The associate editor coordinating the review of this manuscript and approving it for publication was Aysegul Ucar.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the last few years, deep neural networks have made a series of breakthroughs. They are widely utilized in image classification [3]-[7], object detection [8], as well as voice synthesis and image translation [9], [10]. The autoencoder (AE) [11], [12] is a kind of unsupervised learning neural network which learns and extracts features automatically. The hidden layers of an AE consist of two parts, i.e., the encoder and the decoder. The aim of the encoder is to compress an input into a set of latent vectors. Then, these latent vectors can be processed by the decoder to reconstruct the input. The traditional AE is usually utilized for dimensionality reduction or feature extraction. In recent years, AE has been widely applied in generating images, including converting picture colors, removing watermarks, denoising images, etc. As a result, there have been various types of research on autoencoders, such as the variational autoencoder [13], denoising autoencoder [14], sparse autoencoder [15], etc. Another related method in unsupervised learning is generative adversarial networks (GANs) [16]-[18], which utilize a discriminator model to classify output images as 'real' or 'fake' and a generator model to produce 'fake' images which the discriminator model cannot distinguish from 'real' images. The GAN model has inspired many subsequent works on image synthesis, such as DCGAN [19] and the Deepfake algorithm [20], [21], which can swap one person's face with another in a video or an image. Motivated by AE and GANs, a neuron-based hypothesis generation model is established in this paper. Through deep learning realization, the proposed hypothesis generation model has the ability to learn and generate hypotheses by practicing based on historical experiences, addressing the problem of image-to-action translation.
To validate the feasibility of the proposed hypothesis generation model, we show that a virtual robot with its cognition system can learn how to write Chinese calligraphy in a simulation environment through thinking and practicing from a human writing sample. Chinese calligraphy writing, which is regarded as a difficult task requiring extremely complicated motions [22]-[25], focuses on changing the speed, pressure, strength, orientation, and angle [26] of a writing brush to write aesthetic calligraphy. It is complicated for designers to analyze the strokes of characters in different styles. Therefore, profound skills are needed to write Chinese characters well. Pressing the brush heavily or lightly causes the stroke of the Chinese characters to become thick or thin, respectively. Moreover, the turning angle and timing for manipulating the brush are also important. Given the challenges, there have been studies focusing on the development of Chinese calligraphy-writing robots [22]-[31]. To simplify the tasks required, most image-based studies utilized a 3-axis vector [x, y, z] to control the robot to write Chinese calligraphy [24]-[29], because 6-axis [x, y, z, roll, pitch, yaw] motion planning for Chinese calligraphy writing is a complex task for robots. It is intuitive to extract the position component [x, y, z] from a Chinese calligraphy character by using the skeleton and the thickness of the character. However, the orientation and tilt of the writing brush are much more complicated to calculate because Chinese calligraphy characters can be written with many different motions. That is, different motions can achieve the same writing result. The relationship between motion and writing result is not a one-to-one, but a many-to-one mapping function.
While the generation of position vector sequences for the writing brush is straightforward through machine vision operations, the possible combinations of orientation and tilt sequences for the writing brush are extremely numerous. Therefore, it is difficult to generate the roll, pitch, and yaw coordinates of the writing brush from a human writing sample by directly using computer vision methods. In light of the above difficulties, our objective is to apply the proposed neuron-based hypothesis generation model to a virtual robotic system in a simulation environment, where the virtual robot with its cognition system can learn and think how to write Chinese characters well by practicing.
The rest of this paper is organized as follows. Section II discusses the related works. Section III presents the proposed hypothesis generation model. In Section IV, we apply the hypothesis generation model to a virtual robotic calligraphy-writing system. Simulation results are shown in Section V to verify the performance of the proposed method. Conclusions are drawn in Section VI.

II. RELATED WORKS
To build an artificial cognitive system to model the hypothesis generation process, every single neuron of deep neural networks is important. By connecting multiple neurons, we can construct a system to simulate the structure of a human brain to fulfill the function of reasoning and judgement. Without hypothesis generation processes, the system is not able to understand the surroundings and learn by itself. Therefore, deep neural networks are utilized to realize the hypothesis generation process to model the psychological learning process of human beings to accomplish different types of tasks.
In a hypothesis generation model, most investigations indicate that the hypotheses made by humans come close to the Bayesian model [1], [21], [32], [33], where inference comes from hypothesis generation and evaluation as:
$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')} \tag{1}$$
where H is a complete set of hypotheses, h ∈ H is a hypothesis, d is the sensor input, P(h|d) is the posterior probability of hypothesis h, P(h) denotes its prior probability, and P(d|h) represents the likelihood of the sensory input data under hypothesis h.
Because H is a complete set of hypotheses, it is impossible to generate the whole space of hypotheses in many cases.
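As a minimal illustration of the Bayesian evaluation described above, the following Python sketch computes the posterior over a small hypothesis set that can be enumerated completely. The hypothesis names and all probabilities are hypothetical values chosen for illustration and are not taken from this paper.

```python
def posterior(prior, likelihood):
    """Return P(h|d) for every hypothesis h in a finite set H.

    prior:      dict mapping hypothesis -> P(h)
    likelihood: dict mapping hypothesis -> P(d|h) for the observed input d
    """
    # Numerator of Bayes' rule for each hypothesis.
    joint = {h: likelihood[h] * prior[h] for h in prior}
    # Denominator: sum over the complete hypothesis set H.
    evidence = sum(joint.values())
    return {h: p / evidence for h, p in joint.items()}


# Hypothetical toy numbers (assumptions, not data from the paper).
prior = {"light_press": 0.5, "medium_press": 0.3, "heavy_press": 0.2}
likelihood = {"light_press": 0.1, "medium_press": 0.4, "heavy_press": 0.7}
post = posterior(prior, likelihood)
best = max(post, key=post.get)
```

The enumeration in `posterior` is only feasible because the toy set H has three elements, which is the limitation that motivates an approximation when H is large.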
To approximate the posterior probability with less bias from the incomplete hypothesis set, the Markov chain Monte Carlo (MCMC) method can be used to approximate the posterior probability by (2) as:
$$P(h \mid d) \approx \frac{1}{N}\sum_{n=1}^{N} f(h_n = h) \tag{2}$$
where f(·) is 1 if the statement is true and 0 otherwise, h_n is a random sample hypothesis from the Markov chain, and N is the number of samples. As N goes to infinity, we obtain an unbiased approximation of the posterior probability. However, the computing units in human brains are neurons. That is, decision, memory, and perception come from a central nervous system. Even though much research supports that MCMC can also be explained in neuroscience terms as cortical circuits, hypothesis generation in humans can be regarded as the result of a complicated neural network. Actually, all of the hypotheses come from neural computing in human brains. It is therefore possible for us to design deep neural networks to simulate the hypothesis generation process. Referring to the concepts of AE and GANs, we thus propose a neural network architecture to model the hypothesis generation process.
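The sample-based approximation in (2) can be illustrated with a short Python sketch. It assumes a Metropolis sampler with a uniform proposal over a small discrete hypothesis set; the sampler choice, the function name `mcmc_posterior`, and the toy probabilities are assumptions made for illustration, not the procedure of the cited works.

```python
import random


def mcmc_posterior(prior, likelihood, n_samples, seed=0):
    """Estimate P(h|d) as the fraction of chain samples equal to h."""
    rng = random.Random(seed)
    hypotheses = list(prior)
    current = rng.choice(hypotheses)
    counts = {h: 0 for h in hypotheses}
    for _ in range(n_samples):
        # Symmetric proposal: pick any hypothesis uniformly at random.
        proposal = rng.choice(hypotheses)
        # Ratio of unnormalised posteriors; the evidence term cancels.
        ratio = (likelihood[proposal] * prior[proposal]) / (
            likelihood[current] * prior[current]
        )
        if rng.random() < min(1.0, ratio):
            current = proposal
        # Indicator f(h_n = h): count the state the chain is in.
        counts[current] += 1
    return {h: c / n_samples for h, c in counts.items()}


# Hypothetical toy numbers (assumptions, not data from the paper).
prior = {"light_press": 0.5, "medium_press": 0.3, "heavy_press": 0.2}
likelihood = {"light_press": 0.1, "medium_press": 0.4, "heavy_press": 0.7}
estimate = mcmc_posterior(prior, likelihood, n_samples=50000)
```

Only ratios of P(d|h)P(h) are needed, so the sum over the whole hypothesis set never has to be evaluated.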
AE [11], [12] is a type of unsupervised learning method. It compresses an input into a latent vector via an encoder. This latent vector usually represents the important part of the data. The encoder also has the function of nonlinear dimension reduction. After that, the decoder utilizes the latent vector to reconstruct the input data. Comparing the inputs with the outputs, we learn the weights of the encoder and decoder according to the loss function $\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{x}_i)^2$, where x_i is an input and $\hat{x}_i$ is its reconstruction. Fig. 1 shows the schematic diagram of the autoencoder. GANs [16]-[18] are deep neural net architectures for training a generative model via an adversarial process. GANs consist of two nets, i.e., a generator net G and a discriminator net D. The generator G generates samples from a prior noise distribution, and the discriminator D is trained to distinguish whether the samples come from the real data distribution or the generator's distribution. The generator is then trained to compete with the discriminator D by minimizing log(1−D(G(z))), so that the discriminator is unable to distinguish whether the samples are real data or the generator's data.
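The two objectives mentioned above can be written compactly in Python. This is a sketch for illustration only: the inputs are plain lists of numbers, and the function names `reconstruction_mse` and `generator_loss` are hypothetical.

```python
import math


def reconstruction_mse(x, x_hat):
    """Autoencoder loss: mean squared difference between x and x_hat."""
    n = len(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n


def generator_loss(d_of_fake):
    """Average of log(1 - D(G(z))) over discriminator scores in (0, 1)."""
    return sum(math.log(1.0 - d) for d in d_of_fake) / len(d_of_fake)


ae_loss = reconstruction_mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
fooled = generator_loss([0.9, 0.9])      # discriminator rates fakes as real
detected = generator_loss([0.1, 0.1])    # discriminator rates fakes as fake
```

The generator loss is lower when the discriminator assigns high scores to generated samples, which is the direction in which the generator is trained.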

III. HYPOTHESIS GENERATION MODEL
To describe the proposed neuron-based hypothesis generation model, we utilize a virtual robotic system as an application to better illustrate the derivation process. Through the realization of the hypothesis generation process by neural networks, we can construct a cognition model for the robotic system to make hypotheses from historical experiences. Based on the hypothesis generation model, the virtual robot can learn how to write Chinese calligraphy. Instead of using a top-down strategy to learn to write Chinese calligraphy, we utilize a bottom-up strategy to build the cognition architecture and learn with the neural networks. Fig. 2 illustrates the architecture of the proposed hypothesis generation model for application to a robotic system, which consists of two parts, i.e., a hypothesis model and an evaluation model. The hypothesis model makes hypotheses to solve the problem according to the past experiences stored in DNN1. The function of the evaluation model is to judge the hypotheses. The result which the virtual robot observes is stored in DNN2, so that the virtual robot can later recall the result and the historical experiences, judge the previous hypothesis from DNN1, and thereby help DNN1 make a new hypothesis. For instance, when we need the virtual robot to pick a bottle, the hypothesis model produces an action vector as the angles for controlling the motors. Then, we close switch s1 so that the virtual robot can execute the action vector received from DNN1. Next, the evaluation model stores the result and the hypothesis in DNN2 by closing switch s2. If the observed vector O_t does not correspond to "pick a bottle", the hypothesis model needs to make a new hypothesis according to historical experiences. To do so, we connect DNN1 by closing switch s3, which produces the next hypothesis. DNN2, which stores historical experience, helps compute the gradient of the error between the vector m_t and the expected observed vector O*_t to update only DNN1.
This update law is similar to the generator's update in GANs, but this architecture represents a general form for various robotic systems. Through several iterations, we store the best hypothesis according to the optimization criterion min ‖O*_t − m_t‖. Note that we do not need to know the relationship between the action vector and the task "pick a bottle" because the virtual robot will think and learn the concept by itself.
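The update of DNN1 through a fixed DNN2 can be sketched with a deliberately tiny example in Python. Both networks are reduced to single scalar weights (`theta1` and `theta2`), and the input, target, and learning rate are hypothetical; the sketch only illustrates that the gradient flows through the evaluation model while only the hypothesis model is changed, and that the best hypothesis is kept according to the gap between the target and the recalled result.

```python
def update_hypothesis(theta1, theta2, x, target, lr=0.1):
    """One gradient step on DNN1 only, with DNN2 held fixed.

    DNN1 (hypothesis model):  action a = theta1 * x
    DNN2 (evaluation model):  recalled result m = theta2 * a
    Error: (target - m) ** 2, where target plays the role of O*_t.
    """
    a = theta1 * x
    m = theta2 * a
    # Chain rule through the frozen DNN2: d(error) / d(theta1).
    grad_theta1 = -2.0 * (target - m) * theta2 * x
    return theta1 - lr * grad_theta1, abs(target - m)


theta1, theta2 = 0.0, 2.0            # theta2 is never modified below
best_theta1, best_gap = theta1, float("inf")
for _ in range(50):
    new_theta1, gap = update_hypothesis(theta1, theta2, x=1.0, target=3.0)
    if gap < best_gap:               # keep the best hypothesis seen so far
        best_theta1, best_gap = theta1, gap
    theta1 = new_theta1
```

In this toy setting the stored hypothesis converges to the weight for which the recalled result equals the target, without the relationship between action and task ever being specified by hand.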

IV. HYPOTHESIS GENERATION MODEL-BASED CONTROL FOR VIRTUAL ROBOTIC CALLIGRAPHY-WRITING SYSTEM

A. CALLIGRAPHY-WRITING ROBOT WITH HYPOTHESIS GENERATION NET
Chinese calligraphy writing represents a big challenge for a robot if the coordinates are not prescheduled. Even with computer vision, it is still difficult to calculate the 6-axis coordinates [x, y, z, roll, pitch, yaw] for the robot to write Chinese calligraphy. We know the relationship between the 2D coordinates [x, y] and the Chinese calligraphy image from image processing, but the other coordinates [z, roll, pitch, yaw] are still difficult to design. It is therefore of significance to implement the proposed hypothesis generation model, so that a virtual robot can think and learn how to write Chinese calligraphy. To avoid the time-consuming process of learning to write Chinese calligraphy in a real environment, we utilize a virtual robotic system [34], shown in Fig. 3, that we developed to simulate the process of brush writing. To write Chinese calligraphy characters in this paper, a 5-axis reduced form [X, Y, Z, θ, φ] without the spin axis, instead of the 6-axis form [x, y, z, roll, pitch, yaw], is utilized to describe the Cartesian coordinate [34], the rotation angle θ, and the tilt angle φ of the writing brush. This is because the writing brush seldom spins when writing Chinese calligraphy. The vector [X, Y] represents the Cartesian coordinate, and Z represents the vertical axis coordinate, which renders the strokes of Chinese characters thick or thin. The vector [θ, φ] controls the rotation and tilt of the brush, which significantly influences whether the Chinese calligraphy is aesthetic or not. Figs. 3(a) and 3(b) show schematic diagrams of the rotation θ and tilt φ, respectively, of a writing brush. Figs. 3(c) and 3(d) show schematic diagrams which illustrate the brush writing a character according to the coordinate [X, Y, Z, θ, φ] in the simulation environment.

B. CALLIGRAPHY NET MODEL
The architecture of the hypothesis generation model for a robotic calligraphy-writing system is shown in Fig. 4. Firstly, we utilize the fast thinning algorithm [35] to extract the skeleton of the strokes of Chinese characters from a human writing sample. Next, we split the original image into several region of interest (ROI) images in accordance with the trajectory of the stroke. The number of ROI images is chosen to be the number of skeleton points. The coordinates [X, Y] corresponding to every ROI during the writing process are given by the coordinates of the stroke. On the other hand, the coordinates [Z, θ, φ] (vertical coordinate, rotation angle, and tilt angle of the brush) corresponding to each ROI image are obtained by training the Writer Net. By using the coordinates [X, Y] and [Z, θ, φ], the writing results can be observed through the virtual robotic system. Then, we train the Estimator Net on the simulated image written by the virtual robot, so that it memorizes and recognizes the result from the virtual robotic system. Following that, we connect the Writer Net and the Estimator Net as the hypothesis generation net and lock the Estimator Net to train the Writer Net to minimize the loss between the original image and the image memorized by the Estimator Net. The learning process continues by alternating between k1 iterations for optimizing the Estimator Net and k2 iterations for optimizing the Writer Net. This training pattern is repeated until the simulated image becomes very close to the original image, which indicates that the robotic system has learned to perform better actions. Through the interaction between the Estimator Net and the Writer Net, both progress simultaneously to accomplish the hypothesis generation process. Therefore, we can obtain more accurate coordinates to write Chinese calligraphy. The loss functions of the Writer Net and the Estimator Net are, respectively, mean square errors of the form:
$$\mathrm{loss}_{\theta_W} = \frac{1}{l}\sum_{i=1}^{l}\left\lVert E\big([X_i, Y_i],\, W(R_i)\big) - R_i \right\rVert^2$$
$$\mathrm{loss}_{\theta_E} = \frac{1}{l}\sum_{i=1}^{l}\left\lVert E\big([X_i, Y_i],\, W(R_i)\big) - S\big([X_i, Y_i],\, W(R_i)\big) \right\rVert^2$$
where R_i is the i-th ROI, l is the length of the trajectory of the strokes, and [X_i, Y_i] is the i-th skeleton point in the order given by C(·), a function which sorts skeleton data according to the writing direction.
The function W(·) is the proposed Writer Net, which outputs the 3-dimensional coordinates [Z, θ, φ] (vertical coordinate, rotation angle, and tilt angle of the brush) according to the ROI image. The function S(·) is the virtual robotic system [34], which outputs the writing result according to the coordinates [X, Y, Z, θ, φ]. E(·) is the proposed Estimator Net, which outputs an image according to the coordinates [X, Y, Z, θ, φ]. We utilize the mean square error (MSE) to measure the performance of the writing result. To fit the Estimator Net E(·) to the virtual robotic system S(·), we define loss_{θ_E} as the mean square error between E(·) and S(·). Note that the Writer Net needs to produce calligraphy results as close as possible to the human writing sample. Thus, we define loss_{θ_W} as the mean square error between E(·) and the ROI R. The Estimator Net and the Writer Net are then updated by minimizing these loss functions.
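The ROI splitting step described above can be sketched in Python as follows. The sketch assumes that each ROI is a square window centred on one skeleton point and zero-padded at the image border; the centring, the padding, and the names `extract_rois`, `image`, and `skeleton_points` are illustrative assumptions. The default window size of 20 follows the 20 × 20 Writer Net input stated in this paper.

```python
def extract_rois(image, skeleton_points, size=20):
    """Crop one size x size window per skeleton point, zero-padded at borders.

    image:            2D list of grey levels, indexed as image[y][x]
    skeleton_points:  list of (x, y) points already ordered along the stroke
    """
    height, width = len(image), len(image[0])
    half = size // 2
    rois = []
    for cx, cy in skeleton_points:
        roi = []
        for y in range(cy - half, cy - half + size):
            row = []
            for x in range(cx - half, cx - half + size):
                inside = 0 <= y < height and 0 <= x < width
                row.append(image[y][x] if inside else 0)
            roi.append(row)
        rois.append(roi)
    return rois


# Hypothetical 200 x 200 image with a single bright pixel at (x=5, y=7).
image = [[0] * 200 for _ in range(200)]
image[7][5] = 255
rois = extract_rois(image, [(5, 7), (100, 100)])
```

One ROI is produced per skeleton point, so the number of ROI images equals the number of skeleton points, as stated above.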
To help readers better understand the implementation, pseudocode is included below to illustrate the overall process of the proposed hypothesis generation model.

Algorithm 1 Training Process of Hypothesis Generation Net
Input: Human writing sample, R
Output: A set of coordinate points, d
Initialisation: Writer Net and Estimator Net with random weights
repeat
    Step 1:
        Input R to Writer Net to produce a set of coordinates d
    Step 2:
        Virtual robot writes calligraphy according to d in the simulation environment
    Step 3:
        for i = 0 to k1 do
            Update Estimator Net to approximate the writing result by the virtual robot
        end for
    Step 4:
        for i = 0 to k2 do
            Update Writer Net so that the output of Estimator Net approximates R
        end for
until stop button is pressed
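The alternating structure of Algorithm 1 can be illustrated with a toy Python sketch. Everything here is a simplifying assumption made for illustration: the Writer Net and the Estimator Net are each reduced to one scalar weight (`w` and `u`), the virtual robot is replaced by the stand-in function `simulate`, the ROIs are scalars, and a fixed number of rounds replaces the stop button.

```python
def simulate(d):
    """Stand-in for the virtual robot S(.); its form is unknown to both nets."""
    return 2.0 * d


def train(R, rounds=100, k1=5, k2=5, lr=0.5):
    w = 1.0   # Writer Net reduced to d = w * r
    u = 0.5   # Estimator Net reduced to image = u * d
    n = len(R)
    for _ in range(rounds):                   # stands in for "until stop button"
        d = [w * r for r in R]                # Step 1: Writer proposes coordinates
        written = [simulate(x) for x in d]    # Step 2: virtual robot writes
        for _ in range(k1):                   # Step 3: fit Estimator to the robot
            grad_u = sum(2.0 * (u * x - s) * x for x, s in zip(d, written)) / n
            u -= lr * grad_u
        for _ in range(k2):                   # Step 4: Writer via frozen Estimator
            grad_w = sum(2.0 * (u * w * r - r) * u * r for r in R) / n
            w -= lr * grad_w
    return w, u


R = [0.2, 0.5, 0.8]   # hypothetical scalar "writing samples"
w, u = train(R)
```

In Step 4 the Writer never queries `simulate` directly; it improves only through the Estimator's learned approximation, which is why the Estimator must be refitted in Step 3 after the Writer's outputs change.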

C. WRITER NET AND ESTIMATOR NET
The detailed architecture of the proposed Writer Net is shown in Table 1, which consists of eleven layers with weights. Writing samples as input to the Writer Net are 20 × 20 grey-scale images. All the convolutional layers have 3 × 3 filters and ReLU activation. Downsampling is performed after the convolutional layers by a max pooling layer with a stride of 2. When the previous layer is a max pooling layer, the number of feature maps is doubled to extract features from the higher dimensional data input. The dropout rate is set to fifty percent. The LSTM and RNN layers in Table 1 are used because our input writing samples are the ROI images of the stroke image. These ROI images are related to each other since the writing process is continuous. Figs. 5(a) and 5(b) show the right-falling stroke and its trajectory, respectively. Fig. 5(c) shows the ROI blocks along the trajectory of the stroke, which demonstrates that every image has a strong relation and causality with the adjacent images. Furthermore, the coordinates [Z, θ, φ] (vertical coordinate, rotation angle, and tilt angle of the brush) mapped to each image should change smoothly. The angles of the brush cannot change drastically if the states are close. Hence, LSTM and RNN are utilized to suppress the variation of [Z, θ, φ]. The architecture of the proposed Estimator Net is shown in Table 2, which consists of fourteen layers with weights. The input vectors are the 3-dimensional coordinates [Z, θ, φ]. The convolutional layers also have 3 × 3 filters and ReLU activation. The transpose convolutional layers are utilized to upscale with a stride of 2. The dropout rate is also set to fifty percent. Then, two fully-connected layers are utilized to extract features into a final output of 400 nodes, which is reshaped to obtain 20 × 20 images.
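Two of the shape manipulations mentioned above, the reshaping of a 400-node output into a 20 × 20 image and downsampling by max pooling with a stride of 2, can be sketched in plain Python. The pooling window is assumed to equal the stride (2 × 2), which is not stated in this paper, and the function names are hypothetical.

```python
def reshape_to_image(flat, height=20, width=20):
    """Reshape a flat list of height * width values into rows of an image."""
    if len(flat) != height * width:
        raise ValueError("expected %d values" % (height * width))
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def max_pool(image, stride=2):
    """Max pooling with a stride x stride window (window size is assumed)."""
    h, w = len(image), len(image[0])
    return [
        [
            max(image[y + dy][x + dx]
                for dy in range(stride) for dx in range(stride))
            for x in range(0, w - stride + 1, stride)
        ]
        for y in range(0, h - stride + 1, stride)
    ]


image = reshape_to_image(list(range(400)))   # 400 nodes -> 20 x 20
pooled = max_pool(image)                     # 20 x 20 -> 10 x 10
```

Each pooling layer halves both spatial dimensions, which is why the number of feature maps is doubled afterwards to retain capacity.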

V. SIMULATION RESULTS
We conduct our experiments on an Intel Xeon E3-1246 v6 CPU at 3.70 GHz and an NVIDIA GeForce GTX 1080 Ti with 32 GB of memory. To avoid spending too much time training the robotic arm to write Chinese calligraphy, we build a robotic simulation environment, shown in Fig. 6, for our virtual robot to simulate the process of Chinese calligraphy writing [34], [36]. As shown in Fig. 6, the left picture box "InputPicture" shows the stroke of a Chinese character written by a human. The middle picture box "paper" shows the writing result of the virtual Chinese calligraphy-writing robot. The picture box "angle" shows the current state of the brush. The current 5-axis coordinates are also shown on the right side of Fig. 6. Except for InputPicture, all the other boxes update their status simultaneously when the simulation environment receives the output of the Writer Net. The image of the Chinese calligraphy stroke captured by a webcam has a size of 200 × 200. We then convert the image into a grey-scale image as the input. The experiment is conducted under Python 3.6 with the Keras library on a TensorFlow backend and the NVIDIA CUDA 9.0 library for parallel computation. The mean square error (MSE) is utilized to measure the performance of the hypothesis generation net. We utilize root mean square propagation (RMSProp) [37] as the optimizer. Fig. 7(a) shows the eight ideal strokes of the Chinese character 'yong' (永). Figs. 7(b) and 7(c) show the training process of the eight strokes by the Writer Net and the Estimator Net, respectively. The images shown in Fig. 7(b) are drawn from the coordinates predicted by the Writer Net. Through the simulation system, the Writer Net emulates an image similar to what the robotic arm could draw. Fig. 7(c) shows the images from the Estimator Net according to the coordinates provided by the Writer Net. In the beginning, the images generated by the Estimator Net from these coordinates are far different from those written by the Writer Net.
Gradually, the results of the Estimator Net become more and more similar to the Chinese character written by the Writer Net. Therefore, the coordinates produced by the Writer Net become more and more similar to the ideal target, and this process simulates the human learning process. Firstly, a human generates a behavior based on the learning task, as the Writer Net does. Secondly, the Writer Net tries to imitate the human's writing. Next, the result is stored in the memory. Then, the Writer Net interacts with the Estimator Net to analyze this behavior, similar to what the hypothesis generation net does. The next time the same action is performed, the human will retrieve the experience from the previous time and perform a better action.
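For reference, the RMSProp update used as the optimizer in these experiments can be sketched for a single scalar parameter in Python. The learning rate, decay factor, and epsilon below are hypothetical values chosen for illustration, since the hyperparameters used in the experiments are not stated here, and the quadratic objective is a toy stand-in for the MSE loss.

```python
import math


def rmsprop_step(theta, grad, cache, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter."""
    # Exponential moving average of the squared gradient.
    cache = rho * cache + (1.0 - rho) * grad * grad
    # Scale the step by the root of that running average.
    theta = theta - lr * grad / (math.sqrt(cache) + eps)
    return theta, cache


theta, cache = 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * (theta - 3.0)    # gradient of the squared error (theta - 3)^2
    theta, cache = rmsprop_step(theta, grad, cache)
```

Because the step is normalized by the running gradient magnitude, the parameter settles in a small neighbourhood of the minimizer whose width is governed by the learning rate.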
Combining some strokes, we are able to form a complete Chinese character. The writing results of the Chinese character 'yong' (永) are shown in Figs. 8(b)-8(e). Based on these strokes, four Chinese characters are written, i.e., yong (永, means permanence), tsun (寸, means inch), da (大, means big), and jiang (江, means river). Figs. 9(a)-9(d) show the images of the human writing samples, and Figs. 9(e)-9(h) show the simulation results written by the virtual robot.

Remark 1:
In the training process shown in Figs. 7 and 8, we found that there exist some variations in writing 'yong' at the 100th generation. The training process may overcorrect the writing because the Estimator Net probably forgets some of the past information if the model does not recall the past experiences for a long time. Then, the Writer Net makes a new hypothesis according to the Estimator Net, which has forgotten the past behaviors. That is, the Writer Net sometimes makes the same mistake because of forgotten memories. For example, the loss curve of the eighth stroke in the training process in Fig. 10 shows an undesired oscillating phenomenon. In the future, we plan to investigate memory systems that give the hypothesis generation net a deeper impression of significant experiences, so that the performance of writing Chinese calligraphy can be improved.

VI. CONCLUSION
This paper presents a novel hypothesis generation model for a virtual robotic system to learn to write Chinese calligraphy through thinking and practicing according to a human writing sample. The proposed model has three main parts, i.e., the Writer Net, the Estimator Net, and the hypothesis generation net. These three models represent human action, memory storage, and judgment. By simulating the human psychological process, the hypothesis generation net has the ability to think and achieve some actions by itself. In our experiments, we design a Writer Net with 11 layers and an Estimator Net with 14 layers. The outputs of the hypothesis generation net are a series of coordinates for the virtual robot. Consequently, this cognitive framework is able to accomplish unsupervised image-to-action translation. The simulation results presented in this paper confirm the effectiveness and feasibility of the proposed hypothesis generation model in learning how to write Chinese calligraphy.
MIN-JIE HSU was born in Taipei, Taiwan, in 1993. He received the B.S. degree in electrical engineering from National Taiwan Normal University, Taipei, in 2015, where he is currently pursuing the Ph.D. degree with the Department of Electrical Engineering. His research interests include artificial intelligence, fuzzy logic systems, neural networks, and reinforcement learning.