A Hierarchical Computational Model Inspired by the Behavioral Control in the Primate Brain

The basic cognitive architecture of the brain is still unknown. However, scientists have found evidence for the existence of distinct behavioral control systems shared by humans and nonhumans. Inspired by the problem-solving systems of behavioral control in the primate brain, we present a hierarchical computational model. We focus on the integrative performance of brain substructures, each of which is represented by a problem solver that is in turn modeled by a specific algorithm. The levels of brain substructures, and the corresponding algorithms, are hierarchically organized in both structure and function, including how and when higher-order solvers control lower-order ones. The problem solvers share the same slice of working memory. We claim this as a novelty because most existing brain models emphasize neural network structure, even though the neuron dynamics of the brain remain very controversial. We compare the model's performance with that of three other computational models on a challenging foraging problem. Agents are examined in foraging environments of different sizes and/or with transparent barriers. The experimental results show that our model performed best in most scenarios. Further, the results suggest that the virtues of the primate brain lie not only in the heights of thinking it can reach, but also in its range and versatility.


I. INTRODUCTION
Recently, more and more robotics research has been inspired by anatomical and psychological studies. Scientists have found that two neural systems control behavior in mammals. One is a model-based goal-directed system based in the ventromedial prefrontal cortex (vmPFC), and the other is a model-free habit system based in the striatum [1], [2]. It has been argued that a model-based decision-making system has the potential for high-order cognitive functions such as mental simulation, planning, and reasoning, which usually lead to better solutions to many problems. However, research shows that pure model-based systems are notoriously brittle and therefore often break under real-world conditions due to either inaccuracy of the model itself or uncertainty of the real world [3]. A model-free habit system usually relies on a stimulus-reward mechanism, where a positive or negative reward follows certain stimuli. Experiments have shown that this simplicity is very efficient for dealing with uncertainty. That is, falling back to a model-free habit system is one remedy for the brittleness of a model-based system. Another approach is to have a better system that can fix the models when they fail, enabling it to solve these harder problems. In this paper, a problem whose features can be properly modeled by a model-based problem-solving system is called an apparent problem. Otherwise, the problem is called a non-apparent problem, i.e., one that breaks the model and requires an individual to infer a hidden cause and create a non-perceptual concept to model it when confronting an unknown event or consequence.
The associate editor coordinating the review of this manuscript and approving it for publication was Stavros Souravlas.
In primates, evidence shows that there is a distinct region in the prefrontal cortex (PFC), called the granular PFC. Some researchers believe that the granular PFC enables primates to perform unconventional behaviors, such as looking away from a salient visual stimulus when necessary [4]. In accordance with this view, we hypothesize that the granular PFC is the basis for solving non-apparent problems. Moreover, we believe that a detailed analysis of the region will shed light on the mechanisms that underpin creative problem solving in people.
To analyze apparent versus non-apparent problem solving, we first focus on the classic detour problem in the literature [5], where subjects must circumvent a barrier to obtain a reward item. Most research [6], [7] has focused on scenarios with opaque barriers as obstacles, showing that agents can solve the problem by taking paths away from a goal item in order to reach it. However, the detour problem is found to be extremely challenging when the barrier is transparent. As shown in Fig. 1, many nonhuman animals and human infants have difficulty solving the problem, as they repeatedly attempt to reach directly for the reward item, even in the face of strong negative feedback [8]. Psychologists have tended to explain this insensitivity to negative feedback as an inability to inhibit a lower-level behavioral control system, for example, a Pavlovian system [2]. Interestingly, when first given experience with an opaque barrier, nonhuman primates succeed in solving the transparent barrier problem, which suggests that the major difficulty does not stem from a lack of self-control but rather from knowing when and how to activate it. Furthermore, experiments on rhesus monkeys show that they lose the ability to solve the transparent detour problem after lateral PFC lesions [9], which in turn supports the view that the granular PFC is the key to non-apparent problems.
In the transparent-barrier experiments, subjects fail because they do not readily see the transparent barrier. Although the feedback is negative, there is no apparent reason for it, and so they continue to attempt the most efficient route to the goal item. This is an example of a problem-solving system that sees a clear solution and therefore overrides contrary feedback. We therefore conclude that the transparent obstacle detour problem requires a non-apparent solution: the problem-solving system must reformulate the problem to include the transparent obstacle by inferring it from the effect of being blocked instead of seeing the obstacle directly. We believe that many mammals do not have a mechanism to solve such a non-apparent problem. In this article, we use the detour problem with a transparent barrier as an example of a non-apparent problem.
Inspired by the hierarchical behavioral control system of primates, this article proposes a hierarchical computational model to solve hard and non-apparent problems. The model consists of four basic levels of behavioral control in primates. The first level is a problem-solving system based in the hypothalamus, the first main system in the vertebrate brain. It is responsible for attaining the goal once the goal is perceived, and thus it is not explicitly modeled in our computational model. The second level is a model-free problem-solving system based in the striatum, located in front of the thalamus. This system is essentially an action-selection mechanism, and we model it as a reinforcement learning system [10], consistent with other research [11], [12]. The third level is a model-based goal-directed problem-solving system, which solves apparent problems by using a model built from well-defined environments. The fourth level is another model-based problem-solving system, which solves non-apparent problems. This highest level is able to fix the model of the third level by inferring from the effect of non-apparent feedback, which usually is not directly perceived. The four levels are hierarchically organized, and our initial results were published in a conference paper [13]. This article extends that work in three ways. First, the model is multi-level and hierarchical. The hierarchy lies in different levels of abstraction of states, so the extended work scales to much larger problems. Second, a network among levels is presented theoretically and studied experimentally for the first time, showing how and when higher levels control lower levels. Last, transparent obstacles are included in the experiments, and many more experiments have been conducted to show the advantages of the model compared with others.
To clarify, direct modeling of neural dynamics in the brain is beyond the scope of this work.
In the literature, there are many brain-inspired computational systems. In [14], a hierarchical model inspired by the rodent medial prefrontal cortex was developed. The model studies how the anterior cingulate cortex (ACC) determines and motivates which tasks to perform. The model also reveals that a patient with an ACC lesion is less interested in engaging in creative activities. The model is essentially implemented in a hierarchical reinforcement learning framework and uses three levels of abstraction of choices. By comparison, our method models more high-order cognitive functions, such as planning and reasoning, which gives it the potential to solve more difficult problems. In [15], researchers present an agent-based computational framework in which a specific brain area is modeled by an autonomous agent that mimics the special features of the area. Each agent is further modeled by a neural network system, and a hierarchical co-evolutionary method is used to train the agents. Other researchers [16] developed a hierarchical multi-timescale recurrent neural network model to study how higher-order cognitive mechanisms may emerge. Both of these models are studied at the level of neurons. In contrast, we study brain cognition at a functional level, because the neural dynamics and the pathways among brain areas remain controversial. In [17], [18], researchers reviewed neurocomputational models of working memory and concluded that computational models are very helpful for exploring various cognitive mechanisms.
The rest of the paper is organized as follows. In Section II we formally propose the computational model, detailing its levels, the connections among levels, and the hierarchy. In Section III we present experimental scenarios for evaluating the performance of the model. Discussion and conclusions are given in the last section.

II. COMPUTATIONAL MODEL
We present a four-level computational model in which each level can be viewed as an independent decision-making system. The four levels represent low to high cognition in the brain. Low-level systems normally need little knowledge of the world to make a decision, and thus they are fast. Higher-level systems require much more knowledge and are likely to find better solutions. The performance of the model is evaluated on a foraging problem in a 2D grid world, where a test agent must find a path from its start location to a goal location, as illustrated in Fig. 1. The model and the switching mechanisms among levels are described in detail below.
A. MODEL DESCRIPTION
Level 1 enables actual goal attainment once the goal is in view. For foraging, it represents the act of food consumption. Thus, it is not explicitly modeled in the scenarios. In the brain, it is the first behavioral control system, based in the hypothalamus.
Level 2 represents a model-free problem-solving system based in the striatum. Level 2 is modeled using a Markov Decision Process and Reinforcement Learning (RL) framework [11]. The world is represented by a set of states $S$, where $s_t \in S$ is the agent's state at time step $t$. An agent in the world selects an action $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions available in state $s_t$. As a result, at the next time step $t + 1$ the agent receives an immediate reward $r_{t+1}$ and transitions to a new state $s_{t+1}$. A mapping from a state $s_t$ to each possible action $a_t$ is called the agent's policy, $\pi(s_t, a_t)$, representing the probability of taking action $a_t$ when in state $s_t$. The agent seeks an optimal policy $\pi^*$ that leads to the maximum expected discounted future reward. For the foraging problem described above, $(x, y)$ coordinates represent states of the world. In every state except the goal state, there are four actions: move up, move down, move left, and move right. We must point out that the agent may not be able to identify the coordinate. Level 2 uses the following Q-learning algorithm [19]:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{1}$$
where $Q(s_t, a_t)$ is the learned action-value function of taking action $a_t$ when in state $s_t$; $\alpha \in [0, 1]$ is the learning rate (the higher the value of $\alpha$, the faster the agent learns; however, a larger $\alpha$ may lead to a suboptimal solution); $\gamma \in [0, 1]$ is the discount rate, which determines the present value of future rewards (if $\gamma = 0$, the agent is only concerned with maximizing immediate rewards; as $\gamma$ approaches 1, the agent weighs future rewards more strongly, i.e., it becomes more farsighted); and $\max_{a} Q(s_{t+1}, a)$ is the maximum Q value over the actions $a$ available in the next state $s_{t+1}$. This one-step Q-learning approximates the optimal action-value function $Q^*$, independent of the policy used. A softmax action selection approach, the Boltzmann distribution, is used to balance exploration and exploitation in Q-learning.
It chooses action $a$ at time $t$ with probability
$$P(a \mid s_t) = \frac{e^{Q(s_t, a)/\tau}}{\sum_{b \in A(s_t)} e^{Q(s_t, b)/\tau}} \tag{2}$$
where $\tau$ is a positive parameter called the temperature.
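As a rough illustration of the Level 2 update and action selection described above, the following minimal Python sketch implements one-step Q-learning with Boltzmann (softmax) action selection on a tabular Q function. The names `make_q_table`, `boltzmann_probs`, `select_action`, and `q_update` are hypothetical and not taken from the paper, and the dictionary-based table is an assumption about storage, not the authors' implementation.

```python
import math
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]  # the four moves named in the text


def make_q_table():
    """Tabular Q function keyed by (state, action); unseen entries start at 0."""
    return defaultdict(float)


def boltzmann_probs(q_values, tau):
    """Boltzmann (softmax) probabilities over a list of Q values, tau > 0."""
    m = max(q_values)  # subtracting the max avoids overflow; probabilities are unchanged
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(Q, state, tau, rng=random):
    """Sample an action in `state` according to the Boltzmann distribution."""
    probs = boltzmann_probs([Q[(state, a)] for a in ACTIONS], tau)
    return rng.choices(ACTIONS, weights=probs, k=1)[0]


def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One-step Q-learning update; returns the new value of Q(s, a)."""
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]
```

A high temperature `tau` makes the four actions nearly equiprobable (exploration), while a low `tau` concentrates probability on the action with the largest Q value (exploitation).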
After the learning procedure converges, the optimal policy is obtained by following the sequence of maximum $Q^*$ values. Thus, rather than having an explicit model of the world, i.e., an understanding of how the states relate to each other in the grid world, Level 2 sees the states independently and makes decisions mainly based on the action values $Q(s, a)$ at each state.
Level 3 represents model-based apparent problem solving. In a grid world, apparent problems are those in which information about the world is well defined and can be directly perceived without confusion. For any novel problem in the grid world, the problem solver cannot see the entire problem immediately, because the world is too large, and so a cognitive model must be developed via initial experience with each state. The problems are treated as multiagent problems or stochastic games, in which there are three types of agents: self, others, and goals [20], [21]. Formally, the multiagent problem is defined as a 4-tuple $\{S, A, P, R\}$. $S$ is the set of states of the world; $A$ is the finite set of actions of agent $i$; $P(s' \mid s, a_i, a_{-i})$ is the state transition probability function, i.e., the probability of moving from state $s$ to state $s'$ when agent $i$ takes action $a_i$ and all other agents take actions $a_{-i}$; and $R$ is the reward function for agent $i$. This model thus has a clear understanding of the relationships among the states, which Level 2 cannot see. For the foraging problem in the grid world, the cognitive model consists of four components: (1) the $(x, y)$ coordinates of the grid world that can be perceived and identified by the agent, which represent states of the world; (2) the set of available actions; (3) the state transition probability; and (4) the identification of apparent obstacles. In the current study, obstacles are static and have only one action: blocking. As stated in the previous section, Level 3 can only see opaque obstacles.
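Level 3 uses a Win-or-Learn-Fast Policy Hill-Climbing (WoLF-PHC) algorithm that updates Q-functions with the Q-learning rule as in Eq. (1) and policies with a WoLF rule [22]-[24] as follows:
$$\pi_{i,t+1}(s_t, a) = \pi_{i,t}(s_t, a) + \begin{cases} \delta_{i,t} & \text{if } a = \arg\max_{a'} Q(s_t, a') \\ -\dfrac{\delta_{i,t}}{|A| - 1} & \text{otherwise} \end{cases} \tag{3}$$
where $|A|$ denotes set cardinality, and $\delta_{i,t}$ denotes the learning rate, which is updated as follows:
$$\delta_{i,t} = \begin{cases} \delta_w & \text{if winning} \\ \delta_l & \text{if losing} \end{cases} \tag{4}$$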
The agent is winning if $\pi_{i,t}(s_t, a_t) > \bar{\pi}(s_t, a)$ and losing otherwise, where $\bar{\pi}(s_t, a)$ is the average policy of agent $i$ at state $s_t$, and $\delta_l > \delta_w$. The variable learning rate allows the agent to learn quickly when losing.
Level 4 represents the system that attempts to find non-apparent solutions. In the current study, a hidden cause, i.e., a transparent obstacle, blocks the direct path to the goal. Such an obstacle is literally non-apparent to Level 3. A non-apparent problem can be generalized as a partially observable Markov decision process (POMDP) [25]. Like Level 3, Level 4 builds the cognitive model of the world via the agent's experience with the world. Unlike Level 3, it learns hidden causes by inference from the feedback of the environment. For example, when the agent is blocked by an invisible obstacle, Level 4 uses this information conceptually, inferring that there is an "agent" doing the blocking. In this way, it can build a fairly complete cognitive model of the world. Level 4 currently uses the planning algorithm A* (called "A star") to find an optimal path around the obstacles to the goal [26]. In this paper, the words non-apparent, transparent, and invisible are used interchangeably.
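The Level 4 idea of inferring an unseen obstacle from the effect of being blocked, and then replanning with A*, can be sketched as follows. This is only an illustrative sketch under stated assumptions: a 4-connected grid with unit step costs, a Manhattan-distance heuristic, and a hypothetical helper `infer_hidden_obstacle` that simply adds the cell where a move failed to the set of known obstacles. None of these details are specified in the paper.

```python
import heapq


def a_star(start, goal, blocked, width, height):
    """Shortest 4-connected path on a width x height grid, or None if unreachable."""
    def h(cell):
        # Manhattan distance: admissible and consistent for unit-cost 4-way moves
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            path = []
            while cur is not None:  # walk parents back to the start
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale queue entry; a cheaper route was already found
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if nxt in blocked:
                continue
            new_g = g + 1
            if new_g < cost.get(nxt, float("inf")):
                cost[nxt] = new_g
                came_from[nxt] = cur
                heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt))
    return None


def infer_hidden_obstacle(known_blocked, attempted_cell):
    """Hypothetical helper: a move into `attempted_cell` failed although nothing
    was visible there, so treat that cell as an inferred (invisible) obstacle."""
    return set(known_blocked) | {attempted_cell}
```

For example, if the agent at (0, 0) is blocked when stepping toward a goal at (2, 0), calling `infer_hidden_obstacle(set(), (1, 0))` and replanning yields a detour through the row above instead of repeated attempts along the direct path.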

B. LEVEL SWITCHING MECHANISM
The four levels work concurrently and cooperatively. Fig. 2 shows the connections among the levels. In the overall system, control flows from top to bottom: a higher level can inhibit a lower level. Level 4 monitors Level 3 to determine whether it needs to take control; likewise, Level 3 monitors Level 2. Level 2 acts as the actor, interacting with the world and, as a result, updating the action-value function $Q(s, a)$ under a certain policy. The Q values are the working memory of the overall model. When Level 3 or 4 needs to take over the system, the working memory is updated on the fly as a consequence. After that, control is returned to Level 2 and the updated Q values are used. Q-learning in Level 2 is able to converge to an optimal solution for a single-agent MDP; however, it is not guaranteed to converge in multiagent cases, because each state is stochastic and determined by the joint actions of the agents. As shown in Fig. 2, Level 3 monitors the 'overflow' threshold, which accumulates when other agents' actions prevent the current agent from converging to the optimal policy. Whether other agents' actions confuse the current agent can be told by Eq. (5):
$$n_{s_t} \leftarrow n_{s_t} + 1 \quad \text{if } r_{t+1}(s_t, a_t) \neq r_t(s_t, a_t) \tag{5}$$
where $n_{s_t}$ denotes the number of confusions; $r_{t+1}(s_t, a_t)$ denotes the reward the current agent receives at time $t + 1$ for taking action $a_t$ in state $s_t$; and $r_t(s_t, a_t)$ is the reward at time $t$. Level 3 takes over control from Level 2 if Eq. (6) is satisfied:
$$\sum n_{s_t} \geq n_T \tag{6}$$
where $\sum n_{s_t}$ denotes the number of states confused and $n_T$ denotes the value of the threshold. When non-apparent obstacles appear, Level 3 gets confused. Level 4 takes control from Level 3 just as Level 3 takes control from Level 2. More specifically, Level 3 gets confused when it is trapped by non-apparent obstacles or local minima [27] that prevent the agent from finding an effective path to the goal. The overall algorithm is presented in Algorithm 1.

Algorithm 1 Algorithm of Our Four-Level Computational Model
Input: learning rates: $\alpha$, $\delta_w$, $\delta_l$; thresholds: $n_{T_2}$, $n_{T_3}$; initialize $Q(s, a) \leftarrow 0$, $\forall s$, $n_{s_t} \leftarrow 0$
Output: optimal policy: $\pi^*(s, a)$
1 build or update a map of the world;
13: run A* and update $Q(s, a)$ on the path;
14: switch to level 2: $i_{lv} = 2$;
15: until ($s_{goal}$ is reached)
16: Execute level 1.
17: procedure Check-if-overflow($n_{T_j}$, $n_{level}$)
18:   $i_{lv} = 2$;
19:   $n_T = n_{T_j}$;
20:   if $r_{t+1}(s_t, a_t) \neq r_t(s_t, a_t)$ then $n_{s_t} = n_{s_t} + 1$
21:   end if
22:   if $\sum n_{s_t} \geq n_T$ then
23:     $i_{lv} = n_{level} + 1$;
24:     reset: $\forall s$, $n_{s_t} \leftarrow 0$;
25:   end if
26:   return $i_{lv}$
27: end procedure
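The Check-if-overflow procedure can be sketched in Python roughly as below. This is a hedged reading, not the authors' code: it assumes that a state counts as "confused" when the reward observed for the same state-action pair differs from the reward observed the previous time (an inequality test), and it splits the procedure into a hypothetical `observe` step that accumulates the counts $n_{s_t}$ and a `check_overflow` step that compares their sum with the threshold. The class name `OverflowMonitor` is likewise an assumption.

```python
class OverflowMonitor:
    """Tracks per-state confusion counts and decides which level runs next."""

    def __init__(self):
        self.confusion = {}    # n_{s_t}: confusion count per state
        self.last_reward = {}  # most recent reward seen per (state, action)

    def observe(self, state, action, reward):
        """Count a confusion when the same (state, action) yields a different
        reward than last time (assumed reading of the reward comparison)."""
        key = (state, action)
        if key in self.last_reward and self.last_reward[key] != reward:
            self.confusion[state] = self.confusion.get(state, 0) + 1
        self.last_reward[key] = reward

    def check_overflow(self, threshold, level):
        """Return level + 1 if the summed confusion reaches the threshold
        (and reset the counts); otherwise control stays with level 2."""
        if sum(self.confusion.values()) >= threshold:
            self.confusion.clear()
            return level + 1
        return 2
```

With this sketch, calling `check_overflow(n_T2, 2)` would hand control to Level 3, and `check_overflow(n_T3, 3)` to Level 4, once enough confusion has accumulated; after the higher level finishes, control returns to Level 2.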

C. HIERARCHY
In a large grid world, the number of states can be huge. The action space also contributes to learning complexity: for an action space of size $m$, the number of state-action pairs is $m$ times the number of states. An efficient way to reduce computational complexity is to represent states abstractly. For the foraging problem, state space abstraction using cell decomposition [28], [29] is used, and Level 4 runs on the reduced state space. Another hierarchy lies in the levels of action abstraction. As shown in Fig. 3, metaoption selection is implemented in the prefrontal cortex. An example metaoption in the foraging environment is to go to a certain space extracted according to the state space abstraction rules. In the neocortex, option selection is made so as to maintain the higher-level metaoption selection. For example, going to space 1 may require the option of going around a certain obstacle. In the striatum, primitive actions specific to the option are selected. Completing the option of going around obstacles requires actions such as going right, going up, and so on. Perception is also hierarchically processed in the brain [30], [31]. Perception signals are processed in low-level brain areas to extract low-level features such as dots and lines, which are inputs to the models of Levels 1 and 2. High-order brain areas are able to extract high-order non-apparent signals that are inputs to Level 4.
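To make the effect of state abstraction concrete, the sketch below maps fine grid coordinates to coarse cells and counts state-action pairs before and after abstraction. It assumes a uniform square decomposition with a hypothetical `block` size; the paper cites cell decomposition [28], [29] but does not state the exact rule, so this is only one simple instance of the idea.

```python
def abstract_state(x, y, block):
    """Map a fine grid coordinate to the coarse cell that contains it
    (uniform block decomposition, assumed for illustration)."""
    return (x // block, y // block)


def num_state_action_pairs(width, height, n_actions, block=1):
    """Number of state-action pairs when the grid is tiled by block x block cells."""
    cols = -(-width // block)   # ceiling division
    rows = -(-height // block)
    return cols * rows * n_actions
```

For the 40 × 40 world with four actions, the fine representation has 6400 state-action pairs, whereas a 4 × 4 block abstraction would leave 400 for the planner to consider.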

III. EXPERIMENTAL RESULTS
We compared our four-level model to the following models:
1) Model 1: Levels 1 & 2
2) Model 2: Levels 1, 2, & 3
3) Our model, Model 3: Levels 1, 2, 3, & 4, that is, Model 2 plus Level 4
All models assume the existence of Level 1. Model 1 simply uses model-free reinforcement learning, probably representing an ancestral vertebrate. Model 2 combines model-free RL with the ability to solve apparent problems, perhaps representing the ancestral mammalian brain. Model 3 is our four-level model of the primate brain. The models are simulated and compared in grid worlds as shown in Fig. 1. We examined the effects of (1) grid world size, (2) number of obstacles, and (3) invisible obstacles. Two measures were used: (1) the cumulative number of steps to reach the goal, $N_r$, and (2) the cumulative computational cost, $C_r$. They were combined into the overall performance score $P = k_1 N_r + k_2 C_r$, where $k_1$ and $k_2$ are parameters chosen by trial and error, with $k_1 + k_2 = 1$ and $k_1 > k_2$. Each run has 600 trials, and a complete experiment includes 50 runs. The reported data are the averages over the 50 runs.
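The performance score and its averaging over runs can be written down directly. In the sketch below, the function names and the default weight `k1 = 0.7` are hypothetical, since the paper only states that $k_1 + k_2 = 1$ and $k_1 > k_2$ and that the weights were chosen by trial and error; the additional requirement $k_2 \geq 0$ is also an assumption.

```python
from statistics import mean


def performance_score(n_steps, cost, k1=0.7):
    """P = k1 * N_r + k2 * C_r with k2 = 1 - k1 and k1 > k2 (so k1 > 0.5)."""
    if not 0.5 < k1 <= 1.0:
        raise ValueError("k1 must satisfy 0.5 < k1 <= 1 so that k1 > k2 >= 0")
    k2 = 1.0 - k1
    return k1 * n_steps + k2 * cost


def average_score(runs, k1=0.7):
    """Average P over runs, where each run is a (N_r, C_r) pair."""
    return mean(performance_score(n, c, k1) for n, c in runs)
```

Because both measures are costs (steps taken and computation spent), a lower value of P indicates better performance under this reading.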

A. GRID WORLD SIZE
This dimension tests how the size of the world affects the performance of the different models. Three grid world sizes are tested in our experiments: a small world of 10 × 10, a medium world of 20 × 20, and a large world of 40 × 40. As shown in Fig. 4, Model 1 initially spent a lot of time exploring and learning the world, and thus converged to the optimal solution more slowly than Models 2 and 3, which planned paths with the model learned during exploration. Since the world is small and free of obstacles, Level 3 is active only in the very early trials and Level 4 is barely active during the whole run, as in Fig. 4 (d). Thus, the performance of Models 2 and 3 is very similar. The faster the agent finds the solution, the better its chance of survival. These characteristics also hold in a larger obstacle-free world, as shown in Fig. 5. As expected, the models take longer to converge in the larger world.

B. NUMBER OF OBSTACLES
To examine how the number of obstacles affects performance, we included 500 obstacles in the large world above. A picture of the world is shown in Fig. 1. Fig. 5 (d) shows that Level 4 of our model plays a very important role in the first 100 trials and helps the model converge much faster, as shown in Fig. 5 (a). That is, Level 3 of Model 2 ran into trouble in the early stages. We found that a major reason is that the cognitive map of the world built in the early stages is incomplete, and the incomplete map together with the large number of obstacles forms local minima that trap Level 3 and cause it to fall back to Level 2. When the map becomes complete at later stages, the optimal Q values on the path planned by Models 2 and 3 are updated. After that, Level 2 takes control with little help from higher levels.

C. INVISIBLE OBSTACLES
To further examine how non-apparent (invisible) obstacles affect performance, we compared tests on a medium-size world with 80 apparent obstacles against tests on the same world with 50 additional invisible obstacles. The results for the former are shown in Fig. 7, and those for the latter in Fig. 8. In the former case, Level 3 is able to solve the problem, so Model 2 and Model 3 are almost equivalent (Level 4 in Model 3 is inactive most of the time). By comparison, in Fig. 8 (d), Level 4 is activated during almost all of the 600 trials, indicating that Level 3 is unable to solve the transparent problem. Hence, our model again showed much better performance.

IV. CONCLUSION
Multiple levels of computational algorithms are used to represent the complexity of the various levels of brain areas. The experiments show that our model improves the problem solving of the agent, especially in non-apparent scenarios. It does so because it can update the broken model by inferring a hidden cause, which may give us a hint of how humans might solve difficult problems. Our research also shows that a pure system, whether model-free or model-based, is not an efficient way to solve most problems. In addition, the various levels of problem solvers in our model share the same slice of working memory, which is also consistent with the biological facts. However, we do not mean to conclude that the different brain areas rely only on the computational model stated in this paper. The complexity of a primate brain is far greater than what is presented here. The major purpose is to present a way in which the primate brain might be organized.
JERALD KRALIK received the B.S. degree in zoology from Michigan State University and the A.M. and Ph.D. degrees in psychology from Harvard University. He is currently a Visiting Professor with the Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST). Before this position, he was an Assistant Professor with the Department of Psychological and Brain Sciences, Dartmouth College. He also completed post-doctoral positions in behavioral neuroscience with the Duke University Medical Center and the National Institute of Mental Health. His research interests include animal cognition and behavior, cognitive neuroscience, and brain engineering.
HAIYAN MI received the B.S. degree in polymer materials and engineering from Zhejiang SCI-Tech University, Hangzhou, Zhejiang, in 2001, and the M.S. degree in business management from Florida State University, in 2006. She is currently a Lecturer with the Yiwu Industrial and Commercial College. She has published more than ten articles. Her research interests mainly include mathematical modeling and optimal control in financial systems.
VOLUME 8, 2020