Optimal Control of Iron-Removal Systems Based on Off-Policy Reinforcement Learning

The goethite iron-removal process is an important step for removing iron ions in zinc hydrometallurgy. However, because it is a coherent system with complex reaction mechanisms, associated uncertainties, and interconnected adjacent reactors, it is difficult to control the ion concentrations accurately. Because a large amount of historical data can be obtained from the process, an optimal control algorithm based on off-policy reinforcement learning is proposed in this paper to overcome these difficulties. The neural network weights are learned offline from the historical data, and the optimal control strategy is solved online. First, a bounded function is introduced to define the maximum effect of the coherent system on the subsystem cost function and to extend the cost function of the nominal system, so that the decentralized guaranteed cost control problem can be expressed as the optimal control problem of the nominal system. Then, an approximate iterative control algorithm based on the actor-critic structure is proposed. The actor and critic neural networks are used to approximate the control strategies and cost functions, respectively. To make the learning fully offline, a new neural network is added to the actor-critic structure to approximate a part of the unknown system structure, and the parameters of the three neural networks are optimized by the state transition algorithm. Finally, the strategy update and strategy iteration operations are performed alternately to learn the optimal control strategies. The effectiveness and flexibility of the proposed off-policy optimal control method are validated with data from a real industrial goethite iron-removal process.


I. INTRODUCTION
Zinc is an important non-ferrous raw material that is widely used in nonferrous metallurgy, batteries, machinery, automobile manufacturing and other industries. At present, most zinc smelting enterprises adopt the atmospheric-pressure oxygen-enriched direct leaching method with high-iron zinc sulfide concentrate as raw material [1], [2], which can effectively reduce sulfur dioxide emissions and improve the recovery rates of valuable metals in the leaching solution. Because the zinc sulfide concentrate is rich in iron, the leaching solution contains a high concentration of iron ions. If the iron ion concentration in the leaching solution exceeds the range of the technical requirement, it leads to impurity of the zinc product and increases power consumption in the electrolytic process. Therefore, the goethite iron-removal process is an important link in the leaching process. In the goethite iron-removal process, oxygen is added to oxidize the ferrous iron into ferric iron. Then, the ferric iron is hydrolyzed to goethite precipitate for iron removal. In practice, excess oxygen is generally fed into the reactor to keep the iron ion concentration within the required range. This approach wastes raw materials and causes sharp fluctuations of the pH value, as well as a low grade of goethite precipitate or even no goethite precipitate. Therefore, to ensure the quality of iron removal and to save raw materials, controlling the oxygen addition rate is a critical step.
The associate editor coordinating the review of this manuscript and approving it for publication was Mauro Gaggero.
The goethite iron-removal process consists of four continuous reactors arranged in descending order, with the zinc leach solution overflowing from one reactor to the next. It is a nonlinear process involving a series of complex chemical reactions such as oxidation, hydrolysis, and neutralization. For nonlinear problems in industrial production processes, many effective control methods have been proposed [3]-[10]. Chen et al. [4] established, for the first time, a single-tank continuous stirred reactor model based on the law of mass conservation and reaction kinetics. Based on the single-reactor model, and considering the unreacted oxygen in the leaching solution, a distributed model of the coupled cascade reactors was established, and a distributed model predictive control strategy was adopted to solve the optimization control problem of the iron-removal process. Xie et al. [5] established a weighted coupled continuous stirred tank reactor (CSTR) model for the goethite iron-removal process by introducing weighting parameters. A parameter identification method was proposed to determine the unknown parameters, and a model predictive control scheme was then designed to achieve the process performance objectives and minimize process costs. Han et al. [6] transformed the dynamic optimization problem of the iron-removal process into a nonlinear mathematical programming problem and proposed a multi-objective optimization method based on the state transition algorithm and constrained non-dominated sorting to find the optimal oxygen concentration and zinc oxide addition. Sun et al. [7] proposed a steady-state multiple-reactor gradient optimization, an unsteady-state operational pattern adjustment strategy, and a process evaluation strategy based on the oxidation-reduction potential, and demonstrated their effectiveness in industrial experiments. Yang et al. [8] proposed a model-based hybrid adaptive dynamic programming (ADP) framework consisting of continuous feedback-based policy evaluation and policy improvement steps as well as an intermittent policy implementation procedure. Shahlavi et al. [9] developed a novel fully distributed controller based on the backstepping technique and a neuro-adaptive update mechanism, and presented simulation results to demonstrate the effectiveness of the approach. Reference [10] presents a distributed solution for consensus control of a network of single-integrator incommensurate fractional-order systems with nonlinear and uncertain dynamics. However, most industrial processes operate in complex environments, and a mechanism model cannot reflect the system dynamics completely and accurately. Modeling errors, unmodeled dynamics and various uncertainties are unavoidable, which limits model-based control theory and methods.
To solve the problems caused by imprecise models, many researchers have put forward data-based control methods [11]-[13]. Unlike other data-based control methods, reinforcement learning (RL) based control [14]-[23] has the significant advantage that it can achieve performance optimization control in an unknown environment, which is of great significance to practical engineering applications. For this reason, many researchers have introduced reinforcement learning into optimal control problems. Li et al. [18] proposed a novel off-policy interleaved Q-learning algorithm for solving the optimal control problem of affine nonlinear discrete-time systems, using only the measured data along the system trajectories. Lewis and Vamvoudakis [20] proposed an online strategy iteration algorithm based on reinforcement learning. The algorithm uses an actor-critic network architecture and adjusts the weights of the actor network and the critic network synchronously. Yang et al. [21] proposed a novel barrier-actor-critic algorithm for adaptive optimal learning that guarantees the full-state constraints and input saturation. Luo et al. [22] proposed a data-based approximate strategy iteration algorithm, which updates the weights using the weighted residual method based on least squares. Yan et al. [23] proposed a Q-learning algorithm based on policy iteration and proved that, under a given boundedness condition, the approximate Q-function converges to a finite neighborhood of the optimal Q-function. Zhu et al. [25] transformed the H∞ optimal control problem into a zero-sum game problem and obtained the H∞ optimal control law of the perturbed system by using the strategy iteration algorithm. Dong [28] studied the event-triggered iterative ADP method and applied it to the optimal control of a grinding process [29]. Maldonado et al. [30] studied the optimal control of a flotation cell using adaptive dynamic programming. This method can learn from the operation data of the flotation cell and improve the controller iteratively. In recent years, some scholars have also extended RL methods to decentralized control. Liu et al. [31] used a neural-network-based online learning optimal control method to put forward a decentralized control strategy that stabilizes a class of continuous-time nonlinear interconnected systems, and designed the optimal controllers of the isolated subsystems by using cost functions that reflect the bounds of the interconnections. Wang et al. [32] proposed a learning-based optimal control method for interconnected systems. By combining the robust decentralized control formulation with adaptive critic learning, a decentralized guaranteed cost [33], [34] controller was designed. However, this method still requires the model to be known. At present, the model-free RL method for nonlinear continuous-time decentralized control is still an open problem, which motivates the research of this paper.
In this paper, the decentralized control problem of continuous-time nonlinear systems with unknown models is considered. Inspired by Wang et al. [32], this paper introduces a bounded function to transform the decentralized guaranteed cost control problem into the optimal control problem of the nominal system. Building on [32], a reinforcement learning based optimal control algorithm is proposed. Three neural networks are used to approximate the critic, the actor and a part of the unknown system structure, respectively. Because traditional solution methods, such as the minimum residual method [22], do not work when the linear relationship between the residual and the parameters is not satisfied, the parameters of the three neural networks are optimized by the state transition algorithm [35], and the optimal control strategy is learned when the system model is unknown.
The rest of this paper is arranged as follows. Section II describes the decentralized optimal control problem. Section III derives the data-based iterative algorithm for decentralized control. Section IV verifies the applicability of the decentralized control strategy through simulation experiments on real industrial data. A brief conclusion is given in Section V.

II. DESCRIPTION OF THE OPTIMAL CONTROL OF IRON-REMOVAL SYSTEM
The goethite process treats the zinc sulphate solution obtained by direct leaching of zinc concentrate. Usually, iron removal needs to be performed in the temperature range of 65 °C to 82 °C. The goethite process in a representative Chinese zinc smelting plant is taken as the object of study, and its simplified flow chart is shown in Fig. 1. The iron removal is completed slowly in the continuous stirred reactors. The zinc sulphate solution from the previous procedure enters the 1# reactor as the inlet solution, flows out of the 1# reactor as the outlet solution, and then flows into the 2# reactor. Similarly, the outlet solution of each reactor is the inlet solution of the next one. Moreover, the outlet concentrations of Fe³⁺ and Fe²⁺ and the pH in each reactor should be controlled within the set range, so that the iron-ion concentration is less than 1 g·L⁻¹ and the pH value is between 3.0 and 4.0. The zinc sulphate solution leaving the 4# reactor is then qualified for the next procedures. In addition, part of the zinc sulphate solution leaving the last reactor is sent back to the 1# reactor as the backflow solution, since the goethite crystal nuclei in it promote iron removal.
In the process, each reactor hosts a series of complex chemical reactions among the gas, liquid and solid phases. In terms of their influence on iron removal, the main chemical reactions are as follows:
Oxidation reaction:
$$4\mathrm{Fe^{2+}} + \mathrm{O_2} + 4\mathrm{H^+} \rightarrow 4\mathrm{Fe^{3+}} + 2\mathrm{H_2O}$$
Neutralization reaction:
$$2\mathrm{H^+} + \mathrm{ZnO} \rightarrow \mathrm{Zn^{2+}} + \mathrm{H_2O}$$
Hydrolysis reaction:
$$\mathrm{Fe^{3+}} + 2\mathrm{H_2O} \rightarrow \mathrm{FeOOH}\downarrow + 3\mathrm{H^+}$$
In the oxidation reaction, Fe²⁺ is oxidized to Fe³⁺, which hydrolyzes to form goethite precipitate that can be removed through filtration. The neutralization reaction ensures suitable reaction conditions, i.e., it keeps the pH value of the solution within a certain range.
In the actual goethite process, data sampling can only be carried out every two hours because of the sealed reactors. Under this condition, those process data obtained by periodic sampling cannot be directly used in continuous time optimal control. Therefore, it is necessary to establish the mechanism model of the iron-removal process for optimal control.
Assume that the temperature in the reactor remains unchanged and that the solution is uniformly stirred. Then the rates of the oxidation, hydrolysis, and neutralization reactions in the solution can be obtained from chemical reaction kinetics: where $k_1$, $k_2$, $k_3$ are the reaction rate constants and $\alpha$, $\beta$, $\gamma$ are the reaction orders. $C_{\mathrm{O_2}}$ is the concentration of dissolved oxygen in the solution, an important variable that affects the oxidation rate of the ferrous ions. Parameter $m$ is the mass of zinc oxide, $\rho$ is the density of the zinc oxide particles and $R_s$ is the radius of the zinc oxide particles.
A simulation based on the CSTR model of the goethite process is proposed in [5]. In this setting, the control variable available to the controller is oxygen. According to the reaction rate equations, the dynamic equation of dissolved oxygen and the law of mass conservation, the ion concentrations and the dissolved oxygen in the solution at the reactor outlet are taken as states. The model of a single reactor can be described as follows: where $F$ is the flow rate and $V$ is the reactor volume. Parameters $C_{\mathrm{Fe^{2+}},in}$, $C_{\mathrm{Fe^{3+}},in}$ and $C_{\mathrm{H^+},in}$ are the inlet concentrations of Fe²⁺, Fe³⁺ and H⁺, respectively. Parameter $m_{\mathrm{ZnO}}$ represents the mass of zinc oxide, and $C_{\mathrm{O_2}}$ is the dissolved oxygen concentration. The dissolved oxygen concentration is selected as a new state variable, where $\rho$, $R_s$, $V$ and $\rho_{\mathrm{O_2}}$ are constants, and $F$, $C_{\mathrm{Fe^{2+}},in}$, $C_{\mathrm{Fe^{3+}},in}$, $C_{\mathrm{H^+},in}$, $C_{\mathrm{Fe^{2+}}}$, $C_{\mathrm{Fe^{3+}}}$, $C_{\mathrm{H^+}}$ are obtained from the sampled data. The reaction rate constants $k_1$, $k_2$, $k_3$ and the reaction orders $\alpha$, $\beta$, $\gamma$ are the parameters to be identified.
In the actual goethite process, the ion concentration range at the inlet of the reactor is shown in the table below. According to [4], the reaction orders $\alpha$, $\beta$, $\gamma$ can normally be taken as 1. In this paper, therefore, $\alpha = \beta = \gamma = 1$.
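To make the single-reactor balance concrete, the following is a minimal simulation sketch, not the identified plant model. It assumes a generic CSTR mass balance of the form dC/dt = (F/V)(C_in - C) plus or minus a reaction rate, with first-order rates (reaction orders equal to 1, as set above). It tracks only the two iron species and omits the H⁺ and dissolved-oxygen balances for brevity. The rate expressions, the rate constants `K_OX` and `K_HYD`, the dimensionless oxygen level `o2`, and the function names are hypothetical placeholders chosen for illustration.

```python
import numpy as np

# Illustrative constants (hypothetical, not identified plant values).
F = 135.0     # flow rate, m^3/h
V = 300.0     # effective reactor volume, m^3
K_OX = 0.5    # assumed oxidation rate constant, 1/h
K_HYD = 0.8   # assumed hydrolysis rate constant, 1/h


def cstr_rhs(c, c_in, o2):
    """Right-hand side of an assumed first-order CSTR balance.

    c    : array [Fe2+, Fe3+] of outlet concentrations, g/L
    c_in : array [Fe2+, Fe3+] of inlet concentrations, g/L
    o2   : dissolved-oxygen level acting as the control input (dimensionless here)
    """
    fe2, fe3 = c
    r_ox = K_OX * o2 * fe2     # Fe2+ oxidized to Fe3+ (assumed rate law)
    r_hyd = K_HYD * fe3        # Fe3+ hydrolyzed to goethite (assumed rate law)
    dilution = (F / V) * (c_in - c)
    return dilution + np.array([-r_ox, r_ox - r_hyd])


def simulate(c0, c_in, o2, hours=2.0, dt=1e-3):
    """Integrate the balance with forward Euler and return the final state."""
    c = np.array(c0, dtype=float)
    c_in = np.asarray(c_in, dtype=float)
    for _ in range(round(hours / dt)):
        c = c + dt * cstr_rhs(c, c_in, o2)
    return c


if __name__ == "__main__":
    inlet = [14.0, 1.7]  # g/L, initial iron levels quoted in the simulation section
    print(simulate(inlet, inlet, o2=1.0))
    print(simulate(inlet, inlet, o2=2.0))
```

In this toy balance a larger oxygen level lowers the outlet Fe²⁺ concentration, which is the qualitative trade-off that the oxygen addition rate controls.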
However, even if only the key variables of the goethite process are considered, the interactions between these variables make identifying the parameters a considerable challenge. Therefore, a certain degree of modelling accuracy is often sacrificed in practice, which directly affects control accuracy.
Moreover, the goethite system is a coherent system. The solution in each reactor flows into the next one, hence the ion concentration at the outlet of one reactor equals the ion concentration at the inlet of the subsequent reactor. Consider the j# reactor as subsystem $j$, define the state of subsystem $j$ as $x_j(t) = [c_{j,\mathrm{Fe^{2+}}}, c_{j,\mathrm{Fe^{3+}}}, c_{j,\mathrm{H^+}}]^T$, and take $u_j = c_{j,\mathrm{O_2}}$ as the control variable of subsystem $j$. To obtain the optimal control when the parameters are difficult to identify, it is assumed that, with reference to equation (7), the goethite process in any reactor can be expressed in the state-space form of equation (8).
The system functions $f_j(\cdot)$ and $g_j(\cdot)$ are both differentiable, and $h_j(x(t))$ represents the interconnections between subsystem $j$ and the other subsystems. The system functions and interconnections take two forms in the iron-removal process. For the 1# reactor: where $C_{1,in}$, $C_{2,in}$ and $C_{3,in}$ represent the Fe²⁺, Fe³⁺ and H⁺ concentrations of the solution at the inlet of the 1# reactor, respectively, and $F_b$ represents the backflow rate from the 4# reactor. For the 2#-4# reactors: For further study, if the interconnections are not considered, the nominal subsystem of subsystem (8) can be defined as follows:
Hypothesis 1 [32]: Assume that the concentrations of the three ions Fe²⁺, Fe³⁺ and H⁺ in subsystem $j$ are within the bounds given in TABLE 1, and that the interconnections satisfy the following structure: where $D_j(\cdot) \in \mathbb{R}^{n_j \times r_j}$ and $\zeta$ describe the structure of the interconnections, with $\zeta(0) = 0$; $c_j \in \mathbb{R}^{r_j}$ is the uncertain function of the interconnections, with $c_j(0) = 0$; and $d_j(\cdot) \in \mathbb{R}^{r_j}$ is a known bounded function with $d_j(0) = 0$, $j = 1, \ldots, 4$.
For the nominal system (11), the cost function of subsystem $j$ can be expressed as: where $Q_j(x_j)$ is a positive definite function and $R_j = R_j^T > 0$ is a symmetric positive definite matrix.
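As a numerical illustration of the subsystem cost, the sketch below evaluates a finite-horizon approximation of the integral of $Q_j(x_j) + u_j^T R_j u_j$ along a sampled trajectory with the trapezoidal rule. It assumes this standard quadratic-in-control integrand and uses $Q(x) = x^T x$ with $R = I$ by default, which are the choices reported in the simulation section. The function name `running_cost` and the sample data are hypothetical.

```python
import numpy as np


def running_cost(t, x, u, R=None):
    """Trapezoidal approximation of the integral of x^T x + u^T R u over time.

    t : (N,) sample times
    x : (N, n) sampled states
    u : (N,) or (N, m) sampled controls
    R : (m, m) control weighting matrix; identity if omitted
    """
    t = np.asarray(t, dtype=float)
    x = np.atleast_2d(np.asarray(x, dtype=float))
    u = np.asarray(u, dtype=float).reshape(len(t), -1)
    if R is None:
        R = np.eye(u.shape[1])
    state_term = np.einsum("ki,ki->k", x, x)           # Q(x) = x^T x per sample
    control_term = np.einsum("ki,ij,kj->k", u, R, u)   # u^T R u per sample
    integrand = state_term + control_term
    # Trapezoidal rule written out explicitly.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))


if __name__ == "__main__":
    t = [0.0, 1.0, 2.0]
    x = np.ones((3, 3))       # x^T x = 3 at every sample
    u = [1.0, 1.0, 1.0]       # u^T R u = 1 at every sample
    print(running_cost(t, x, u))  # integrand is 4 over a 2-hour window
```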
The objective of optimal control for the goethite process is as follows: given an initial state $x_{j0}$, design an approximately optimal control strategy $u_j(t) = u_j^*(x)$ that makes the local subsystem $j$ asymptotically stable and minimizes the cost function (14). The optimization control problem can be described as:

III. DECENTRALIZED OPTIMAL CONTROL OF IRON-REMOVAL SYSTEM BASED ON REINFORCEMENT LEARNING
A. APPROXIMATE ITERATIVE ALGORITHM
From the previous section, the cost function (14) cannot directly evaluate the coherent system. To solve this problem, the optimal guaranteed cost control problem of the original coherent system (8) is converted into the optimal feedback control problem of the nominal system (11), following the idea in [17].
Lemma 1 [32]: Assume that there exist a cost function $V_j(x)$, a bounded function $B_j(x)$ with $B_j(x) > 0$, and a control law $u_j(x)$ such that condition (17) holds, where $\nabla V_j(x) \triangleq \partial V_j/\partial x$ is the partial derivative of the cost function $V_j$ of subsystem $j$ with respect to the system state $x$. Then system (8) is locally asymptotically stable in a neighbourhood of the origin. In addition, the modified cost function of the nominal system (11) is described as: To deal with the interconnections, $B_j(x)$ is set to the specific form: For the new cost function (18), the Hamiltonian of the nominal system (11) can be defined as: Setting the Hamiltonian to $H_j(x_j, u_j, \nabla V_j^*) = 0$, the optimal control of the Hamilton-Jacobi-Bellman (HJB) equation can be obtained: The modified HJB equation can be written as: Therefore, the optimal guaranteed cost control problem of the original coherent system is transformed into the optimal feedback control problem of the nominal system. The optimal control strategy (21) depends on the solution of the HJB equation (22), which can be successively approximated by the generalized HJB (GHJB) sequence as follows: Since the model of the actual goethite process is not completely accurate, $R_j$ and $g_j(x_j)$ in equation (24) often cannot be obtained precisely. In order to obtain the optimal control strategy when the model is not accurate enough or is unknown, an approximate strategy iteration algorithm is proposed. In this algorithm, actual system data are used to learn the solution of the HJB equation with neural networks.
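For reference, the stationarity step behind the optimal control law can be sketched as follows. This is the standard argument for a control-affine nominal system with a quadratic control penalty, written under two assumptions that are not confirmed by the surviving text: the nominal dynamics have the form $\dot{x}_j = f_j(x_j) + g_j(x_j)u_j$, and the bounded term $B_j$ does not depend on $u_j$. It is a sketch of the usual derivation, not a reproduction of equations (20) and (21).

```latex
\begin{aligned}
H_j(x_j, u_j, \nabla V_j)
  &= Q_j(x_j) + u_j^{T} R_j u_j + B_j(x_j)
     + \nabla V_j^{T}(x_j)\bigl(f_j(x_j) + g_j(x_j)\,u_j\bigr), \\
\frac{\partial H_j}{\partial u_j}
  &= 2 R_j u_j + g_j^{T}(x_j)\,\nabla V_j(x_j) = 0
  \;\Longrightarrow\;
  u_j^{*}(x_j) = -\tfrac{1}{2}\, R_j^{-1} g_j^{T}(x_j)\,\nabla V_j^{*}(x_j).
\end{aligned}
```

Because $R_j$ is positive definite, the Hamiltonian is strictly convex in $u_j$, so this stationary point is the unique minimizer.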
For that purpose, system (11) can be rewritten as: For system (25), the derivative of $V_j^{(i+1)}(x)$ with respect to time can be found as: Using equations (23) and (24), equation (26) can be written as follows: Integrating both sides of (27) over the interval $[t, t + \Delta t]$ gives: where $V_j^{(i+1)}(x)$ and $u_j^{(i+1)}(x_j)$ are the unknown function and the unknown vector of subsystem $j$, respectively. The problem of solving the GHJB equation (23) for $V_j^{(i+1)}(x)$ is thus transformed into the problem of solving equation (28).
Lemma 2: Let $\lambda(x) \in \mathbb{R}^m$, $b(x) \in \mathbb{R}$ and $c \in \mathbb{R}^m$, where $c$ is the variable. If $\lambda^T(x)c = b(x)$ holds for all $c \neq 0$, then $\lambda(x) = 0$ and $b(x) = 0$. When $c$ is given a fixed value $c_0 \neq 0$, then $\lambda^T(x)c_0 = b(x)$. Summing up the above analysis, the equation in (29) can be obtained.
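The first claim of Lemma 2 can be checked with a short scaling argument. The sketch below assumes that the condition is meant to hold for every nonzero $c$, with $b(x)$ independent of $c$; the unit vectors $e_i$ are introduced here only for illustration.

```latex
\begin{aligned}
&\text{Fix } c \neq 0. \text{ Applying the condition to } c \text{ and to } 2c: \quad
\lambda^{T}(x)\,c = b(x), \qquad 2\,\lambda^{T}(x)\,c = b(x)
\;\Longrightarrow\; b(x) = 0. \\
&\text{Hence } \lambda^{T}(x)\,c = 0 \text{ for all } c \neq 0.
\text{ Choosing } c = e_i,\ i = 1, \ldots, m, \text{ gives } \lambda_i(x) = 0,
\text{ so } \lambda(x) = 0.
\end{aligned}
```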
Proof: Rewrite equation (28) as: According to Lemma 2, we have: Comparing equations (33) and (34) with (23) and (24), respectively, it is easy to see that they are exactly the same. This completes the proof.

B. ACTOR-CRITIC NEURAL NETWORK AND ITS PARAMETER SOLUTION
In order to solve Eq. (28) for $V_j^{(i+1)}(x)$ and $u_j^{(i+1)}(x)$, a method based on the actor-critic neural network (NN) structure is adopted. Combining the advantages of actor-only and critic-only methods, the actor-critic structure has low variance and supports continuous actions. The critic neural network is used to approximate the cost function $V_j$, and the actor neural network is used to approximate the control strategy $u_j$. The outputs of the critic NN and the actor NN are given by: where $\theta_{j,V}$ and $\theta_{j,u_l} = [\theta_{j,u_l,1}, \ldots, \theta_{j,u_l,L_u}]^T$ are the weight vectors of the critic and actor neural networks, respectively.
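A minimal sketch of linear-in-weights critic and actor approximators of this kind is given below, assuming polynomial basis functions and a scalar control. The actor basis (powers 1 to 4 of each state component, 12 features) mirrors what the simulation section appears to use; the quadratic critic basis, the example weights, and all function names are hypothetical.

```python
import numpy as np


def actor_basis(x):
    """Powers 1..4 of each component of a 3-state vector, giving 12 features."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x ** p for p in range(1, 5)])


def critic_basis(x):
    """All quadratic monomials x_a * x_b with a <= b (assumed form), 6 features."""
    x = np.asarray(x, dtype=float)
    rows, cols = np.triu_indices(len(x))
    return x[rows] * x[cols]


def critic_value(theta_v, x):
    """Critic output: weighted sum of the critic basis functions."""
    return float(np.dot(theta_v, critic_basis(x)))


def actor_control(theta_u, x):
    """Actor output for a scalar control: weighted sum of the actor basis."""
    return float(np.dot(theta_u, actor_basis(x)))


if __name__ == "__main__":
    x = np.array([14.0, 1.7, 3.6])   # initial state quoted in the simulation section
    theta_v = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # picks x1^2 + x2^2 + x3^2
    theta_u = np.zeros(12)
    theta_u[0] = -0.1                # hypothetical gain on the first state
    print(critic_value(theta_v, x), actor_control(theta_u, x))
```

Keeping both approximators linear in their weights means that only the weight vectors have to be identified from data, which is what the parameter optimization step operates on.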
According to [22], equation (36) can be rewritten as: Define the residuals as: In order to eliminate the dependence of the interconnections on $g_j(x_j)$, a structural neural network $q^{(i)}$ is constructed on the basis of the actor-critic neural network as follows: where $i = 0, 1, 2, \ldots$ and the structural neural network uses a vector of linearly independent activation functions. The residual can then be expressed as: In order to solve for the unknown parameters in equation (38), the parameters of the neural networks are first obtained by minimizing the following objective function: where the residual is the one defined in equation (40), and $\theta_j = [\theta_{j,V}, \theta_{j,u}, \theta_{j,\rho}]$ collects the weight vectors to be identified for the critic, actor and structural neural networks.
Commonly used methods such as the minimum residual method [22] require a linear relationship between the parameters and the residuals defined by the HJB equation. However, there are many optimization variables in the optimal control problem of the goethite process, and it is difficult to satisfy this linearity requirement. To address this problem, an intelligent global optimization algorithm, the state transition algorithm (STA) [35], is used to optimize the parameters.
The parameters to be identified in equation (41) are encoded as the state $\tilde{x}$, and the process of parameter optimization by the state transition algorithm can be expressed as follows:
$$\begin{cases} \tilde{x}_{k+1} = \tilde{A}_k \tilde{x}_k + \tilde{B}_k \tilde{u}_k \\ \tilde{y}_{k+1} = \tilde{f}(\tilde{x}_{k+1}) \end{cases} \tag{42}$$
where $\tilde{x} \in \mathbb{R}^n$ represents the state of the parametric solution, $k$ is the number of iteration steps, and $\tilde{y}_k$ represents the fitness of the state $\tilde{x}_k$. $\tilde{A}_k$ and $\tilde{B}_k$ are the state transition matrices at each update of the solution state. $\tilde{u}_k$ is a function of the current state $\tilde{x}_k$ and the historical state $\tilde{x}_{k-1}$, while $\tilde{f}(\cdot)$ is the fitness function evaluated at the state $\tilde{x}_k$. The state transition algorithm generates random candidate solutions through four operators: a rotation transformation operator, a translation transformation operator, a scaling (expansion) transformation operator, and an axis search operator.

1) ROTATION TRANSFORM OPERATOR
$$\tilde{x}(k+1) = \left( I_n + \alpha \frac{1}{n \|\tilde{x}(k)\|_2} R_r \right) \tilde{x}(k) \tag{43}$$
where $\alpha$ is the rotation factor of the STA, a positive constant; $n$ is the dimension of the solution state; and $R_r \in \mathbb{R}^{n \times n}$ is a random matrix whose entries are uniformly distributed on $[-1, 1]$. The rotation transformation searches within the hypersphere centered at the current state $\tilde{x}(k)$ with radius $\alpha$.

2) TRANSLATION OPERATOR
$$\tilde{x}(k+1) = \tilde{x}(k) + \beta R_t \frac{\tilde{x}(k) - \tilde{x}(k-1)}{\|\tilde{x}(k) - \tilde{x}(k-1)\|_2} \tag{44}$$
where $\beta$ is the translation factor of the STA, a positive constant, and $R_t \in \mathbb{R}^{n \times n}$ is random with entries uniformly distributed on $[0, 1]$. The translation moves the state along the direction from $\tilde{x}(k-1)$ to $\tilde{x}(k)$ with a maximum step size of $\beta$.

3) SCALING OPERATOR
$$\tilde{x}(k+1) = \tilde{x}(k) + \gamma R_e \tilde{x}(k) \tag{45}$$
where $\gamma$ is the scaling factor of the STA, a positive constant, and $R_e \in \mathbb{R}^{n \times n}$ is a random diagonal matrix whose entries obey a Gaussian distribution. The scaling operator can search across the entire search space.

4) AXIS SEARCH OPERATOR
$$\tilde{x}(k+1) = \tilde{x}(k) + \delta R_a \tilde{x}(k) \tag{46}$$
where $\delta$ is the axis search factor of the STA, a positive constant, and $R_a \in \mathbb{R}^{n \times n}$ is a sparse random diagonal matrix with only one nonzero element at a random position, which obeys a Gaussian distribution.
After the parameters of the critic and actor neural networks are obtained, the final control strategy $u$ can be solved from the weights, the basis functions and the current state of each subsystem to minimize the cost function. Combining the approximate iterative algorithm and the STA proposed in this paper, the steps to solve the control strategy of the associated iron-removal system are as follows:
Step 1: Under the given initial stable controller and initial state, collect sample data of the ion concentrations at the reactor outlet for a period of time.
Step 2: Select the basis functions for the critic NN, the actor NN and the structural NN, then encode the weights to be identified as the states in the STA.
Step 3: From the current population, select the solution state that minimizes the fitness function $\tilde{f}(\cdot)$, that is, the objective function (41). Record it as Best with fitness $f_{Best}$, then copy Best SE times to form a population of SE individuals. The population is recorded as $\tilde{x}(k)$, and a new population is obtained by performing the scaling transformation according to equation (45). The optimal individual in the population after the scaling transformation is newBest, with fitness $g_{Best}$. If $g_{Best}$ is less than $f_{Best}$, perform a translation transformation on newBest according to equation (44), and update Best and $f_{Best}$ after the translation transformation.
Step 4: Copy Best into a population of SE individuals, and then perform the rotation transformation according to equation (43) to obtain a new population. Select the best individual newBest in the population after the rotation transformation, with fitness $g_{Best}$. If $g_{Best}$ is less than $f_{Best}$, perform the translation transformation according to equation (44), and update Best and $f_{Best}$ after the translation transformation.
Step 5: Copy Best into a population of SE individuals, and then perform the axis search transformation according to equation (46). Select the best individual among all individuals after the transformation as newBest, with fitness $g_{Best}$. If $g_{Best}$ is less than $f_{Best}$, perform the translation transformation according to equation (44), and update Best and $f_{Best}$ after the translation transformation.
Step 6: Repeat Steps 3-5. When the given termination condition $\theta \le \zeta$ is satisfied or the number of iterations exceeds the given limit, take the parameter vector that minimizes the objective function as the parameters of the critic NN and actor NN.
Step 7: According to the weights and basis functions of the neural networks obtained by optimization and the current state of each subsystem collected from the system, the real-time optimization control strategy $u_j$ for each subsystem is solved according to equation (41).
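The four operators and Steps 3-6 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The operator formulas follow the commonly published form of the state transition algorithm, the translation random factor is taken as one scalar per candidate, the factors are held fixed instead of being decreased over time, and the termination test is replaced by a fixed iteration budget. The factor values, the population size `se`, all function names, and the quadratic test function standing in for the objective (41) are hypothetical.

```python
import numpy as np


def expand(best, se, gamma, rng):
    """Scaling operator: each coordinate is stretched by a Gaussian factor."""
    return best + gamma * rng.standard_normal((se, best.size)) * best


def rotate(best, se, alpha, rng):
    """Rotation operator: random search inside a hypersphere of radius alpha."""
    n = best.size
    r = rng.uniform(-1.0, 1.0, size=(se, n, n))
    norm = np.linalg.norm(best) + 1e-12   # guard against division by zero
    return best + alpha / (n * norm) * (r @ best)


def axesion(best, se, delta, rng):
    """Axis search operator: perturb a single random coordinate per candidate."""
    out = np.tile(best, (se, 1))
    idx = rng.integers(0, best.size, size=se)
    out[np.arange(se), idx] += delta * rng.standard_normal(se) * best[idx]
    return out


def translate(new, old, se, beta, rng):
    """Translation operator: line search from old towards and beyond new."""
    d = new - old
    norm = np.linalg.norm(d) + 1e-12
    return new + beta * rng.uniform(0.0, 1.0, size=(se, 1)) * d / norm


def select(f, best, f_best, cand, se, beta, rng):
    """Keep the best candidate if it improves, then try a translation step."""
    vals = np.array([f(c) for c in cand])
    k = int(np.argmin(vals))
    if vals[k] < f_best:
        old, best, f_best = best, cand[k].copy(), float(vals[k])
        moved = translate(best, old, se, beta, rng)
        mvals = np.array([f(c) for c in moved])
        m = int(np.argmin(mvals))
        if mvals[m] < f_best:
            best, f_best = moved[m].copy(), float(mvals[m])
    return best, f_best


def sta_minimize(f, x0, se=20, iters=200, alpha=1.0, beta=1.0,
                 gamma=1.0, delta=1.0, seed=0):
    """Greedy STA loop: scaling, rotation, then axis search in each iteration."""
    rng = np.random.default_rng(seed)
    best = np.asarray(x0, dtype=float)
    f_best = float(f(best))
    for _ in range(iters):
        for op, factor in ((expand, gamma), (rotate, alpha), (axesion, delta)):
            cand = op(best, se, factor, rng)
            best, f_best = select(f, best, f_best, cand, se, beta, rng)
    return best, f_best


if __name__ == "__main__":
    def objective(theta):
        # Stand-in for the objective (41): a simple quadratic bowl.
        return float(np.sum((theta - 1.0) ** 2))

    theta, value = sta_minimize(objective, [3.0, -2.0, 0.5])
    print(theta, value)
```

Because every update is accepted only when the fitness decreases, the returned fitness is never worse than that of the starting point. No linear relationship between residuals and parameters is required, since the objective is only evaluated, never differentiated.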

IV. SIMULATION
Assume that the goethite iron-removal system satisfies the affine nonlinear structure of formula (25). To verify the proposed decentralized optimization control method for the coherent iron-removal system, simulation experiments are carried out with actual data from the goethite iron-removal process. According to these data, the flow rates of the reactors are $F_b \in [110, 120]$ m³/h and $F \in [120, 150]$ m³/h, and the effective volume of each reactor is $V = 300$ m³. Therefore, for the 1# reactor, the parameters of the interconnections can be set as:
Taking the 1# reactor as an example, the initial state is set to $x_{10} = [14, 1.7, 3.6]^T$ according to the actual system, and the initial controller is $u_{10} = 51.6164$ according to the established model (10). The parameters $k_{j1} = 1.4623$, $k_{j2} = 1.6693$, $k_{j3} = 0.2802$ in the model are identified by the least squares method. With the choice of $Q_1 = x_1^T x_1$ and $R_1 = I$ in the cost function (18), the following functions are selected as the basis functions of the critic network: The selection rules of $\phi_2(x)$, $\phi_3(x)$, $\phi_4(x)$ are the same as for $\phi_1(x)$, and the number of hidden layer nodes is $L_V = 12$. Similarly, the following function is selected as the basis function for the actor network: $\psi_1(x) = [\ldots, x_{1,1}^4, x_{1,2}^4, x_{1,3}^4]$. The functions $\psi_2(x)$, $\psi_3(x)$, $\psi_4(x)$ are selected in the same way as $\psi_1(x)$, and the number of hidden layer nodes is $L_u = 12$. Similarly, the following is selected as the basis function of the structural neural network: $\varphi_1(x) = [\ldots, x_{1,2}^2, x_{1,3}^2, x_{1,1}^3, x_{1,2}^3, x_{1,3}^3, x_{1,1}^4, x_{1,2}^4, x_{1,3}^4]$. The selection rules of $\varphi_2(x)$, $\varphi_3(x)$, $\varphi_4(x)$ are the same as for $\varphi_1(x)$. The initial controller $u_0$ is obtained from actual experience, and an initial weight vector $\theta$ is set according to the initial state and the initial control $u_0$. Taking the 1# reactor as an example, the control strategy (33) is used for closed-loop simulation. Figures 2(a), (b), (c) and (d) show the Fe²⁺ and Fe³⁺ concentrations in the 1#-4# reactors, respectively.
Figures 3(a), (b), (c) and (d) compare the oxygen consumption of the proposed optimal control with that of the initial control in the 1#-4# reactors, respectively. Figures 4(a), (b), (c) and (d) show the changes of the pH value in the 1#-4# reactors within two hours, respectively. It can be seen from Figure 2 that the Fe²⁺ concentration in the solution to be treated is reduced from 14 g/L to 0.4 g/L, while the Fe³⁺ concentration is reduced from 1.7 g/L to 0.8 g/L. These concentration changes all meet the process technical requirements. The fluctuations of the Fe²⁺ and Fe³⁺ concentrations in the reactors are small, which avoids the formation of some by-products and ensures the smoothness of the goethite process. It can be seen from Figure 3 that the oxygen consumption of the proposed optimal control is significantly lower than that of the initial control. The results of the oxygen consumption comparisons are shown in Table 1. Compared with the initial controller, the proposed optimal control reduced the oxygen consumption of the 1#-4# reactors by 13.73 m³/h, 10.86 m³/h, 8.12 m³/h and 8.91 m³/h, respectively, over two hours. The results show that the proposed control method is resource-saving. Table 2 shows the detailed comparison of pH values, which indicates that, compared to the initial control, the proposed optimal control leads to smaller pH fluctuations of the solution in the reactors.

V. CONCLUSION
This paper proposes an off-policy optimal control method based on reinforcement learning for the associated iron-removal system. A bounded function is introduced to define the maximum impact of the associated system on the subsystem cost function. The bounded function extends the cost function of the nominal system, and the new cost function is optimized to ensure that the cost function of the associated system is not higher than the nominal system cost function, thereby obtaining approximately optimal control. By taking advantage of the large amount of data obtained in the goethite iron-removal process, learning the neural network weights offline, and solving the optimal control strategy online, this method is convenient for practical industrial operation. Based on the actor-critic structure, a new neural network is introduced to approximate a part of the unknown system structure. In this way, the optimal control method in [22] is extended to the coherent system, and the required constraints between parameters and residuals are relaxed. In the simulation experiment based on actual industrial data, the two ion concentrations and the pH values in the goethite iron-removal process are strictly controlled within the range required by the process, and the ion fluctuation is less than that under the initial control, which demonstrates the effectiveness of the proposed off-policy optimal control method.