A Self-Learning and Tuning Fuzzy Logic Controller Based on Genetic Algorithms and Reinforcements

Hung-Yuan Chung* and Chih-Kuan Chiang
Department of Electrical Engineering, National Central University, Chung-Li, Taiwan 32054, R.O.C.

*To whom correspondence should be sent.

This article presents a new method for learning and tuning a fuzzy logic controller automatically. A reinforcement learning scheme and a genetic algorithm are used in conjunction with a multilayer neural network model of a fuzzy logic controller, which can automatically generate the fuzzy control rules and refine the membership functions at the same time so as to optimize the final system's performance. In particular, the self-learning and tuning fuzzy logic controller based on genetic algorithms and reinforcement learning proposed here, called a Stretched Genetic Reinforcement Fuzzy Logic Controller (SGRFLC), can learn fuzzy logic control rules even when only weak information, such as a binary "success" or "failure" signal, is available. We extend the AHC algorithm of Barto, Sutton, and Anderson to include the prior control knowledge of human operators. It is shown that the system can solve a fairly difficult control learning problem; more concretely, the task is a cart–pole balancing system, in which a pole is hinged to a movable cart to which a continuously variable control force is applied. © 1997 John Wiley & Sons, Inc.

I. INTRODUCTION

Conventional and modern control theories have successfully dealt with a large class of control problems by mathematically modeling the process and solving these analytical models to generate control actions. However, they need precise knowledge of the model of the process to be controlled and exact measurements of input and output parameters. Due to the complexity and vagueness of analytical models and practical processes, the nonlinear behavior of many practical systems and the unavailability of quantitative data regarding the input–output relations make this analytical approach even more difficult. On the other hand, fuzzy logic controllers, which do not require analytical models, have demonstrated a number of successful applications: for example, in water
quality control,1 nuclear reactor control,2 and automobile transmission control.3 These applications have mainly concentrated on emulating the performance of a skilled human operator in the form of linguistic rules. Although there is an extensive literature describing various applications of fuzzy logic controllers, at present there are few systematic procedures for the design of fuzzy logic systems. The implementation of fuzzy control systems often relies on a substantial amount of heuristic observation to express the control strategy's knowledge. The practical development of such systems presents two critical problems: finding the initial fuzzy control rules, and tuning the initial control rules and their membership functions. In the conventional method, domain experts first generate the initial rules and their membership functions, and then refine them by trial and error to optimize the final system's performance. However, it is difficult for human experts to examine all the input–output data recorded from a complex process in order to find and tune the rules and membership functions within the fuzzy system.

A recent direction of exploration is to design fuzzy logic systems that have the capability to learn by themselves, starting with the self-organizing control (SOC) techniques of Mamdani and his students,4 which perform two tasks: first, they observe the process environment while issuing the appropriate control decision, and second, they use the results of this decision for further improvement. C.T. Lin and C.S. George Lee5 proposed a fuzzy logic control/decision network which is constructed automatically by learning from training examples, but in some learning environments obtaining exact training data may be expensive. In Ref. 6, reinforcement learning is used to adjust the consequences of fuzzy logic rules for the one-dimensional pole balancing problem. Kosko proposed a system, called fuzzy cognitive maps, to integrate neural networks and fuzzy logic, in which the differential Hebbian learning law is used to learn the causal structure of the system.7 In Ref. 8, feedforward multilayer neural networks are trained through backpropagation to act as membership functions for phonemes or syllables in acoustic cue detection, and fuzzy logic is applied to the output of the neural networks for speech recognition. H.R. Berenji developed an architecture which can learn to adjust the fuzzy membership functions of the linguistic labels used in different control rules through reinforcements.9,10 In these approaches, either the membership functions or the fuzzy rules are still chosen subjectively, or the complete fuzzy control problem is not solved. In this paper, we develop an architecture which can learn the fuzzy control rules and adjust the fuzzy membership functions automatically.

Connectionist learning methods can be divided into three classes: unsupervised learning, supervised learning, and reinforcement learning. Unsupervised learning methods do not rely on an external teacher to guide the learning process. In the supervised learning class, the teacher provides the learning system with the desired output for each given input; learning involves memorizing these desired outputs by minimizing the discrepancy between the actual outputs of the system and the desired outputs. In the reinforcement learning class, the teacher provides the learning system with a scalar evaluation of the system's performance of the task according to some performance measure. The objective of the learning system
is to improve its performance, as evaluated by the critic, by generating appropriate outputs. If supervised learning can be used in control (e.g., when input–output training data are available), it can be shown to be more efficient than reinforcement learning.11 However, many control problems require selecting control actions whose consequences emerge over uncertain periods for which input–output training data are not readily available. In such cases, a reinforcement learning system can learn the unknown desired outputs when provided with a suitable evaluation of its performance, so reinforcement learning techniques are more appropriate than supervised learning.

A genetic algorithm (GA) is a parallel, global search technique that emulates natural genetic operators. Because it simultaneously evaluates many points in the search space, it is more likely to converge toward the global solution. A GA applies operators inspired by the mechanics of natural selection to a population of binary strings encoding the parameter space. At each generation, it explores different areas of the parameter space and then directs the search to regions where there is a high probability of finding improved performance. By working with a population of solutions, the algorithm can in effect examine many local minima and thereby increase the likelihood of finding the global minimum. In this article, a GA is implemented to design fuzzy logic control rules and to seek the optimal linguistic values of the consequents of the fuzzy control rules within the reinforcement learning scheme.

This article is organized as follows. In Section II, traditional fuzzy logic control is introduced. Reinforcement learning and the credit assignment problem are described in Section III. In Section IV, genetic algorithms are described. The proposed approach, a combination of techniques drawn from fuzzy logic and neural network theory, is presented in Section V. In Section VI, computer simulation results for cart–pole balancing are described. The article closes with conclusions and future work in Section VII.

II. FUZZY SET AND FUZZY LOGIC CONTROL

A fuzzy set is an extension of a crisp set. Crisp sets allow only full membership or no membership at all, whereas fuzzy sets allow partial membership. In other words, an element may partially belong to a set. In a crisp set, the membership or nonmembership of an element u in a set F is described by a characteristic function $\mu_F(u)$, where

$$\mu_F(u) = 1 \quad \text{if } u \in F$$

$$\mu_F(u) = 0 \quad \text{if } u \notin F$$

Fuzzy set theory extends this concept by defining partial memberships, which can take values ranging from 0 to 1:

$$\mu_F : U \rightarrow [0, 1]$$
where U is called the universe of discourse. If $\mu_F(u)$ takes only the values 0 or 1, the fuzzy set is an ordinary set. As a special case, a fuzzy singleton is a fuzzy set containing just one element with degree 1.

A fuzzy logic controller comprises four principal components: a fuzzification interface, a knowledge base, decision-making logic, and a defuzzification interface. The fuzzification interface measures the values of the input variables and converts the input data into suitable linguistic values, which may be viewed as labels of fuzzy sets. The knowledge base comprises knowledge of the application domain and the attendant control goals. The decision-making logic has the capability of simulating human decision making based on fuzzy concepts and of inferring fuzzy control actions employing fuzzy implication and the rules of inference of fuzzy logic. The defuzzification interface converts the inferred fuzzy control action into a nonfuzzy (crisp) control action in the corresponding universe of discourse.

Because of the partial matching attribute of fuzzy control rules and the fact that the preconditions of rules do overlap, more than one fuzzy rule can fire at a time. The methodology used to decide which control action should be taken as the result of the firing of several rules is referred to as conflict resolution. The following example, using two rules, illustrates this process. Suppose that we have the two rules:

Rule 1: If X is $A_1$ and Y is $B_1$ then Z is $C_1$
Rule 2: If X is $A_2$ and Y is $B_2$ then Z is $C_2$

Each rule has an antecedent, or if, part containing several preconditions, and a consequent, or then, part which prescribes the value of one or more output actions. Now, if we have $x_0$ and $y_0$ as the sensor readings for the fuzzy variables X and Y, then their truth values for rule 1 are represented by $\mu_{A_1}(x_0)$ and $\mu_{B_1}(y_0)$, where $\mu_{A_1}$ and $\mu_{B_1}$ denote the membership functions for $A_1$ and $B_1$, respectively. Similarly, for rule 2 we have $\mu_{A_2}(x_0)$ and $\mu_{B_2}(y_0)$ as the truth values of the preconditions:

$$g_1 = \wedge\bigl(\mu_{A_1}(x_0), \mu_{B_1}(y_0)\bigr)$$

$$g_2 = \wedge\bigl(\mu_{A_2}(x_0), \mu_{B_2}(y_0)\bigr)$$

where $\wedge$ denotes a conjunction or intersection operator. Traditionally, fuzzy logic controllers use the minimum operator for $\wedge$. The control output of rule 1 is calculated by applying the matching strength of its preconditions to its conclusion. We assume that

$$z_1 = \mu_{C_1}^{-1}(g_1)$$

and, for rule 2, that

$$z_2 = \mu_{C_2}^{-1}(g_2)$$

The above equations show that rule 1 recommends a control action $z_1$ and rule 2 recommends a control action $z_2$. The combination of the two rules produces a nonfuzzy control action $z^*$, which is calculated using a weighted averaging approach:
$$z^* = \frac{\sum_{i=1}^{n} g_i \, z_i}{\sum_{i=1}^{n} g_i}$$

where n is the number of rules and $z_i$ is the control action recommended by rule i.
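To make the two-rule inference concrete, the following minimal sketch applies min conjunction and the weighted-average combination described above. The membership-function parameters and the singleton consequents are invented for illustration; they are not values from the paper.

```python
# Minimal sketch of two-rule fuzzy inference with min conjunction and
# weighted-average defuzzification. All constants are illustrative assumptions.

def tri(x, center, spread):
    """Triangular membership: mu(x) = max(1 - |x - center| / spread, 0)."""
    return max(1.0 - abs(x - center) / spread, 0.0)

def infer(x0, y0):
    # Rule 1: if X is A1 and Y is B1 then Z is C1 (singleton z1)
    # Rule 2: if X is A2 and Y is B2 then Z is C2 (singleton z2)
    rules = [
        {"A": (0.0, 2.0), "B": (0.0, 2.0), "z": -1.0},   # rule 1
        {"A": (2.0, 2.0), "B": (2.0, 2.0), "z": +1.0},   # rule 2
    ]
    strengths, actions = [], []
    for r in rules:
        g = min(tri(x0, *r["A"]), tri(y0, *r["B"]))   # conjunction of preconditions
        strengths.append(g)
        actions.append(r["z"])                        # z_i for a singleton consequent
    num = sum(g * z for g, z in zip(strengths, actions))
    den = sum(strengths)
    return num / den if den > 0 else 0.0              # weighted-average combination

if __name__ == "__main__":
    print(infer(0.5, 1.0))
```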

III. REINFORCEMENT LEARNING

In reinforcement learning, one assumes that there is no supervisor to critically judge the chosen control action at each time step. The learning scheme modifies the behavior of a system based on a single scalar evaluation of the system's output. This is to be contrasted with supervised learning schemes, in which a knowledgeable external teacher explicitly specifies the desired output in the form of a reference signal. Since the evaluative signal contains much less information than a reference signal, reinforcement learning is appropriate for systems operating in knowledge-poor environments.

The study of reinforcement learning relates to credit assignment, where, given the performance of a process, one has to distribute reward or blame to the individual elements contributing to that performance. This may be further complicated if there is a sequence of actions that is collectively awarded a delayed reinforcement. In rule-based systems, for example, this means assigning credit or blame to the individual rules engaged in the problem-solving process. Samuel's checkers-playing program is probably the earliest AI program to use this idea.12 Michie and Chambers13 used a reward–punishment strategy in their BOXES system, which learned to do cart–pole balancing by discretizing the state space into nonoverlapping regions and applying two opposite constant forces. Barto, Sutton, and Anderson14 used two neuronlike elements to solve the learning problem in cart–pole balancing. In these approaches, the state space is partitioned into nonoverlapping smaller regions and the credit assignment is then performed on a local basis.

The main advantage of reinforcement learning is that it is easy to implement because, unlike backpropagation, which computes the effect of changing a local variable, the credit assignment does not require any special apparatus for computing derivatives. Reinforcement learning can therefore be used in complex systems in which it would be very hard to compute reinforcement derivatives analytically. The objective of reinforcement learning is to maximize some function of the reinforcement signal, such as the expectation of its value on the upcoming time step or the expectation of some integral of its value over all future time, as appropriate for the particular task.

IV. GENETIC ALGORITHMS

GAs have some properties that make them inviting as a technique for selecting high-performance rules for a FLC. Due to these properties, GAs differ fundamentally from more conventional search techniques. GAs consider many points from the search space simultaneously and therefore have a reduced chance of
converging to a local optimum. The process by which populations are generated and tested is similar to that of a natural population of biological creatures, in which successive generations of organisms are produced and raised until they themselves are ready to reproduce. In most conventional search techniques, a single point is considered, based on some decision rule. These methods can be dangerous in multimodal search spaces because they can converge to a local optimum. GAs, however, generate entire populations of points, test each point independently, and combine qualities from existing points to form a new population containing improved points. Aside from producing a more global search, the GA's simultaneous consideration of many points makes it highly adaptable to parallel processors, since the evaluation of each point is an independent process.

A genetic algorithm in its simplest form uses three operators: reproduction, crossover, and mutation.

Reproduction. Reproduction is a process in which individual strings are copied according to their objective function values f (biologists call this function the fitness function). Intuitively, we can think of the function f as some measure of profit, utility, or goodness that we want to maximize. Strings with a higher value have a higher probability of contributing one or more offspring in the next generation, and those with a lower value have a correspondingly lower probability. To achieve this, the strings are selected according to what has become known as stochastic remainder selection without replacement.15

Crossover. Simple crossover proceeds in three steps. First, two newly reproduced strings are selected from the mating pool formed through reproduction. Second, a position along the two strings is selected uniformly at random. For example, the following binary-coded strings A and B of length 8 are shown aligned for crossover, with crossing site 5:

A: 01011 | 101
B: 10010 | 010

Crossing site 5 has been selected in this particular example through random choice, although any of the other positions could equally well have been selected. The third step is to exchange all characters following the crossing site. A′ and B′ are the two new strings following this crossing:

A′: 01011010
B′: 10010101

String A′ is made up of the first part of string A and the tail of string B. Likewise, string B′ is made up of the first part of string B and the tail of string A. Although crossover has a random element, it should not be thought of as a random walk through the search space.

Mutation. Reproduction and crossover give GAs the majority of their search power. The third operator, mutation, enhances the ability of the GA to find near-optimal solutions. Mutation is the occasional alteration of the value at a particular string position. It is an insurance policy against the permanent loss of any single bit; a generation may otherwise be created that is void of a particular character at a given string position.
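A compact sketch of the three operators is given below. For brevity it uses ordinary roulette-wheel (fitness-proportionate) selection rather than the stochastic remainder selection cited above; the crossover and mutation probabilities follow the values quoted later in the simulations (0.7 and 0.001), and everything else is an illustrative assumption.

```python
# Sketch of a simple three-operator GA on binary strings.
# Roulette-wheel selection replaces stochastic remainder selection here.
import random

def select(pop, fitness):
    """Fitness-proportionate selection of one string."""
    total = sum(fitness)
    pick = random.uniform(0.0, total)
    acc = 0.0
    for s, f in zip(pop, fitness):
        acc += f
        if acc >= pick:
            return s
    return pop[-1]

def crossover(a, b, p_cross=0.7):
    """One-point crossover: swap the tails after a random crossing site."""
    if random.random() < p_cross:
        site = random.randint(1, len(a) - 1)
        return a[:site] + b[site:], b[:site] + a[site:]
    return a, b

def mutate(s, p_mut=0.001):
    """Occasionally flip a bit."""
    return "".join(c if random.random() > p_mut else "10"[int(c)] for c in s)

def next_generation(pop, fitness):
    new_pop = []
    while len(new_pop) < len(pop):
        a, b = select(pop, fitness), select(pop, fitness)
        a, b = crossover(a, b)
        new_pop.extend([mutate(a), mutate(b)])
    return new_pop[:len(pop)]
```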



V. THE SGRFLC ARCHITECTURE

In this section, we present the theory of the proposed fuzzy logic controller, which has the capability to learn its own control rules and to tune the membership functions without expert knowledge. Figure 1 shows the architecture of SGRFLC, which uses a genetic algorithm to design the fuzzy logic controller within a reinforcement learning framework. Its main elements are the action evaluation network (AEN), which acts as a critic and provides advice to the main controller; the action generation network (AGN), which includes a fuzzy controller; the stochastic action modifier (SAM), which uses both F and v to produce an action F′ that is applied to the plant; and an accumulator, which is used as a relative measure of fitness for the genetic algorithm.

Figure 1. The architecture of SGRFLC.

The AGN is a multilayer network representation of a fuzzy logic controller,16 comprising four elements: a GA rule base, a fuzzification interface, decision-making logic, and a defuzzification interface. Here, the GA rule base is a collection of fuzzy conditional statements, but the if–then rules are self-generated by the GA. In this article, the membership functions of the fuzzy variables in the antecedent parts are modified by reinforcement learning, while the membership functions in the consequent parts, which we take to be fuzzy singletons, are generated by the GA. Using the GA, we obtain the optimum fuzzy singletons for the consequent parts. More information about the two networks is presented in the following subsections.
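The interplay of these components over one trial can be pictured with the following hypothetical sketch. The plant, aen, agn, and sam objects and their method names are assumed interfaces invented purely for illustration, not code from the paper; the details of each component are given in the subsections that follow.

```python
# Hypothetical sketch of one SGRFLC training episode; aen, agn, sam and
# plant are assumed interfaces, not code from the paper.

def run_episode(plant, aen, agn, sam, max_steps=500_000):
    state = plant.reset()
    v = aen.predict(state)                       # predicted reinforcement
    steps = 0
    for t in range(max_steps):
        force = agn.act(state)                   # fuzzy inference over GA-generated rules
        force_prime = sam.perturb(force, v)      # stochastic action modifier
        state, failed = plant.step(force_prime)
        r = -1.0 if failed else 0.0              # external reinforcement
        v_next = aen.predict(state)
        r_hat = aen.internal_reinforcement(r, v, v_next, failed)
        aen.update(r_hat, state)                 # AHC-style critic updates
        agn.update(r_hat, force, force_prime, v) # tune antecedent memberships
        v = v_next
        steps += 1
        if failed:
            break
    return steps   # accumulator value, used as relative fitness for the GA
```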


A. The Action Evaluation Network

The AEN, proposed by H.R. Berenji,9 plays the role of an adaptive critic element (ACE)14 and constantly predicts the reinforcement associated with different input states. The only information received by the AEN is the state of the physical system, in terms of its state variables, and whether or not a failure has occurred. It is a two-layer feedforward network with sigmoid units everywhere except in the output layer. The AEN contains h hidden units and n input units, including a bias unit. Each hidden unit receives n inputs and has n weights, while the output unit has n + h weights in this network. Figure 2 shows the structure of the AEN.

Figure 2. The action evaluation network.

The learning algorithm is composed of Sutton's AHC algorithm17 for the output unit and an error backpropagation algorithm18 for the hidden units. The output of the units in the hidden layer is

$$y_i[t, t+1] = g\left(\sum_{j=1}^{n} a_{ij}[t]\, x_j[t+1]\right)$$

where

$$g(s) = \frac{1}{1 + e^{-s}}$$

where t and t + 1 are successive time steps. The output unit of the evaluation network receives inputs both from the units in the hidden layer and directly from the units of the input layer:

$$v[t, t+1] = \sum_{i=1}^{n} b_i[t]\, x_i[t+1] + \sum_{i=1}^{h} c_i[t]\, y_i[t, t+1]$$

where v is the prediction of reinforcement. In the above equations, double time dependencies are used to avoid instabilities in the updating of the weights.19 This network evaluates the action recommended by the action network as a function of the failure signal and of the change in state evaluation based on the state of the system at time t + 1:

$$\hat{r}[t+1] = \begin{cases} 0 & \text{start state} \\ r[t+1] - v[t, t] & \text{failure state} \\ r[t+1] + \gamma\, v[t, t+1] - v[t, t] & \text{otherwise} \end{cases}$$

where $0 \le \gamma \le 1$ is the discount rate.


In other words, the change in the value of v plus the value of the external reinforcement constitutes the heuristic or internal reinforcement $\hat{r}$, where future values of v are discounted more the further they are from the current state of the system.

Learning in the AEN

Weight updating in this network resembles a reward/punishment scheme. The weights connecting the units in the input layer directly to the unit in the output layer are updated according to

$$b_i[t+1] = b_i[t] + \rho_b\, \hat{r}[t+1]\, x_i[t]$$

where $\rho_b > 0$ is a constant and $\hat{r}[t+1]$ is the internal reinforcement at time t + 1. Similarly, the weights connecting the hidden layer to the output layer are updated as

$$c_i[t+1] = c_i[t] + \rho_c\, \hat{r}[t+1]\, y_i[t, t]$$

where $\rho_c > 0$ is a constant. The weight update rule for the hidden layer is based on a modified version of the error backpropagation algorithm.18 Since it is impossible to measure an error directly, $\hat{r}$, as in Anderson,16 plays the role of an error measure in the weight updates. The hidden-layer weights are updated as follows:

$$a_{ij}[t+1] = a_{ij}[t] + \rho_a\, \hat{r}[t+1]\, y_i[t, t]\,\bigl(1 - y_i[t, t]\bigr)\, \operatorname{sgn}(c_i[t])\, x_j[t]$$

where $\rho_a > 0$. Note that the sign of the hidden-to-output weight is used rather than its value. This variation is based on Anderson's empirical result that the algorithm is more robust if the sign of the weight is used rather than its value.
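The following sketch writes out the AEN forward pass and the updates above in code. The layer sizes, initialization range, and learning rates follow the values quoted later in Section VI where they are stated; the value of ρ_c and the exact bookkeeping of time indices are simplifying assumptions.

```python
# Sketch of the action evaluation network (AEN) with AHC-style updates,
# following the equations above. Sizes and rates mirror Section VI where
# given; rho_c and the time-index bookkeeping are assumptions.
import numpy as np

class AEN:
    def __init__(self, n_inputs=5, n_hidden=5,
                 rho_a=0.35, rho_b=0.03, rho_c=0.003, gamma=0.95):
        rng = np.random.default_rng(0)
        self.a = rng.uniform(-0.01, 0.01, (n_hidden, n_inputs))  # input -> hidden
        self.b = rng.uniform(-0.01, 0.01, n_inputs)               # input -> output
        self.c = rng.uniform(-0.01, 0.01, n_hidden)               # hidden -> output
        self.rho_a, self.rho_b, self.rho_c, self.gamma = rho_a, rho_b, rho_c, gamma

    def forward(self, x):
        y = 1.0 / (1.0 + np.exp(-self.a @ x))   # sigmoid hidden units
        v = self.b @ x + self.c @ y             # prediction of reinforcement
        return v, y

    def internal_reinforcement(self, r, v_prev, v_next, failed, start=False):
        if start:
            return 0.0
        if failed:
            return r - v_prev
        return r + self.gamma * v_next - v_prev

    def update(self, r_hat, x, y):
        # b_i += rho_b * r_hat * x_i ;  c_i += rho_c * r_hat * y_i
        self.b += self.rho_b * r_hat * x
        self.c += self.rho_c * r_hat * y
        # a_ij += rho_a * r_hat * y_i (1 - y_i) sgn(c_i) x_j
        self.a += self.rho_a * r_hat * (y * (1 - y) * np.sign(self.c))[:, None] * x[None, :]
```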

B. Action Generation Network

The AGN contains a fuzzy controller modeled by a five-layer neural network. It generates an action by implementing a fuzzy inference scheme, and the control rules are learned within the AGN itself. Figure 3 shows the structure, which is presented as a network with five layers of nodes, each layer performing one stage of the fuzzy inference process.

Figure 3. The action generation network.

Layer 1. The nodes in this layer are the input linguistic variables in the antecedent parts of the control rules. These nodes simply transmit input values to the next layer; no computation is done.

Layer 2. If we use a single node to perform a simple membership function, then the output function of this node should be this membership function. For example, if large is one of the values that x can take, a node computing $\mu_{\text{large}}(x)$ belongs to layer 2. It has exactly one input, and it feeds its output to all the rules using the clause "if x is large" in their if part. For the bell-shaped function,

$$\mu(x) = \exp\bigl(-(x - a)^2 / b\bigr)$$

where a is the center and b is the spread of the membership function. For triangular shapes, this function is given by

$$\mu(x) = \max\left(1 - \frac{|x - a|}{b},\; 0\right)$$

Triangular shapes are often preferred because they are simple and have proven sufficient in scores of application domains.

Layer 3. This layer implements the conjunction of all the antecedent conditions in a rule; a node in layer 3 corresponds to a rule in the rule base. The number of nodes in this layer depends on the number of nodes in layer 2. For example, if there are two input variables in layer 1 and both of them have five membership functions in layer 2, then there are 5 × 5 rules in layer 3. The node itself performs the fuzzy AND operation, which is traditionally either the minimum or the product operation. The minimum operation is

$$O_r = g_r = \min\bigl(\mu_1(x), \ldots, \mu_i(x), \ldots, \mu_n(x)\bigr)$$

The product operation is

$$O_r = g_r = \mu_1(x) \times \cdots \times \mu_i(x) \times \cdots \times \mu_n(x)$$

where $\mu_i(x)$ is the degree of match between a fuzzy label occurring as one of the antecedents of rule r and the corresponding input variable, and n is the number of antecedent conditions of the rule. This operation gives $g_r$, the degree of applicability of rule r.

Layer 4. Each node in this layer corresponds to a consequent label. Its input comes from all rules which use this particular consequent label. For each of the $g_r$ supplied to it, the node computes the corresponding output action as suggested by rule r.


This mapping may be written as $\mu_c^{-1}(g_r)$, where c indicates a specific consequent label and the inverse is taken to mean a suitable defuzzification procedure applicable to an individual rule. In general, the mathematical inverse of $\mu$ may not exist if the function is not monotonic. We usually use a simple procedure to determine this inverse,9 namely that $\mu_c^{-1}(g_r)$ is the x-coordinate of the centroid of the set $\{x : \mu_c(x) \ge g_r\}$. If the membership function is symmetrical, this is the center of the consequent fuzzy set, where $\mu_c(x) = 1$. Using a fuzzy singleton for the consequent label proceeds in the same way.

In general, it is easy to produce the antecedent part of a fuzzy control rule, but it is very difficult to produce the consequent part without expert knowledge, since we do not know which consequent label to select for the then part of each rule. A node in layer 4 corresponds to a rule node in layer 3. The consequent label value of each node in this layer is generated by the GA. Here, the simple genetic algorithm described in Section IV is used. When applying the three-operator GA to a search problem, two decisions must be made: how to code the possible solutions to the problem as finite bit strings, and how to evaluate the merit of each string (which requires decoding).

Coding. We must first code the decision variables as finite-length strings in order to use a genetic algorithm. Each decision variable, the consequent label of a node in this layer, is coded as a binary unsigned integer of length m. Since there are n rules and each rule consequent is represented as an m-bit string, a string of length n × m represents a complete rule set for the fuzzy logic controller.

Decoding. The decoded value of each node is used as the fuzzy singleton of the consequent label. This value is computed as follows:

$$d = f_{\min} + \frac{b}{2^m - 1}\,(f_{\max} - f_{\min})$$

where d is the value of the parameter being coded, b is the integer value represented by the m-bit string, $f_{\max}$ is a user-determined maximum, and $f_{\min}$ is the minimum.
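The coding and decoding just described can be sketched as follows. The rule count, bit length, and force bounds follow the values used later in the simulations (97 rules, m = 5, forces in [-10, 10] newtons); the helper names are assumptions.

```python
# Sketch of coding/decoding consequent singletons as one GA bit string.
# n_rules, m, f_min, f_max follow the simulation settings quoted later.
import random

def random_rule_string(n_rules=97, m=5):
    """One GA individual: n * m bits, m bits per rule consequent."""
    return "".join(random.choice("01") for _ in range(n_rules * m))

def decode(bits, m=5, f_min=-10.0, f_max=10.0):
    """Map each m-bit chunk b to d = f_min + b / (2**m - 1) * (f_max - f_min)."""
    singletons = []
    for i in range(0, len(bits), m):
        b = int(bits[i:i + m], 2)
        singletons.append(f_min + b * (f_max - f_min) / (2 ** m - 1))
    return singletons

if __name__ == "__main__":
    s = random_rule_string()
    print(len(s), decode(s)[:5])
```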

Layer 5. This layer has as many nodes as there are output action variables. Each output node combines the recommendations from all the fuzzy control rules in the rule base using the following weighted sum, the weights being the rule strengths:

$$F = \frac{\sum_{i=1}^{n} g_i\, \mu_{c_i}^{-1}(g_i)}{\sum_{i=1}^{n} g_i}$$

where n is the total number of rules. This layer acts as the defuzzifier.

Learning in the AGN

A natural measure of performance for a reinforcement learning system is $E\{r \mid W\}$, where E denotes the expected value of the reinforcement signal for the given setting of the parameters W. Here, W is the vector of all the weights in the network, which includes the centers and spreads of all antecedent labels used in the fuzzy rules. The objective of the reinforcement learning system is to search the space of all possible weights W for a point where $E\{r \mid W\}$ is maximum.


Hence, we will compute F so as to maximize E, so that the system ends up in a good state and avoids failure. This can be done by a gradient method, that is, $\Delta W_i \propto \partial E / \partial W_i$. To know $\partial E / \partial W_i$, we need to know $\partial E / \partial F$, where F is the output of the action network. The chain rule can be written as

$$\Delta W_i = \eta\, \frac{\partial E}{\partial W_i} = \eta\, \frac{\partial E}{\partial F}\, \frac{\partial F}{\partial W_i}$$

where $\eta$ is a constant. In our learning algorithm, the gradient information $\partial E / \partial F$ is estimated by a stochastic learning method.20 The gradient is estimated as

$$\frac{\partial E}{\partial F}(t) = \bigl(r(t) - v(t)\bigr)\, \frac{F'(t) - F(t)}{\sigma(t)}$$

If the noise has caused the unit to receive a reinforcement signal that is greater than the expected reinforcement, then it is desirable for the unit to have an activation closer to the current activation F′(t); it should update its mean output value in the direction of the noise. That is, if the noise is positive, the unit should update its weights so that its mean value increases. Conversely, if the noise is negative, the weights should be updated so that the mean value decreases. On the other hand, if the reinforcement received is less than the expected reinforcement, then the unit should adjust its mean in the direction opposite to that of the noise. In this paper, $\hat{r}(t)$ is substituted for $r(t) - v(t)$.

For the antecedent labels, triangular membership functions are used in the AGN:

$$\mu_{ip}(x_i) = \max\left(1 - \frac{|x_i - a_{ip}|}{b_{ip}},\; 0\right)$$

where $\mu_{ip}$ is the membership function for the ith input variable $x_i$ (i = 1, . . . , n) in the pth rule, $a_{ip}$ is the center of the membership function, and $b_{ip}$ is its spread. In order to modify or tune the membership functions at the input layer through learning, the following equations are used, based on a modified version of the error backpropagation algorithm.21 Writing the defuzzified output as

$$F = \frac{\sum_{i=1}^{n} g_i\, z_i}{\sum_{i=1}^{n} g_i}$$

the center update is

$$\Delta a_{ip} = \eta_a\, \frac{\partial E}{\partial a_{ip}}$$

with

$$\frac{\partial F}{\partial g_p} = \frac{z_p - F}{\sum_{i=1}^{n} g_i}, \qquad \frac{\partial \mu_{ip}}{\partial a_{ip}} = -\frac{\operatorname{sgn}(a_{ip} - x_i)}{b_{ip}}$$

The above derivatives, together with $\partial g_p / \partial \mu_{ip} = g_p / \mu_{ip}$, can now be combined to obtain the gradient:


$$\Delta a_{ip} = -\eta_a\, \hat{r}\, \frac{F' - F}{\sigma}\, \frac{z_p - F}{\sum_{i=1}^{n} g_i}\, \frac{g_p}{\mu_{ip}}\, \frac{\operatorname{sgn}(a_{ip} - x_i)}{b_{ip}}$$

For $\Delta b_{ip}$ the calculations proceed similarly:

$$\Delta b_{ip} = \eta_b\, \frac{\partial E}{\partial b_{ip}}, \qquad \frac{\partial \mu_{ip}}{\partial b_{ip}} = \frac{|a_{ip} - x_i|}{b_{ip}^2}$$

$$\Delta b_{ip} = \eta_b\, \hat{r}\, \frac{F' - F}{\sigma}\, \frac{z_p - F}{\sum_{i=1}^{n} g_i}\, \frac{|a_{ip} - x_i|}{b_{ip}^2}$$

where $\eta_a$ and $\eta_b$ are learning rate factors. The derivatives can be computed by each node after receiving the relevant values backpropagated through the network. The only nodes whose weights change are those in layer 2 and layer 4; all other edges have weights fixed at 1.
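A sketch of applying the two updates above to a single antecedent label follows. All quantities are passed in explicitly; the function name, the default learning rates (taken from the simulation section), and the small eps guard against division by zero are assumptions.

```python
# Sketch of the antecedent-tuning updates above for one rule p and input i.
# eps and the function signature are assumptions for illustration only.
import math

def tune_antecedent(a_ip, b_ip, x_i, mu_ip, g_p, z_p, F, F_prime, sigma,
                    r_hat, sum_g, eta_a=1e-4, eta_b=1e-4, eps=1e-12):
    common = r_hat * (F_prime - F) / max(sigma, eps) * (z_p - F) / max(sum_g, eps)
    # Delta a_ip = -eta_a * common * (g_p / mu_ip) * sgn(a_ip - x_i) / b_ip
    delta_a = -eta_a * common * (g_p / max(mu_ip, eps)) \
              * math.copysign(1.0, a_ip - x_i) / b_ip
    # Delta b_ip = eta_b * common * |a_ip - x_i| / b_ip**2
    delta_b = eta_b * common * abs(a_ip - x_i) / (b_ip ** 2)
    return a_ip + delta_a, b_ip + delta_b
```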

C. Stochastic Action Modifier

Consider an abstract machine that randomly selects actions according to some stored probability distribution and receives feedback from the environment evaluating those actions. The machine then uses the feedback to update its distribution so as to increase the expectation of favorable evaluations for future actions. Such a machine is said to perform stochastic learning. Here, we use the value of v and the action F to stochastically generate an action F′, which is actually applied to the plant. It is defined as

$$F'(t) \sim N\bigl(F(t), \sigma(t)\bigr)$$

where N denotes a normally distributed random variable with mean F(t) and standard deviation $\sigma(t)$. The predicted reinforcement is used to compute the standard deviation as

$$\sigma(t) = S\bigl(v(t)\bigr)$$

where $S(\cdot)$ is a monotonically decreasing, nonnegative function of v(t). Moreover, S(1.0) = 0.0, so that when the maximum reinforcement is expected, the standard deviation is zero. The stochastic perturbation of the suggested action leads to better exploration of the state space and better generalization ability. If the prediction of reinforcement is high, the magnitude of the deviation |F′ − F| should be small, and so $\sigma$ should be small. Conversely, if the prediction of reinforcement is low, $\sigma$ should be large so that the unit explores a wider interval of its output range. The result is that a large random step away from the recommendation is taken when the last action performed was bad, while the controller remains consistent with the fuzzy control rules when the previous action selected was a good one. The actual form of the function $S(\cdot)$, especially its scale and rate of decrease, should take the units and range of variation of the output variable into account.
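A minimal sketch of the stochastic action modifier is shown below. The particular linear form chosen for S(·) is an assumption; the paper only requires a monotonically decreasing, nonnegative function of v with S(1.0) = 0.

```python
# Sketch of the stochastic action modifier (SAM). The specific S(.) used
# here is an assumption, not the function chosen in the paper.
import random

def sigma_of_v(v, k=1.0):
    """One possible S(v): zero when maximum reinforcement (v = 1) is predicted."""
    return k * max(1.0 - v, 0.0)

def stochastic_action(F, v):
    """Draw F' from a normal distribution with mean F and std sigma = S(v)."""
    return random.gauss(F, sigma_of_v(v))
```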


D. Accumulator

The accumulator plays the role of a relative performance measure: it accumulates the number of time steps until a failure occurs. In this paper, the feedback takes the form of an accumulator that measures how long the pole stays up and how long the cart avoids the ends of the track; this count is used as a relative measure of fitness for the genetic algorithm.

VI. SIMULATION RESULTS

In this section, we apply the SGRFLC approach to the cart–pole balancing problem.

A. The Cart–Pole Balancing Problem

The cart–pole task involves a pole hinged to the top of a wheeled cart that travels along a track; the cart and pole are constrained to move within the vertical plane. The state is specified by four real-valued variables: x, the horizontal position of the cart; $\dot{x}$, the velocity of the cart; $\theta$, the angle of the pole with respect to the vertical; and $\dot{\theta}$, the angular velocity of the pole. The control input is f, a force in (−10, 10) newtons applied to the cart. The dynamics of the cart–pole system are modeled by the following nonlinear differential equations13:

$$\ddot{\theta} = \frac{\displaystyle g \sin\theta + \cos\theta \left[ \frac{-f - m l \dot{\theta}^2 \sin\theta + \mu_c \operatorname{sgn}(\dot{x})}{m_c + m} \right] - \frac{\mu_p \dot{\theta}}{m l}}{\displaystyle l \left[ \frac{4}{3} - \frac{m \cos^2\theta}{m_c + m} \right]}$$

$$\ddot{x} = \frac{f + m l \bigl[\dot{\theta}^2 \sin\theta - \ddot{\theta} \cos\theta\bigr] - \mu_c \operatorname{sgn}(\dot{x})}{m_c + m}$$

where g is the acceleration due to gravity, $m_c$ is the mass of the cart, m is the mass of the pole, l is the half-length of the pole, $\mu_c$ is the coefficient of friction of the cart on the track, and $\mu_p$ is the coefficient of friction of the pole on the cart. These equations were simulated by the Euler method, which uses an approximation to the above equations with a time step of 20 ms.

The goal of the cart–pole task is to apply forces of varying magnitude to the cart so that the pole stays balanced and the cart does not hit the ends of the track. Bounds on the pole angle and on the cart's horizontal position specify the states for which a failure signal occurs. There is no unique solution: any trajectory through the state space that does not result in a failure signal is acceptable. The only information regarding the goal of the task is provided by the failure signal, which occurs either when the pole falls past ±12 deg or when the cart hits the bounds of the track at ±2.4 m; these two kinds of failure are not distinguishable in the case considered here. The goal as just stated makes this task very difficult, since the failure signal is a delayed and rare performance measure.
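The dynamics and failure test above can be simulated directly, as sketched below. The masses and half-pole length follow Section VI-C; the friction coefficients are set to zero and g = 9.8 m/s² is used as assumptions, since the paper does not list those values.

```python
# Euler simulation (20 ms step) of the cart-pole dynamics above.
# mu_c, mu_p = 0 and g = 9.8 are assumptions; other constants follow Sec. VI-C.
import math

G, M_CART, M_POLE, L, MU_C, MU_P, DT = 9.8, 1.0, 0.1, 0.5, 0.0, 0.0, 0.02

def step(state, f):
    x, x_dot, theta, theta_dot = state
    total = M_CART + M_POLE
    tmp = (-f - M_POLE * L * theta_dot ** 2 * math.sin(theta)
           + MU_C * math.copysign(1.0, x_dot)) / total
    theta_acc = (G * math.sin(theta) + math.cos(theta) * tmp
                 - MU_P * theta_dot / (M_POLE * L)) \
                / (L * (4.0 / 3.0 - M_POLE * math.cos(theta) ** 2 / total))
    x_acc = (f + M_POLE * L * (theta_dot ** 2 * math.sin(theta)
             - theta_acc * math.cos(theta))
             - MU_C * math.copysign(1.0, x_dot)) / total
    # Euler integration with a 20 ms time step
    x, x_dot = x + DT * x_dot, x_dot + DT * x_acc
    theta, theta_dot = theta + DT * theta_dot, theta_dot + DT * theta_acc
    failed = abs(theta) > math.radians(12.0) or abs(x) > 2.4
    return (x, x_dot, theta, theta_dot), failed
```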


Figure 4. SGRFLC applied to cart–pole balancing.

B. Applying SGRFLC to Cart–Pole Balancing

Figure 4 presents the SGRFLC architecture as it is applied to this problem. The AEN has four input units, a bias input unit, five hidden units, and one output unit. The input state vector is normalized so that the pole and cart positions lie in the range (0, 1). The velocities are also normalized, but they are not constrained to lie in any range. The weights of this net are initialized randomly to values in (−0.01, 0.01). The external reinforcement is received by the AEN and used to calculate the internal reinforcement.

The Action Generation Network

Layer 1: There are four inputs in this layer, namely the system states $\theta$, $\dot{\theta}$, x, and $\dot{x}$.

Layer 2: This layer has 16 units, which are the linguistic values of the system states.


Figure 5. Learning curves with population size: (a) 100; (b) 50.

States $\theta$ and $\dot{\theta}$ each have five linguistic values, and states x and $\dot{x}$ each have three.

Layer 3: We use 97 rules rather than 225 (5 × 5 × 3 × 3) in order to reduce the number of rule nodes for the GA, since we do not care about x and $\dot{x}$ whenever $\theta$ is PL (NL) or $\dot{\theta}$ is PL (NL). (The 25 − 9 = 16 combinations of $\theta$ and $\dot{\theta}$ in which one of them is PL or NL ignore x and $\dot{x}$, while the remaining 9 combinations are paired with the 3 × 3 values of x and $\dot{x}$, giving 16 + 81 = 97 rules.) We start with random rules, and they are modified through learning.

Table I. The membership functions before learning and after learning.

                             Before Learning              After Learning
Variable   Label          Center        Spread         Center         Spread
θ          NL         -12.000000      9.000000     -11.999960       9.000729
θ          NS          -3.000000      3.000000      -2.957778       3.021018
θ          ZO           0.000000      3.000000       0.000000       3.024111
θ          PS           3.000000      3.000000       2.9999817      3.032208
θ          PL          12.000000      9.000000      12.001279       9.000270
θ̇          NL         -10.000000      5.000000     -10.002113       5.002999
θ̇          NS          -5.000000      5.000000      -5.288717       5.014331
θ̇          ZO           0.000000      5.000000       0.000000       5.006636
θ̇          PS           5.000000      5.000000       4.977542       5.002936
θ̇          PL          10.000000      5.000000       9.963368       5.002635
x          NL          -2.400000      2.400000      -1.986440       2.428723
x          ZO           0.000000      2.400000       0.000000       2.414287
x          PL           2.400000      2.400000       2.083035       2.402693
ẋ          NL          -1.500000      2.400000      -1.0921499      2.402693
ẋ          ZO           0.000000      2.400000       0.000000       2.409821
ẋ          PL           1.500000      1.500000       1.936101       1.501174

Layer 4: Each decision variable is coded as a binary unsigned integer of length 5 (m = 5); thus every rule set for the fuzzy logic controller is represented as a string of length 485. In decoding, $f_{\min}$ is −10 newtons and $f_{\max}$ is 10 newtons. The probability of crossover is 0.7 and that of mutation is 0.001 in our simulations. Population sizes of 5, 50, and 100 are used below.

Layer 5: Only one output unit is used, which computes the force.

C. Results

We implemented our system as shown in Figure 4 on a Sun workstation. The parameter values used in our simulation are: half-pole length 0.5 m; pole mass 0.1 kg; cart mass 1.0 kg; $\gamma = 0.95$, $\rho_a = 0.35$, $\rho_b = 0.03$, $\rho_c = 0.003$, $\eta_a = 0.0001$, $\eta_b = 0.0001$; the bias is 0.1. The learning system is tested for six runs.

Table II. The comparison of the number of trials in our simulations with population sizes 5, 50, and 100 (pole angle ±12 degrees, cart position ±2.4 m).

Population Size     Best    Worst    Mean
5                     47      88       63
50                    12      27       21
100                    6      19       11


Figure 6. State performance of the: (a) pole position; (b) cart position.

A run is called a "success" whenever the number of steps before failure is greater than 500,000, as used in Barto et al.14 (this corresponds to about 2.8 h of real time). The external reinforcement r(t) is −1 when the failure signal occurs; otherwise it is 0. After each failure, the starting membership functions are set to those of the population member with the maximum objective function value.

Table III. Parameter changes of the length and mass of the pole.

Pole Mass    Pole Length    No. of Additional Trials
5x           Same           0
10x          Same           0
10x          (1/5)x         0
20x          (1/10)x        0

Figure 5(a, b) shows the results of our simulations with population sizes 100 and 50, respectively; here the start positions of all four state variables are set to 0 after each failure. Table II shows a comparison of the results of our simulations with population sizes 5, 50, and 100. For each population size the system is tested for four runs, and a run is terminated after 500,000 steps. Our simulations show that increasing the population size yields better performance, whereas smaller populations tend toward faster learning but poorer behavior.

Figure 6(a, b) shows the values of the pole angle and cart position. In each figure, the first 2,000 time steps show the performance of the controller during the initial portion of a trial. The second 2,000 time steps show the performance of the controller after 200,000 time steps. The last 2,000 time steps show the end of the trial, in which the controller learned to balance the system for at least 500,000 time steps. The initial membership functions of all the labels and their modifications after learning are shown in Table I.

In order to show the adaptability of the system, the length and mass of the pole were changed. Four experiments were done, as shown in Table III. In the first two, the original mass of the pole is increased by factors of 5 and 10, respectively. In the third, the original mass of the pole is increased by a factor of 10 and the length of the pole is reduced to one-fifth of the original value. In the last, the original mass of the pole is increased by a factor of 20 and the length of the pole is reduced to one-tenth of the original value. Without any further trials, the system successfully completes these tasks.

Table IV. The new membership functions of all labels in the antecedent parts (before learning).

Variable   Label       Center        Spread
θ          NL       -12.000000      9.000000
θ          NS        -6.000000      6.000000
θ          ZO         0.000000      6.000000
θ          PS         6.000000      6.000000
θ          PL        12.000000      9.000000
θ̇          NL       -20.000000     10.000000
θ̇          NS       -10.000000     10.000000
θ̇          ZO         0.000000     10.000000
θ̇          PS        10.000000     10.000000
θ̇          PL        20.000000     10.000000
x          NL        -2.400000      2.400000
x          ZO         0.000000      2.400000
x          PL         2.400000      2.400000
ẋ          NL        -3.000000      3.000000
ẋ          ZO         0.000000      3.000000
ẋ          PL         3.000000      3.000000


Figure 7. State performance, using new membership functions, of the: (a) pole position; (b) cart position.

In further experiments, we vary the membership functions of all the labels in the antecedent parts, as shown in Table IV. Figure 7(a, b) shows the pole angle and cart position obtained using the new membership functions; here the start positions of all four state variables are set to 0 after each failure. Next, we vary the failure signal, changing the bound on the pole angle (for example to ±24 deg or ±36 deg) and the bound on the cart position (for example to ±1.2 m or ±0.8 m). The results are shown in Table V.


Table V. The results with variable failure signals (population size 100).

Failure signal                                    Best    Worst    Mean
Pole angle ±24 degrees, cart position ±2.4 m        7       32      18
Pole angle ±12 degrees, cart position ±1.2 m        7       29      18
Pole angle ±36 degrees, cart position ±0.8 m       11       42      25

Finally, the initial state of $\theta$ is randomly chosen after each failure. The results are shown in Table VI.

VII. CONCLUSIONS

In this paper, we have proposed the SGRFLC, a new approach to the self-learning and tuning of fuzzy controllers. The approach may be viewed as a step toward a better understanding of how to combine a fuzzy logic controller with a neural network to achieve a significant learning capability. The computer simulation results show that the SGRFLC can generate the control rules and the membership functions of the consequent parts, and can tune the membership functions of the antecedent parts, automatically and without expert knowledge. Because the GA can seek the optimum membership functions for the consequents of the control rules, and reinforcement learning can tune inappropriate definitions of the membership functions in the antecedents of the control rules, the SGRFLC can design better fuzzy logic systems. It is applicable to control problems for which analytical models of the process are unknown, since the SGRFLC learns to predict the behavior of the physical system through its action evaluation network.

Table VI. The results with a randomized start for θ after each failure (pole angle ±12 deg, cart position ±2.4 m).

Population Size     Best    Worst    Mean
100                   46     194      102


References

1. O. Yagishita, O. Itoh, and M. Sugeno, "Application of fuzzy reasoning to the water purification process," in M. Sugeno, Ed., Industrial Applications of Fuzzy Control, North-Holland, Amsterdam, 1985, pp. 19–40.
2. J.A. Bernard, "Use of a rule-based system for process control," IEEE Control Systems Magazine, 8(5), 3–13 (1988).
3. Y. Kasai and Y. Morimoto, "Electronically controlled continuously variable transmission," Proc. Int. Congress on Transportation Electronics, Dearborn, MI, 1988.
4. T.J. Procyk and E.H. Mamdani, "A linguistic self-organizing process controller," Automatica, 15(1), 15–30 (1979).
5. C.T. Lin and C.S. George Lee, "Neural-network-based fuzzy logic control and decision system," IEEE Trans. on Computers, 40(12), 1320–1336 (1991).
6. C.C. Lee, "A self-learning rule-based controller employing approximate reasoning and neural net concepts," Int. J. Intell. Syst., 6, 71–93 (1991).
7. B. Kosko, "Adaptive inference in fuzzy knowledge networks," Proc. 1987 Int. Joint Conf. Neural Networks, 1987, pp. II, 216–268.
8. A. Amano and T. Aritsuka, "On the use of neural networks and fuzzy logic in speech recognition," Proc. 1989 Int. Joint Conf. Neural Networks, 1989, pp. 31–305.
9. H.R. Berenji, "A reinforcement learning-based architecture for fuzzy logic control," Int. J. of Approximate Reasoning, 6, 267–292 (1992).
10. H.R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Trans. on Neural Networks, 3(5), 724–740 (1992).
11. A.G. Barto and M.I. Jordan, "Gradient following without back-propagation in layered networks," Proc. IEEE First Annual Conf. Neural Networks, 1989, pp. 629–636.
12. A.L. Samuel, "Some studies in machine learning using the game of checkers," IBM J. Res. Develop. (1959).
13. D. Michie and R.A. Chambers, "BOXES: An experiment in adaptive control," Machine Intell., 2, 137–152 (1968).
14. A.G. Barto, R.S. Sutton, and C.W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst. Man Cybern., SMC-13, 834–846 (1983).
15. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
16. C.W. Anderson, "Learning and problem solving with multilayer connectionist systems," Ph.D. thesis, Univ. of Massachusetts, 1986.
17. R.S. Sutton, "Temporal credit assignment in reinforcement learning," Ph.D. thesis, Univ. of Massachusetts, 1984.
18. D. Rumelhart, G. Hinton, and R.J. Williams, "Learning internal representations by error propagation," in D. Rumelhart and J. McClelland, Eds., Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986, pp. 318–362.
19. C.W. Anderson, "Strategy learning with multilayer connectionist representations," Tech. Rep. TR87-509.3, GTE Laboratories Inc., May 1988.
20. V. Gullapalli, "A stochastic reinforcement learning algorithm for learning real-valued functions," Neural Networks, 3, 671–692 (1990).
21. D. Rumelhart, G. Hinton, and R.J. Williams, "Learning internal representations by error propagation," in D. Rumelhart and J. McClelland, Eds., Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986, pp. 318–362.