Overcoming Incomplete Perception with Utile Distinction Memory

R. Andrew McCallum
Department of Computer Science University of Rochester Rochester, NY 14627
[email protected]

Abstract

This paper presents a method by which a reinforcement learning agent can solve the incomplete perception problem using memory. The agent uses a hidden Markov model (HMM) to represent its internal state space and creates memory capacity by splitting states of the HMM. The key idea is a test to determine when and how a state should be split: the agent only splits a state when doing so will help the agent predict utility. Thus the agent can create only as much memory as needed to perform the task at hand, not as much as would be required to model all the perceivable world. I call the technique UDM, for Utile Distinction Memory.
1 INTRODUCTION

As researchers explore the problem of learning situated behaviors for robotic tasks, they face the dual blessing and curse of perceptual aliasing [Whitehead and Ballard, 1990]. Perceptual aliasing occurs when there is not a one-to-one mapping between the agent's perception of the world and different world situations. It is a blessing because it can represent as equivalent different world states in which the same action is required. It is a curse because it can also confound different world states in which different actions are required. Ideally one would like to keep the good generalization and get rid of the bad. The task of making desirable distinctions between perceptually aliased states is made more difficult when the agent also suffers from incomplete perception [Chrisman et al., 1991]. If, at any time, no redirection of the perceptual system can produce immediate percepts that distinguish two significantly different world states, then the agent has incomplete perception. World states are significantly different if the agent's best action is not the same in each. Thus incomplete perception is a function of the perceptual system,
the world and the task. For example, in a maze constructed of identical-looking barriers, two hallways in different locations may appear the same no matter how carefully one examines the current surroundings. In such situations the agent will require some additional information to disambiguate the aliased states. One way to do this is by introducing some kind of memory of past percepts and actions. The agent could then distinguish its current location from identical-looking maze locations by "remembering how it got there." (For instance, even if I were walking around our department blindfolded, I might know that I'm in the north hallway instead of the south hallway because I remember that I turned right after exiting the elevator.) Chrisman has presented a technique, called the Perceptual Distinctions Approach, to solve the incomplete perception problem [Chrisman, 1992]. The agent builds a hidden Markov model [Rabiner, 1989] that acts as a predictive model of the environment. The states of the HMM are the internal states of the agent, and estimated future discounted reward values for each action, or state-action values, are stored in each of the HMM states. The agent uses Watkins' Q-learning [Watkins, 1989], except that the agent retrieves and adjusts Q-values using the state-action values from all the HMM states proportionally according to the HMM state occupation probabilities. The technique increases the number of HMM states based on statistically significant differences in the agent's ability to predict perception. Because the new state occupation probabilities of a HMM depend not only on the current percept, but also on the previous state probabilities and the previous action, a HMM state can represent memory. For instance, it is possible to set the transition probabilities in a HMM such that it is extremely unlikely that the agent has arrived in a particular state unless it has come from some other particular previous state. By building chains of such states with exclusive transition probabilities, the occupation of a state can represent an arbitrary amount of memory about past percepts and actions.
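To see how exclusive transition probabilities can carry memory, consider a minimal sketch (illustrative only; the three-state chain, its probabilities, and the observation model below are invented for this example). States s1 and s2 emit the same percept, yet after the forward update the belief settles on the correct one because s2 is, by construction, reachable almost exclusively from s1:

```python
import numpy as np

# Illustrative only: a 3-state chain in which state s2 is, with high probability,
# reachable only from s1.  Occupying s2 therefore encodes the memory
# "I was in s1 one step ago", even though s1 and s2 emit the same percept.
#
# A[i, j] = P(next state = j | current state = i)   (single action, for brevity)
A = np.array([
    [0.05, 0.95, 0.00],   # s0 -> mostly s1
    [0.05, 0.00, 0.95],   # s1 -> mostly s2
    [0.95, 0.05, 0.00],   # s2 -> mostly back to s0
])

# B[i, o] = P(observe o | state i); s1 and s2 are perceptually aliased (both emit o=1).
B = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
])

def forward_step(belief, obs):
    """One step of the forward procedure: predict, then weight by the observation."""
    predicted = belief @ A
    updated = predicted * B[:, obs]
    return updated / updated.sum()

belief = np.array([1.0, 0.0, 0.0])   # start certain of s0
for obs in [1, 1]:                   # two identical percepts in a row
    belief = forward_step(belief, obs)
print(belief)  # all mass ends up on s2: the aliased percepts are disambiguated by history
```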
2 UTILITY-BASED DISTINCTIONS FOR MEMORY
Figure 1: In plain table-based Q-learning the current state is determined entirely by the current percept. Using a HMM, the agent can determine its current state using the current percept, and the previous action and previous state. Plain Q-learning internal states correspond to rows of the Q-table. The internal states of an agent using HMMs correspond to states of the HMM. In each case an internal state holds a collection of future discounted reward estimates, one for each action.

The Utile Distinction Memory (UDM) method presented in this report shares the HMM foundation with Chrisman's technique. It uses a HMM to represent the agent's internal state space; it keeps state-action values in the states of the HMM, and it retrieves and adjusts the state-action values according to state occupation probabilities. The chief difference lies in the way that the size of the HMM state space is increased. Instead of creating new states based on predicting perception, UDM creates new states only when it can show statistically significant increases in the agent's ability to predict reward. This difference causes a fundamental shift in the agent's representational approach. Utility-based distinctions will build an agent internal state space only as large as needed to perform the task at hand (in reinforcement learning, good task performance is defined by the reward the agent receives), whereas perception-based distinctions will build a state space large enough to represent all the world as the agent perceives it. If the agent's task is simple, but the world is complex and the agent can perceive manifestations of the world's complexity, we would still like the agent's internal state space to be simple. The difficulty of learning should be proportional to the difficulty of the task, not the complexity of the world.
How can the agent know when splitting a state will help it predict utility? Often a perceptually aliased state that affects utility will have wildly fluctuating reward values; however, we cannot base a state-splitting test solely on reward variance: some changes in reward are caused by the stochastic nature of the world, and splitting the state will not help the agent more consistently get high reward. The agent must be able to distinguish between changes in reward caused by a stochastic world, and changes that could be predicted after performing a split. What new information does splitting a state give the agent? After splitting a state the agent has the ability to distinguish between two previously aliased states based on knowledge about which state it came from. The key idea behind Utile Distinction Memory is as follows: UDM keeps future discounted reward information associated with the incoming transitions to a state. If the state satisfies the Markov property then all the reward values should be similar; however, if there are statistically significant differences in the rewards, then the identity of the state the agent comes from is significant for predicting reward, the state must be non-Markovian, and splitting this state would help the agent more consistently predict reward. One split may allow further distinctions because it will create new separate transitions into other states. In that UDM can recursively build a tree of distinctions it is similar to Chapman and Kaelbling's G algorithm [Chapman and Kaelbling, 1991], except that it builds distinctions in memory space instead of perception space.
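Stated as a predicate, the test looks roughly like the following sketch (the function and argument names here are invented; the actual statistical test, a confidence-interval comparison, is specified in Section 3):

```python
def should_split(returns_by_incoming_transition, significantly_different):
    """Key idea of UDM, sketched: split a state when the identity of the
    incoming transition predicts the future discounted reward (return).

    returns_by_incoming_transition: dict mapping each incoming transition to
        the returns observed after arriving by that transition.
    significantly_different: stand-in for the statistical test of Section 3
        (there, non-overlapping confidence intervals).
    """
    transitions = list(returns_by_incoming_transition)
    for i, a in enumerate(transitions):
        for b in transitions[i + 1:]:
            if significantly_different(returns_by_incoming_transition[a],
                                       returns_by_incoming_transition[b]):
                return True   # state is non-Markovian with respect to return
    return False              # all incoming transitions agree; leave the state alone
```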
3 DETAILS OF THE ALGORITHM

A slightly longer description of UDM can be found in [McCallum, 1992]. A hidden Markov model is comprised of a finite set of states, $S = \{s_1, s_2, \ldots, s_N\}$, and a finite number of observations (percepts), $O = \{o_1, o_2, \ldots, o_M\}$. We also include a finite set of actions, $A = \{a_1, a_2, \ldots, a_K\}$. (The inclusion of actions actually makes this a Partially Observable Markov Decision Process.) For each state there is a vector of observation probabilities: we write $B(o_i|s_j)$ for the probability of seeing observation $o_i$ while in state $s_j$. For each state-action pair there is a vector of transition probabilities: the notation $A(s_k|s_i, a_j)$ signifies the probability that executing action $a_j$ from state $s_i$ will result in state $s_k$. The agent's belief about the state of the world is maintained in a vector of its state occupation probabilities, written $\vec{\pi}(t) = \langle \pi_1(t), \pi_2(t), \ldots, \pi_N(t) \rangle$, where $\pi_i(t)$ is the agent's belief that the world state is represented by $s_i$ at time $t$. To calculate the state occupation probabilities at time $t+1$, the agent uses the "forward" part of the Baum forward-backward procedure [Rabiner, 1989],

$$\pi_j(t+1) = k \, B(o_{t+1}|s_j) \sum_i A(s_j|s_i, a_t)\, \pi_i(t) \quad (1)$$

where $k$ is whatever constant is needed to make $\sum_i \pi_i(t+1) = 1$, $o_{t+1}$ is the agent's observation at time $t+1$, and $a_t$ is the action executed by the agent at time $t$. While choosing actions and altering state-action values with the reward from actions, UDM uses Q-learning superimposed on hidden Markov models, just as in [Chrisman, 1992]. We write $q(s_i, a_j)$ for the state-action value of action $a_j$ in state $s_i$. The agent obtains Q-values representative of the agent's current belief about the world state from the state-action values kept in the HMM states by letting all the HMM states "vote" in proportion to their occupation probability:

$$Q(\vec{\pi}(t), a_j) = \sum_i \pi_i(t)\, q(s_i, a_j) \quad (2)$$

The agent then chooses the action $a^*$ with the largest Q-value (i.e., $a^* = \mathrm{argmax}_a\, Q(\vec{\pi}, a)$). The state-action values are updated according to the standard Q-learning rule, again modified to use the state occupation probabilities:

$$\forall s_i:\; q(s_i, a_t) = (1 - \alpha\,\pi_i(t))\, q(s_i, a_t) + \alpha\,\pi_i(t)\,\bigl(r_t + \gamma\, U(\vec{\pi}(t+1))\bigr) \quad (3)$$

$$U(\vec{\pi}(t+1)) = \max_a Q(\vec{\pi}(t+1), a) \quad (4)$$

where $\alpha$, $0 < \alpha < 1$, is the learning rate, $\gamma$, $0 < \gamma < 1$, is the temporal discount factor, and $U(\vec{\pi}(t+1))$ is the expected utility of the current state given the agent's beliefs about which states it may be in. Effectively the state occupation probability, $\pi_i(t)$, becomes part of the learning rate. Thus far we have not diverged from [Rabiner, 1989] and [Chrisman, 1992]. Next I describe the part of the algorithm specific to UDM. With the incoming transitions to a state, UDM keeps statistics on the future discounted reward received after leaving the state in question. Future discounted reward is also called return. If the state is Markovian with respect to return, then the return values on all the incoming transitions should be similar. However, if there are statistically significant differences between any of the incoming transitions, splitting the state and appropriately dividing the transitions between the split states will help the agent predict return.
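A minimal sketch of equations (1)-(4) in code, assuming NumPy arrays laid out as A[a, i, j] = $A(s_j|s_i, a)$, B[j, o] = $B(o|s_j)$, pi for the occupation vector, and q[i, a] for $q(s_i, a)$ (these names and layouts are choices made for this sketch, not notation from the text):

```python
import numpy as np

def belief_update(pi, action, obs, A, B):
    """Equation (1): forward update of the state occupation probabilities."""
    unnorm = B[:, obs] * (pi @ A[action])   # B(o|s_j) * sum_i A(s_j|s_i,a) pi_i
    return unnorm / unnorm.sum()            # normalization supplies the constant k

def q_values(pi, q):
    """Equation (2): every HMM state votes in proportion to its occupation probability."""
    return pi @ q

def q_update(q, pi, pi_next, action, reward, alpha, gamma):
    """Equations (3) and (4): Q-learning backup with pi_i(t) folded into the learning rate."""
    utility = np.max(q_values(pi_next, q))      # U(pi(t+1)), equation (4)
    lr = alpha * pi                             # per-state effective learning rate
    q[:, action] = (1.0 - lr) * q[:, action] + lr * (reward + gamma * utility)
    return q
```

Action selection is then an argmax over q_values(pi, q), with whatever exploration policy the agent uses layered on top.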
Figure 2: A state in consideration for splitting. Associated with incoming transitions UDM keeps confidence intervals of future discounted reward for each action executed from the state in question.

Since two perceptually aliased states in the world may have the same utility, but require different actions, UDM actually keeps separate statistics for the different actions executed from the HMM state. Statistical significance is tested using confidence intervals, which we calculate with the same method described in [Kaelbling, 1990]. For each interval we keep a running count, $n$, a sum of values, $\sum x$, and a sum of the squares of the values, $\sum x^2$. The upper and lower bounds are then calculated by

$$\bar{x} \pm t^{(n-1)}_{\delta/2}\, \frac{s}{\sqrt{n}} \quad (5)$$

where $\bar{x} = (\sum x)/n$ is the sample mean, and

$$s = \sqrt{\frac{n \sum x^2 - \left(\sum x\right)^2}{n(n-1)}} \quad (6)$$

is the sample standard deviation, and $t^{(n-1)}_{\delta/2}$ is the Student's t function with $n-1$ degrees of freedom. The parameter $\delta$ determines the confidence with which values will fall inside the interval. These return statistics kept with transitions are never used to change the agent's state-action values, $q(s_i, a)$; they are only used to determine when and how to split a state. The agent begins with a fully connected hidden Markov model containing a state for each percept. The observation probabilities, $B(o_i|s_j)$, are preset such that each percept is biased toward a different state. (Currently UDM has the ability to discover beneficial memory distinctions, but not perceptual distinctions, thus the model is initialized to distinguish between all percepts.) The transition probabilities, $A(s_k|s_i, a_j)$, are all made equal. The agent then executes a series of m-step trials.
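Computing the bounds of equations (5) and (6) from the three running statistics might look like the sketch below (the use of SciPy for the Student's t critical value, and the name delta for the confidence parameter, are assumptions of this sketch):

```python
import math
from scipy import stats

def confidence_interval(count, total, total_sq, delta=0.05):
    """Equations (5)-(6): Student-t confidence interval on the mean return,
    computed from the running count n, sum(x), and sum(x^2) kept with a transition.
    Returns (lower, upper); with fewer than two samples the interval is unbounded."""
    n = count
    if n < 2:
        return (-math.inf, math.inf)
    mean = total / n
    var = (n * total_sq - total ** 2) / (n * (n - 1))
    s = math.sqrt(max(var, 0.0))                      # guard against tiny negative rounding error
    t = stats.t.ppf(1.0 - delta / 2.0, df=n - 1)      # two-sided critical value
    half_width = t * s / math.sqrt(n)
    return (mean - half_width, mean + half_width)

def intervals_overlap(a, b):
    """Non-overlap of two such intervals is the evidence the splitting test looks for."""
    return a[0] <= b[1] and b[0] <= a[1]
```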
Figure 3: Determining when and how a state should be split by using confidence intervals on future discounted reward. Consider all incoming transitions to state 6, and compare the set of confidence intervals corresponding to each action executed from state 6. If any of the intervals do not overlap, then the agent has determined that by treating state 6 differently depending on which state it came from it can get a statistically significant increase in its ability to predict reward. The set of intervals is divided into clusters of overlapping intervals, and each cluster is assigned to a different copy of state 6. The confidence interval table in the upper right only shows the intervals for one outgoing action.

During each trial the agent updates the $q(s_i, a_j)$ values according to the equations above, and also keeps a record of the actions, percepts, and rewards in history arrays $A[t]$, $P[t]$ and $r[t]$. At the end of each m-length trial the agent runs the tests for determining what states, if any, should be split. This begins by performing the Baum-Welch procedure for updating the model parameters. The procedure improves the model's ability to use percepts and actions to distinguish the agent's current state, but since the test used for splitting states is separate from Baum-Welch, the agent will not create new memory capacity in order to predict perception. In the process of the Baum-Welch procedure the agent calculates improved estimates of the state occupation probabilities over time, $\pi_i(t)$, and estimates of the transition occupation probabilities over time, $\xi(s_i, a_j, s_k | t)$ (the probability that at time $t$ the agent passed from state $s_i$ to state $s_k$ using action $a_j$). See [Rabiner, 1989] for details. Then the agent calculates return values over time, $return[t]$, using the reward history, $r[t]$:

$$return[m] = r[m]; \qquad return[t] = r[t] + \gamma\, return[t+1] \;\;\text{for } t = m-1, \ldots, 0 \quad (7)$$

where $\gamma$ is the temporal discount factor. Because any state could be aliased and could have state-action values resulting from a meaningless combination of rewards from different world states, the state-action values (and state utilities defined as the maximum over the state-action values) cannot be trusted, and only reward directly from the world is used to calculate return. Now the agent has all the information necessary to assign return values to specific transitions. The agent considers the value of $return[t]$ at each time step and includes this value in the statistics for each transition in proportion to the transition occupation probability at the previous time step:

    for t = 1 to m-1
        for all transitions, trans, that use action A[t-1]:
            lr = alpha * xi_trans(t-1)
            trans.count[A[t]]      += lr
            trans.sum[A[t]]        += lr * return[t]
            trans.sumsquares[A[t]] += lr * (return[t])^2          (8)

where $\alpha$ is the agent's learning rate, $lr$ is the learning rate for a particular transition at a particular time, and $\xi_{trans}(t-1)$ is the transition occupation probability $\xi(s_i, a_j, s_k | t-1)$ at time $t-1$ of the particular transition, $trans$, belonging to source state $s_i$, action $a_j$ and destination state $s_k$. The trans-dot quantities in the last three lines refer to the three statistics necessary for computing the upper and lower bounds: trans.count is the count, $n$, trans.sum is the sum, $\sum x$, and trans.sumsquares is the sum of squares, $\sum x^2$. The $A[t]$ subscripts indicate that they are indexed by the different actions executed in the destination state of the transition. Using the count, sum, and sum of squares the agent can obtain upper and lower bounds of the return value confidence intervals for each transition. The agent determines whether the returns on two incoming transitions are significantly different by measuring whether or not their confidence intervals overlap. When any two of the confidence intervals with the same outgoing action fail to overlap, the state is split. The agent divides the set of incoming transitions into disjoint subsets, such that all transitions in a subset overlap with each other and each subset is as large as possible. To perform this clustering optimally is actually an NP-complete problem (Minimum Cover, [Garey and Johnson, 1979]), but in practice, greedy solutions are adequate. After the clustering, the state is duplicated (including incoming and outgoing transitions) enough times so that there is one copy per cluster. Then, any incoming transitions that are not elements of a duplicate state's assigned cluster are removed. States may also be joined if the union of their incoming transitions all overlap with each other, and if one set of incoming transitions is a subset of the other. If the analysis at the end of any m-length trial results in any splits or joins, the trans-dot statistics are reinitialized.
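Putting the pieces together, the return computation of equation (7), the statistics update of equation (8), and a greedy version of the overlap clustering might look roughly like the following sketch (the data-structure layout, the function names, and the particular greedy grouping are assumptions, not the paper's code):

```python
from collections import defaultdict

def discounted_returns(rewards, gamma):
    """Equation (7): return[t] = r[t] + gamma * return[t+1], computed backwards
    over the reward history of one m-step trial."""
    returns = list(rewards)
    for t in range(len(rewards) - 2, -1, -1):
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns

def make_stats():
    """stats[(i, a, j)][a_next] -> [count, sum, sum_of_squares] for the transition
    s_i --a--> s_j, indexed by the action a_next executed from the destination state."""
    return defaultdict(lambda: defaultdict(lambda: [0.0, 0.0, 0.0]))

def accumulate_transition_stats(stats, xi, actions, returns, alpha):
    """Equation (8): fold return[t] into every transition that could have produced it,
    weighted by its transition occupation probability xi[t-1][(i, a, j)]."""
    for t in range(1, len(returns)):
        for trans, prob in xi[t - 1].items():
            if trans[1] != actions[t - 1]:        # only transitions using action A[t-1]
                continue
            lr = alpha * prob
            entry = stats[trans][actions[t]]
            entry[0] += lr
            entry[1] += lr * returns[t]
            entry[2] += lr * returns[t] ** 2

def overlapping(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def greedy_overlap_clusters(intervals):
    """Greedy stand-in for the (NP-complete) optimal clustering: group incoming
    transitions whose confidence intervals all mutually overlap.  More than one
    cluster means the state should be split, one copy per cluster."""
    clusters = []
    for trans, interval in intervals.items():
        for cluster in clusters:
            if all(overlapping(interval, intervals[other]) for other in cluster):
                cluster.append(trans)
                break
        else:
            clusters.append([trans])
    return clusters
```

The intervals passed to greedy_overlap_clusters would be produced from these statistics by the confidence-interval sketch after equation (6), computed separately for each action executed from the destination state, as the text requires.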
Figure 4: One of several mazes successfully solved by UDM. The agent's perception is defined by a bit vector of length four in which the bits specify whether or not there is a barrier to the agent's immediate north, east, south, west. The numbers in the squares are the decimal equivalents of the bit vectors interpreted in binary. Although the three state 5's are all perceptually equivalent, the agent must learn to go south in the 5 in the center, and go north in the 5's on the right and left. The hidden Markov model built by UDM has two states with high probabilities of observing 5: one representing the 5 in the center and the other combining the two 5's on the right and left. Percept 10 is also aliased, and is split in two since different actions are required in the two world states. See Figure 5 for diagrams of the HMM built to solve this maze.
4 EXPERIMENTAL RESULTS

I have demonstrated UDM working in a grid world where perception is defined by immediately adjacent barriers. The Local Perception Grid World (LPGW) is based on simulated worlds used by [Sutton, 1990], [Whitehead, 1992], [Thrun, 1992] and others. The agent moves about a discrete grid by executing the actions North, East, South, and West. Whenever the agent attempts to move into a grid square occupied by a barrier, the agent remains where it is and receives a reward of -1.0. Whenever the agent moves into any of the one or more "goal" squares, it receives a reward of 1.0. For all other actions the agent receives a reward of -0.1. After reaching the goal square the agent is randomly transported to another location in the maze. The difference between the LPGW and the other grid worlds is that instead of defining perception as the unique row and column numbers of the agent's position (a perception that does not cause perceptual aliasing), LPGW defines perception to be a bit vector of length four in which the bits specify whether or not there is a barrier to the agent's immediate north, east, south and west. This perception is rich with perceptual aliasing possibilities (and is also a bit more realistic for a navigating robot one might build). In some experiments I also added noise to the Grid World. With probability 0.1 the agent perceives some other perception vector than its current world position would have specified. Also with probability 0.1 the agent's chosen action is randomly changed to one of the others. One maze solved by UDM appears in Figure 4. Without noise, UDM consistently learned the optimal policy for this maze in five trials of 500 steps each. The agent used a learning rate of α = 0.6, a temporal discount factor of γ = 0.7, a confidence on the upper and lower bounds of (1 - δ) = 0.95, and a random action probability (exploration rate) of 0.1. The sequence of HMMs created appears in Figure 5. Notice that percept 5 is represented by only two states even though there are three world states that produce percept 5. On the sides, where the same action is required, UDM keeps the two world state 5's aliased to the same internal state. With noise, UDM learned the same optimal policy, but the agent required 15 trials of 500 steps each.
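Read literally, the LPGW perception and reward definitions amount to something like the following sketch (the bit ordering, with north as the most significant bit, is an assumption; it happens to be consistent with the cell labels in Figure 4):

```python
def lpgw_percept(north, east, south, west):
    """The LPGW percept: a 4-bit vector, one bit per adjacent barrier, shown in the
    figures as its decimal value.  E.g. barriers to the east and west (a vertical
    corridor) give binary 0101 = 5; barriers to the north and south give 1010 = 10."""
    return (north << 3) | (east << 2) | (south << 1) | west

def lpgw_reward(hit_barrier, reached_goal):
    """Reward structure from the text: -1.0 for bumping into a barrier,
    +1.0 for reaching a goal square, -0.1 for every other step."""
    if hit_barrier:
        return -1.0
    if reached_goal:
        return 1.0
    return -0.1
```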
5 CONCLUSIONS

The Utile Distinction Memory technique provides a refutation of the "Utile-Distinction Conjecture" [Chrisman, 1992], which stated that it was impossible to introduce only memory distinctions that impact utility. As such, UDM acts as a starting point from which to design improved learners that use memory to create selective, task-specific representations of their environment. There are other mechanisms for creating memory besides hidden Markov models. Lin and Mitchell present successful results with agents that use a fixed-size window of past percepts and actions, and also agents that use recurrent neural networks with context units [Lin and Mitchell, 1992]. These methods do not split or add nodes to the network during learning, and thus the memory capacity must be known and fixed before learning begins. We could also say that [Tan, 1991] deals with incomplete perception in that his robot remembers the results of several perceptual actions to determine its internal state. However, it has no mechanism for memory that spans across an overt action, and the algorithm only works in deterministic worlds. The first point prevents Tan's agent from solving tasks like the Local Perception Grid World, where perceptual information is limited to only the immediately adjacent squares. The first point also makes the memory problem much simpler: the agent no longer has to determine what to remember and when to forget; it remembers the results of all its perceptual actions, then throws all its
memory away when it does an overt action. UDM is not without limitations and problems. The computational requirements for the state splitting test are significant, both in terms of storage and calculation. Any method that uses the Baum-Welch procedure requires O(KN²m) time and storage. As long as K < m, UDM does not increase these requirements; however, the increased constants are not insignificant. Hopefully UDM's minimal, task-specific state splitting will allow small enough state spaces to at least partially make up for this. UDM splits states in order to increase memory-based distinctions, but it currently has no method for splitting states to increase perceptual distinctions that predict future reward. For this reason UDM begins with one state per percept, a strategy that will obviously not work for large perception spaces. Some method for making perceptual distinctions will be necessary, and it seems plausible that Chapman and Kaelbling's G algorithm [Chapman and Kaelbling, 1991], or even some technique based on confidence intervals for the unused perception bits in a state, should work in conjunction with UDM. Although UDM can build memory chains of arbitrary length, it does require that some statistically significant benefit be detectable for each split individually, in sequence. UDM has solved mazes with multi-step paths between the reward-predicting percepts and the reward; however, it would not be able to solve problems in which a conjunction of percepts in sequence predicted reward, but each of the percepts on its own was independent of future reward. This assumption about the detectable relevance of pieces of information in isolation is also made by [Chapman and Kaelbling, 1991] and [Maes and Brooks, 1990]; additionally, it is a key factor in Genetic Algorithms. My primary goal in working with memory for learning agents is to use memory to disambiguate the ubiquitous perceptual aliasing that occurs with active perception [McCallum, 1993]. In fact, the use of memory with active perception has the potential to solve UDM's lack of splitting for perceptual distinctions quite nicely. Imagine an agent that chooses from among overt and perceptual actions that each return one bit, and this one bit makes up the agent's entire immediate perception space (one-bit perception being the extreme case). We can think of the different perceptual actions as each returning a different bit of the total perception vector available from the current world state. The agent can generalize in perception space by only executing the actions necessary to get the bits that are currently relevant. This scheme is even better than generalizing by using "don't cares" as a mask on the perception vector, because the agent's policy can effectively act as a decision tree that specifies which bits are needed: depending on the result of a bit gathered early in the process the agent can decide whether or
not it needs to gather more bits before it executes an overt action. This scheme will also allow the agent to execute open-loop sequences of actions. Taking this route, however, will require addressing the assumption about the detectable relevance of individual splits, because perceptual bits that provide useful information only in conjunction violate the assumption. For example, a student driver learning to pass should pull into the passing lane when the car in front is close, and the rearview mirror is clear, and the blind-spot is clear. Verifying this conjunction requires remembering results from three eye movements. When true, this conjunction is highly correlated with reward; however, any of the bits individually is not.
Acknowledgments
This work has benefited from discussions with many colleagues, including: Dana Ballard, Mary Hayhoe, Jeff Schneider, Jonas Karlsson and Polly Pook. I am grateful to Dana Ballard, Jeff Schneider and Virginia de Sa for making helpful comments on an earlier draft. This material is based on work supported by NSF research grant no. IRI-8903582, NIH/PHS research grant no. 1 R24 RR06853, and a grant from the Human Science Frontiers Program.
References
[Chapman and Kaelbling, 1991] David Chapman and Leslie Pack Kaelbling. Learning from delayed reinforcement in a complex domain. In Proceedings of IJCAI, 1991. [Chrisman et al., 1991] Lonnie Chrisman, Rich Caruana, and Wayne Carriker. Intelligent agent design issues: Internal agent state and incomplete perception. Working Notes of the AAAI Fall Symposium: Sensory Aspects of Robotic Intelligence, 1991. [Chrisman, 1992] Lonnie Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI-92, 1992. [Garey and Johnson, 1979] Michael R. Garey and David S. Johnson. Computers and Intractability, A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979. [Kaelbling, 1990] Leslie Pack Kaelbling. Learning in Embedded Systems. PhD thesis, Stanford University, 1990. [Lin and Mitchell, 1992] Long-Ji Lin and Tom M. Mitchell. Reinforcement learning with hidden states. In Proceedings of the Second International Conference on Simulation of Adaptive Behavior: From Animals to Animats, 1992.
[Maes and Brooks, 1990] Pattie Maes and Rodney A. Brooks. Learning to coordinate behaviors. In Proceedings of AAAI-90, pages 796-802, 1990.
[McCallum, 1992] R. Andrew McCallum. First results with utile distinction memory for reinforcement learning. Technical Report 446, University of Rochester Computer Science Dept., 1992. [McCallum, 1993] R. Andrew McCallum. Learning with incomplete selective perception. Technical Report 453, University of Rochester Computer Science Dept., April 1993. PhD thesis proposal. [Rabiner, 1989] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989. [Sutton, 1990] Richard S. Sutton. Integrating architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, Texas, 1990. Morgan
Kaufmann. [Tan, 1991] Ming Tan. Cost sensitive reinforcement learning for adaptive classification and control. In AAAI, 1991. [Thrun, 1992] Sebastian B. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, CMU Comp. Sci. Dept., January 1992. [Watkins, 1989] C.J.C.H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University Psychology Dept., 1989. [Whitehead and Ballard, 1990] Steven D. Whitehead and Dana H. Ballard. Learning to perceive and act. Technical Report 331, University of Rochester Computer Science Dept., June 1990. [Whitehead, 1992] Steven Whitehead. Reinforcement Learning for the Adaptive Control of Perception and Action. PhD thesis, Department of Computer Sci-
ence, University of Rochester, 1992.
Figure 5: The sequence of hidden Markov models created by UDM as it learns the task shown in Figure 4. The HMM states are fully connected, but the diagrams only show transitions with probability greater than 0.3. Also not shown here are the different actions that cause the transitions. Trial 1: The agent has learned some transition probabilities, but it has not yet gathered enough statistics to split any states. Trial 2: The agent has discovered that the future discounted reward received after leaving state 5 is significantly different when it arrives from state 8 than it is when it arrives from state 7. UDM has duplicated state 5, creating 5a and 5b, then removed the state 8 incoming transition from 5a and removed the state 7 incoming transition from 5b. Trial 3: The agent has found non-overlapping confidence intervals on the transitions into 5b: arriving from state 8 is significantly different than arriving from states 9 or 12. UDM temporarily splits 5b into 5b and 5c, but then 5c is joined with 5a. Trial 4: Two splits occur at the end of this trial. Average future reward after leaving the center state 7 is greater than after leaving the side state 7's because from the center state 7 the agent is teleported to locations that are often closer to the goal. UDM splits state 7, recognizing that arriving from 5a is different than arriving from 5b. State 10 is also split because the confidence intervals for leaving 10 by going east and leaving 10 by going west don't overlap in the transitions coming from states 9 and 12. The agent will now arrive in state 10a from state 9 and arrive in state 10b from state 12. There are no more changes to the HMM after trial 5.