Active Online Learning of the Bipedal Walking

Dingsheng Luo, Yi Wang, Xihong Wu*
Speech and Hearing Research Center, Key Lab of Machine Perception (Ministry of Education)
School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
{dsluo, wangyi, wxh}@cis.pku.edu.cn
* Corresponding author, IEEE Senior member.

Abstract: For legged-robot walking pattern learning, the current mainstream and state-of-the-art research mostly follows a so-called computer-simulation-based framework, in which the walking pattern is learned on a pre-established simulation platform. However, when the learned walking pattern is applied to a real robot, an additional adaptation procedure is always required because of the large gap between the simulated and real walking environments. This is even more critical for bipedal walking, whose control is more difficult than that of other platforms such as quadruped robots. In this paper, a novel framework for actively learning bipedal walking online, directly on a physical robot, is proposed. To make the learning procedure both fast to converge and highly efficient, a polynomial response surface surrogate model, an orthogonal-experimental-design-based active learning strategy, and a gradient ascent algorithm are combined. Experimental results on the real humanoid robot PKU-HR3 show the framework's effectiveness, indicating that it is a promising alternative for bipedal walking pattern learning.

Keywords: humanoid robot; bipedal walking pattern; online learning; surrogate model; active learning
I. INTRODUCTION
Comprehensive performance with respect to speed, flexibility, and stability is the most important and essential objective of robot locomotion, and it remains a great challenge because the control of robot walking, especially bipedal walking, is a highly complex and nonlinear dynamic problem. Over decades of research, various approaches have been proposed toward this goal. Typically, they can be categorized as trajectory-based methods, such as the Zero Moment Point (ZMP) [1][2] and the Inverted Pendulum Model [3]; biologically inspired methods, such as the Central Pattern Generator (CPG) [4][5]; and heuristic control methods, like Virtual Model Control (VMC) [6]. Generally speaking, the problem of achieving satisfactory bipedal walking, especially under trajectory-based locomotion control, is to optimize a number of parameters, a task often called walking/gait pattern learning.

To obtain an optimal walking pattern, the traditional and most immediate method is hand tuning, in which the walking parameters are determined according to human expertise [7]. Although this strategy has proven successful, and has even achieved good results in the RoboCup Standard Platform League and Humanoid League competitions [8], it has obvious drawbacks: it is time consuming and offers no guarantee of reaching an optimum. To avoid these drawbacks, machine learning approaches have attracted much attention. The most common scheme is the computer-simulation-based approach, where a simulation platform is first set up by modeling the real environment, and a learning algorithm, such as a genetic algorithm [9], hill climbing [10], policy gradient [11], or particle swarm optimization [12], is then employed on it to learn the walking pattern. This kind of approach avoids the difficulties of hand tuning to some extent and has been applied successfully to walking pattern learning. Nevertheless, because the robot's real environment is usually too complex to be modeled precisely, the walking pattern produced on the simulation platform usually does not work well when applied directly to a real robot; it can only serve as an initial input, followed by a necessary adaptive learning process to reach a satisfactory pattern. Therefore, learning directly on the real robot is a promising alternative.

Theoretically, the learning algorithms used in the simulation-based framework could be employed directly for real-robot walking pattern learning. Unfortunately, investigation shows that most of these algorithms require too many evaluations or trials, which is unacceptable for learning on a real robot given the expensive mechanical wear. For example, representative algorithms such as hill climbing and policy gradient usually take more than 200 evaluations to converge [14]. In [15], Gaussian process regression was used to learn the walking pattern directly on a real robot, but the physical platform was a quadruped rather than a biped. In [16], a true biped robot was used as the physical platform, on which a sequential surrogate optimization algorithm was proposed to learn the walking pattern directly, reducing the number of evaluations to about 50. Yet the surrogate model used in [16] was a Kriging model, whose construction has high time complexity [17]; furthermore, extreme point search for a Kriging model is a high-dimensional nonlinear problem, and the adopted optimizer SNOPT [18] is also computationally expensive. This makes the walking pattern optimization in [16] essentially an off-line learning process. Clearly, learning bipedal walking on a real robot in a rapid, online fashion, with no manual intervention, is highly desirable. Based on the above analysis, two necessary conditions are suggested: first, the learning should be efficient enough that the whole process can run online, especially considering the limited computational resources of the physical robot; second, convergence should be fast enough to avoid expensive repeated mechanical wear. In this paper, a novel active online learning framework is proposed, in which an orthogonal
experimental-design-based active learning strategy, a polynomial response surface surrogate model, and a gradient ascent algorithm are employed to guarantee both rapid convergence and efficient learning.

The paper is organized as follows. Section II describes the proposed online learning framework. Section III presents experiments that evaluate its effectiveness. Section IV discusses conclusions and future work.

II. ONLINE LEARNING BASED ON A REAL ROBOT
A. Problem Formulation

Given a continuous N-dimensional walking pattern space P, where N denotes the number of parameters in a walking pattern, and an objective function

$$ f(p): P \rightarrow \mathbb{R}, \quad p \in P, \qquad (1) $$
the goal is to learn an optimal p* that satisfies

$$ p^* = \arg\max_{p \in P} f(p). \qquad (2) $$

For such a typical reinforcement learning problem, with robot walking speed as the learning objective, a commonly used immediate reward function, called the fitness function, can be defined following the idea of [12][13]:

$$ \mathrm{fitness}(p) = \begin{cases} 0, & \text{if the robot falls down,} \\ \dfrac{v}{v_{\max}}, & \text{otherwise,} \end{cases} \qquad (3) $$
where v is the velocity observed over the evaluation of walking pattern p, and v_max is the expected maximal velocity.

As mentioned above, stability is also critical in bipedal walking. To take stability into account in the learning objective, the variances of the body's forward-back and left-right sway angles are combined and negated to measure walking stability:

$$ \mathrm{stability}(p) = -\left[ \beta \frac{\sigma_{fb}}{\sigma_{fb\_\max}} + (1-\beta)\, \frac{\sigma_{lr}}{\sigma_{lr\_\max}} \right], \qquad (4) $$

where σ_fb and σ_lr are the variances of the forward-back and left-right sway angles over an evaluation of walking pattern p, σ_fb_max and σ_lr_max are the expected maximum values, and β is a weight balancing the two variances. The modified fitness function is then

$$ \mathrm{fitness}_{new}(p) = \begin{cases} 0, & \text{if the robot falls down,} \\ \alpha \dfrac{v}{v_{\max}} + (1-\alpha)\,\mathrm{stability}(p), & \text{otherwise,} \end{cases} \qquad (5) $$

where α is a weight that controls the relative gain of the two objectives, walking speed and stability.
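To make the reward concrete, the following minimal Python sketch computes (4) and (5). The fall detection and the normalization bounds v_max, σ_fb_max, and σ_lr_max are placeholders, while α = 0.8 and β = 0.5 match the values used later in the experiments of Section III.

```python
def fitness_new(v, fell, sigma_fb, sigma_lr,
                v_max=0.3, sigma_fb_max=1.0, sigma_lr_max=1.0,
                alpha=0.8, beta=0.5):
    """Modified reward (5) with the stability term (4).

    v: velocity observed during one evaluation of pattern p
    fell: whether the robot fell down during the evaluation
    sigma_fb, sigma_lr: variances of the forward-back / left-right
        body sway angles (e.g. from the on-board inertial sensor)
    The normalization bounds above are illustrative placeholders.
    """
    if fell:
        return 0.0
    stability = -(beta * sigma_fb / sigma_fb_max
                  + (1.0 - beta) * sigma_lr / sigma_lr_max)
    return alpha * (v / v_max) + (1.0 - alpha) * stability
```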
B. Surrogate Optimization

Generally, a surrogate model aims to express a system from input-output sample pairs, just like other modeling approaches. The dominant difference is that it is mainly expected to capture the profile characteristics of a system rather than its detailed properties. It is therefore very useful when the main purpose is to find an optimum of a black-box system from a few known input-output pairs. Especially for real-world optimization problems that require expensive and time-consuming experiments and/or simulations to evaluate objective and constraint functions, surrogate optimization is a reasonable choice [19] and has been applied successfully [20][21]. Real-robot walking pattern learning is exactly this kind of task, which makes surrogate optimization an ideal choice here.

Surrogate optimization consists of two components: surrogate model construction and extreme point search. In the construction stage, a suitable parameterized model is first selected:

$$ SM(p, \theta): P \times \Theta \rightarrow \mathbb{R}, \quad p \in P,\ \theta \in \Theta, \qquad (6) $$

where P is the walking pattern space and Θ is the parameter space of the selected surrogate model SM. The parameters θ can then be learned from a set of sample pairs

$$ D = \{(p_t, y_t)\}_{t=1}^{T}, \qquad (7) $$

where

$$ y_t = \mathrm{fitness}_{new}(p_t), \quad p_t \in P, \qquad (8) $$

according to (5). The surrogate model SM thus established can be regarded as a representative of the objective function in (1). Then, in the extreme point search stage, an optimization method is employed to find the extreme point p* according to (9), which is a direct derivation of (2):

$$ p^* = \arg\max_{p \in P} SM(p). \qquad (9) $$
For online learning of a bipedal walking pattern, as mentioned above, one necessary condition is high efficiency, so that a real-time online learning process is possible. Thus, among the various kinds of surrogate models, the polynomial response surface model in (10) is chosen in this research:

$$ SM_{poly}(p, \theta) = \theta_0 + \sum_{i=1}^{N} \theta_i p_i + \sum_{i=1}^{N} \sum_{j \geq i}^{N} \theta_{ij} p_i p_j + \varepsilon, \qquad (10) $$

where i, j = 1, ..., N and p_i is a parameter of pattern p. Given a set of observed sample pairs D as in (7), the parameters θ can easily be obtained with a simple least mean squares (LMS) algorithm; that is, the polynomial response surrogate model is established. Then, according to

$$ p^* = \arg\max_{p \in P} SM_{poly}(p), \qquad (11) $$

which is derived from (9), the extreme point p* of SM_poly can be reached.
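As an illustration of (10) and (11), the sketch below fits the quadratic response surface by an ordinary least-squares solve (a batch stand-in for the paper's LMS update) and maximizes it by crude random search inside the box P. The paper does not name its optimizer for (11), so the random-search maximizer is an assumption.

```python
import numpy as np
from itertools import combinations_with_replacement

def features(p):
    """Feature vector of the response surface (10): bias, linear terms
    p_i, and quadratic terms p_i * p_j for j >= i."""
    p = np.asarray(p, dtype=float)
    quad = [p[i] * p[j]
            for i, j in combinations_with_replacement(range(len(p)), 2)]
    return np.concatenate(([1.0], p, quad))

def fit_surrogate(patterns, fitnesses):
    """Estimate theta from the sample set D of (7) by least squares."""
    X = np.stack([features(p) for p in patterns])
    theta, *_ = np.linalg.lstsq(X, np.asarray(fitnesses, dtype=float),
                                rcond=None)
    return theta

def maximize_surrogate(theta, bounds, n=10_000, rng=None):
    """Random-search stand-in for (11) over the box `bounds` (N x 2)."""
    rng = rng or np.random.default_rng(0)
    cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n, len(bounds)))
    scores = np.array([features(p) @ theta for p in cand])
    return cand[np.argmax(scores)]
```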
C. Active Learning Strategy

In surrogate model construction, since the walking pattern space P is multi-dimensional and continuous, how to collect the points p_t (t = 1, ..., T) from P that form the sample pair set D in (7) is an essential problem. Commonly, the larger the number of sample pairs T, the more precise the constructed surrogate model. However, since each sample pair requires an individual evaluation, a larger T also means more evaluations, and, recalling the second necessary condition from Section I, the number of expensive evaluations should not be too large. This is a dilemma in surrogate modeling for real-robot online learning.

In this research, a Design of Experiments (DOE) based active learning strategy is proposed to resolve this dilemma. DOE is one of the basic problems in statistical multivariate (multi-factor) analysis: an efficient procedure for planning experiments so that the data obtained can be analyzed to yield valid and objective conclusions. An experimental design lays out a detailed experimental plan in advance, and a well-chosen design maximizes the amount of information obtained for a given amount of experimental effort. To make the active collection of points from P fit the DOE setting, the continuous parameter space P is first transformed into a new space P_d by discretization. Regarding each parameter of a walking pattern as a factor, and the grades of each parameter after discretization as the corresponding factor levels, the problem becomes a typical DOE problem in a statistical multi-factor analysis task.

Orthogonal experimental design is one of the typical DOE methods in statistical multi-factor analysis, and it is usually both effective and economical [22]; it is therefore adopted to actively collect points from the new parameter space P_d in this study. An orthogonal experimental design can be completed simply by table look-up in Taguchi's orthogonal arrays [23]. As an example, for a problem with 3 three-level factors, an orthogonal array denoted $L_9(3^3)$ is readily available; it lists the experimental designs, i.e., the points to be selected, and the subscript 9 indicates the required number of points, whereas a full factorial test of this problem would require 27 evaluations. The orthogonal DOE based active learning strategy thus greatly reduces the number of evaluations in a principled way, which keeps the whole learning process in an active, real-time, online manner.
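To make the table look-up concrete, the sketch below builds the standard $L_{25}(5^6)$ array by the classical GF(5) construction and uses one column per factor; for the 4 five-level factors used later in the experiments, this selects 25 walking patterns instead of the 5^4 = 625 of a full factorial. The construction itself is standard, but its use as the paper's exact array is an assumption.

```python
import numpy as np

def taguchi_l25():
    """Standard L25(5^6) orthogonal array over GF(5): rows are indexed
    by (a, b), and the six columns are a, b, a+b, a+2b, a+3b, a+4b
    (mod 5), so every pair of columns contains each of the 25 level
    combinations exactly once."""
    rows = [[a, b] + [(a + k * b) % 5 for k in range(1, 5)]
            for a in range(5) for b in range(5)]
    return np.array(rows)

def design_points(grids):
    """Map coded levels 0..4 to concrete values; `grids` holds one
    five-level grid per walking-pattern factor (cf. Table I)."""
    oa = taguchi_l25()[:, :len(grids)]  # one array column per factor
    return [[grids[f][int(row[f])] for f in range(len(grids))]
            for row in oa]
```

With the five-level grids of Section III, these 25 rows become the candidate patterns p_t evaluated in Step 2 of Figure 1.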
D. Gradient Ascent Learning

After the extreme point optimization of the surrogate model, a candidate point p* is obtained. It is then scored as

$$ y^* = \mathrm{fitness}_{new}(p^*) \qquad (12) $$

via the reward function (5) after one evaluation, so that another sample pair d_new is obtained:

$$ d_{new} = (p^*, y^*). \qquad (13) $$
Following the idea of sequential surrogate optimization [16], d_new would be added to the previous sample set D to form a new set D_new, from which the next surrogate model would be constructed. However, since only a single point is added, the new surrogate model usually changes little from the previous one, and a distinct update of the surrogate would require considerably more evaluations. Instead of sequential surrogate optimization, this paper therefore applies a form of gradient policy learning right after the surrogate optimization. First, the best-behaved pattern p_best, corresponding to the sample pair

$$ d_{best} = (p_{best}, y_{best}), \qquad (14) $$

is identified in D according to

$$ y_{best} = \max_{t=1,\ldots,T} y_t, \quad (p_t, y_t) \in D. \qquad (15) $$
Then, based on d_new and d_best, a gradient direction can be obtained; this procedure is denoted Direction(d_new, D). Following this direction, a random walking pattern p_init is selected and evaluated, and the final walking pattern p_final is reached once the gradient ascent algorithm converges. It is worth mentioning that p* belongs to P; that is, in gradient ascent learning the search space is still the continuous space P, not the temporary discretized space P_d. In other words, discretizing the continuous space P for active point collection introduces no negative influence. The gradient ascent learning not only reduces the number of evaluations, but also refines the output of the surrogate optimization, which is built on the efficient but somewhat rough polynomial response surface model.

To summarize the learning process of the proposed framework, its pseudo-code is laid out in Figure 1.

Input: orthogonal arrays, T
Step 1: p_t, t = 1, 2, ..., T  ← Orthogonal DOE based active learning for walking pattern selection
Step 2: D = {(p_t, y_t)}, t = 1, 2, ..., T  ← Sample pair collection via evaluations
Step 3: SM_poly(p)  ← LMS algorithm based surrogate model construction
Step 4: p* = argmax SM_poly(p)  ← Surrogate model SM_poly(p) optimization
Step 5: d_new = (p*, y*)  ← Evaluation of pattern p*
Step 6: Direction(d_new, D)  ← First gradient direction determination
Step 7: p_final  ← Gradient ascent learning until converged
Output: p_final

Figure 1. Pseudo-code of the proposed online learning framework.
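The whole procedure of Figure 1 can be sketched as a short driver. Every callable here (the robot-side evaluate, the DOE design, the surrogate fit/maximize, and the gradient ascent refinement) is a placeholder for the components described above; the names are illustrative, not the paper's API.

```python
def active_online_learning(evaluate, design, fit, maximize, ascend):
    """Skeleton of the Figure 1 pseudo-code (names are illustrative).

    evaluate(p): run one trial on the robot, return fitness_new(p)
    design():    yield the orthogonal-DOE walking patterns (Step 1)
    fit(P, Y):   build SM_poly from D by LMS (Step 3)
    maximize(m): extreme point search on the surrogate (Step 4)
    ascend(d_new, d_best): gradient ascent until converged (Steps 6-7)
    """
    D = [(p, evaluate(p)) for p in design()]            # Steps 1-2
    model = fit([p for p, _ in D], [y for _, y in D])   # Step 3
    p_star = maximize(model)                            # Step 4
    d_new = (p_star, evaluate(p_star))                  # Step 5
    d_best = max(D, key=lambda d: d[1])                 # eq. (15)
    return ascend(d_new, d_best)                        # Steps 6-7
```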
III. EXPERIMENTS AND RESULTS
A. The PKU-HR3 Robot Platform

PKU-HR3 is the third-generation humanoid robot developed at Peking University, China; it weighs 3.51 kg and is 56 cm tall. It has 21 DOF (degrees of freedom): 6 for the hips, 1 for each knee, 2 for each ankle, 1 for the torso, 2 for the head, and 3 for each arm, actuated by 3 ROBOTIS RX-64 and 18 ROBOTIS RX-28 servos. All servos are connected to the controller via an RS485 bus running at 1 Mbps. A PC-104 board, mounted on the back of PKU-HR3 and equipped with an inertial sensor, is the main control system. The inertial sensor contains a 3-axis gyroscope and a 3-axis accelerometer, which are used to measure stability in this research. The appearance and design structure of PKU-HR3, with measurements in mm, are shown in Figure 2.
The walking pattern to be learned is derived from the following four factors:

− n_step_ph: the number of sub-actions in each step. Usually, the larger it is, the smoother and more stable, but also the slower, the robot's walking will be.
− h_COM: the height of the robot's COM.
− h_foot: the maximal stepping height of the foot above the ground surface.
− w_feet: the width between the two feet during forward walking.

The range of the walking pattern space P for these four factors is shown in Table I.

TABLE I. WALKING PATTERN SPACE

Factors   n_step_ph    h_COM            h_foot           w_feet
Space     [20, 100]    [0.270, 0.320]   [0.020, 0.060]   [0.020, 0.080]

Figure 2. Appearance and design structure of PKU-HR3.
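For reference, the five-level discretization of these ranges used in subsection B below might be built as follows; the paper does not state the grid spacing, so uniform spacing is an assumption.

```python
import numpy as np

# Five-level grids over the Table I ranges (uniform spacing assumed).
levels = {
    "n_step_ph": np.linspace(20, 100, 5),        # 20, 40, 60, 80, 100
    "h_COM":     np.linspace(0.270, 0.320, 5),
    "h_foot":    np.linspace(0.020, 0.060, 5),
    "w_feet":    np.linspace(0.020, 0.080, 5),
}
```

Feeding these grids to a design_points helper like the one sketched in Section II.C yields the 25 candidate patterns of the L25 design.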
The low-level control of PKU-HR3 is realized by a walking engine developed under the trajectory-based method, where each step is divided into several sub-actions. Given a walking pattern, the walking engine generates two trajectories. One is for the Centre of Mass (COM) and is generated by the Inverted Pendulum Model as in [24]. The other is for the swinging leg and is defined as in [12] for both the up-down (z) and fore-and-aft (x) directions. From these trajectories, gated by the learned walking pattern p_final, the sequences of joint angles corresponding to the sub-actions of each step are finally obtained via inverse kinematics.
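The COM trajectory of the linear inverted pendulum admits a closed form; the sketch below shows it for one step, with the learned h_COM as the constant pendulum height. This is the textbook model under stated assumptions, not necessarily the exact formulation of the engine in [24], and the step duration and initial state below are hypothetical.

```python
import numpy as np

def lipm_com(x0, v0, z_c, t, g=9.81):
    """Closed-form COM motion of the linear inverted pendulum:
    x(t) = x0*cosh(t/Tc) + v0*Tc*sinh(t/Tc), with Tc = sqrt(z_c / g);
    z_c corresponds to the learned h_COM factor."""
    Tc = np.sqrt(z_c / g)
    return x0 * np.cosh(t / Tc) + v0 * Tc * np.sinh(t / Tc)

# e.g. COM positions at the n_step_ph sub-action instants of one step
ts = np.linspace(0.0, 0.4, 65)   # 65 sub-actions over a 0.4 s step
xs = lipm_com(x0=-0.01, v0=0.15, z_c=0.32, t=ts)
```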
B. Experimental Setup

The experimental environments, with three different ground surfaces (wood floor, carpet, and ordinary floor), are shown in Figure 3. Apart from PKU-HR3, each environment also contains a PC laptop, a wireless router, an external power supply, etc. All learning processes are executed by the PC-104 on board PKU-HR3, while the laptop is used only as a monitor that logs in to the robot remotely via the wireless router. All joint motors are driven by the external power supply to guarantee constant external conditions during the whole learning process; the battery is kept on board only to preserve the robot's weight. At the beginning of each experiment, the robot is placed at a fixed location, and the learning process is started from the remotely logged-in laptop; thereafter, the learning runs under the control of the PKU-HR3 robot itself until convergence.

(a) Wood floor. (b) Carpet surface. (c) Ordinary floor.
Figure 3. Experimental environments under three different ground surfaces.

The weights α and β of the modified reward function in (5) and (4) are set to 0.8 and 0.5, respectively, and each evaluation of a walking pattern p lasts 10 seconds, strictly timed by the PC-104 clock of PKU-HR3. Following the discussion in Section II.C, each factor of the continuous walking pattern space P is discretized into five levels, yielding a new temporary pattern space P_d. The task is then a statistical multivariate analysis with 4 five-level factors, for which an $L_{25}(5^4)$ Taguchi orthogonal array is obtained; from it, walking pattern points can be selected actively and easily. The array $L_{25}(5^4)$, the points p_t (t = 1, ..., 25), and D are omitted here for space.

C. Results and Discussion

1) Overall Performance

The overall performance is shown in Figure 4, where, to explore robustness, three experimental results corresponding to the three different walking ground surfaces are illustrated.

Figure 4. Overall performance under three different ground surfaces.
From Figure 4, it can be seen that all result curves converge consistently at around 30 evaluations with high fitness scores, outperforming traditional methods. As reported in the literature, sequential surrogate optimization requires about 50 evaluations [16], Gaussian process regression about 120 evaluations [15], and hill climbing and policy gradient about 200 evaluations [14]. The maximal stable walking speed reaches about 30 cm/s, which is comparable to the state-of-the-art walking speed for a 56 cm tall humanoid robot. Admittedly, because of differences in robot platforms and in the dimension and type of the walking pattern, the above comparison can serve only as a reference. Still, the fast convergence and good output demonstrate that the proposed approach is a promising alternative.

To further examine the properties of the proposed approach, the stability of the walking patterns over the whole learning process is observed, as shown in Figure 5. The three curves correspond to the three walking surfaces, and the black solid square on each curve marks the stability measure of the pattern point at which the corresponding learning run converged. From Figure 5, it can be seen that the three converged stability values consistently sit at a level apparently higher than their respective averages. This means the output achieved by the proposed learning is stable, and further indicates that the stability-aware modification of the traditional fitness function works well.
Figure 5. Stability over the learning process under three ground surfaces.
The rapid convergence, superior maximal walking speed, and favorable stability of the proposed active online learning framework establish it as a successful and promising alternative for bipedal walking pattern learning.

2) Comparison with Hand-tuning

To compare the behavior of the proposed approach with the hand-tuning strategy, the output patterns and the performance in both walking speed and stability are listed in Table II.

TABLE II. PARAMETER VALUES BEFORE AND AFTER LEARNING

Method        n_step_ph   h_COM   h_foot   w_feet   Fitness   Stability
Hand-tuning   58          0.285   0.032    0.038    0.5780    -0.5497
Our method    65          0.320   0.027    0.066    0.8390    -0.4385
From Table II, it can be seen that the two output patterns, expressed in columns 2 to 5, are clearly different. The walking speed achieved by the proposed approach outperforms that of hand tuning, as column 6 shows; meanwhile, the output of our method is more stable, as the last column shows. This indicates that the proposed learning framework can successfully substitute for the time-consuming hand-tuning method.

3) Comparison with Hill Climbing Learning

As discussed before, learning algorithms used in simulation-based methods could in theory be applied directly to a real robot for online walking pattern learning. Here, hill climbing, a typical algorithm in simulation-based methods, is selected to construct an online learner, so that a deeper comparison can be made on the same PKU-HR3 platform. Since hill climbing is sensitive to its initial point, two experiments are performed, initialized with a random pattern and with the hand-tuned pattern (as in Table II), respectively. The performance comparison is shown in Figure 6.

Figure 6. Performance comparison with hill climbing based online learning.

From Figure 6, it can be seen that, while our method converges at about 30 evaluations, both hill climbing learning processes, under either initialization, are still searching. This agrees with the known result that hill climbing usually converges at around 200 evaluations. It can also be seen that learning with hand-tuned initialization performs better than learning with random initialization, while the rising trend of the latter is more visible than that of the former, even though both rise only slightly. To avoid further mechanical wear, we had to stop the hill climbing based online learning at 50 evaluations. Although the maximal fitness values had increased slightly by then, they were still below the score obtained by the proposed approach, which converged at about 30 evaluations.
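For reference, the hill climbing baseline might look like the minimal sketch below; the paper does not describe its perturbation scheme or step sizes, so those details are assumptions.

```python
import numpy as np

def hill_climb(evaluate, p0, step, bounds, n_evals=50, seed=0):
    """Simple stochastic hill climbing over the walking pattern space:
    perturb the incumbent, keep the candidate only if it scores higher.
    `step` is a per-factor perturbation scale (an assumption)."""
    rng = np.random.default_rng(seed)
    best_p = np.asarray(p0, dtype=float)
    best_y = evaluate(best_p)
    for _ in range(n_evals - 1):
        cand = best_p + rng.uniform(-1.0, 1.0, best_p.shape) * step
        cand = np.clip(cand, bounds[:, 0], bounds[:, 1])
        y = evaluate(cand)
        if y > best_y:
            best_p, best_y = cand, y
    return best_p, best_y
```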
IV. CONCLUSION AND FUTURE WORK
In this paper, a novel framework for actively learning bipedal walking online on a real robot is proposed. From an analysis of this learning task, two necessary conditions were identified: the learning must be efficient enough in time complexity that the process can run in a real-time online style, and convergence must be rapid enough to reduce the cost of repeated mechanical wear. The proposed approach satisfies both requirements, and, as the experimental results show, a better walking pattern was successfully learned on the PKU-HR3 humanoid robot platform.
The research leads to the following outcomes. First, the polynomial response surface surrogate model addresses time complexity in both model construction and model optimization, making it feasible to run the learning process on a real physical robot in a real-time online style. Second, the orthogonal DOE based active learning strategy is significant in accelerating convergence: it effectively reduces the number of evaluations compared with a full factorial test, and hence greatly saves the expensive cost of repeated mechanical wear. Third, the gradient ascent step applied right after the active-learning-based surrogate optimization proves preferable for refining the output of a surrogate optimization built on the efficient but somewhat rough polynomial response surface; it further improves the convergence speed by reducing the number of evaluations compared with sequential surrogate optimization or stochastic search. Furthermore, the modified reward function yields a better output walking pattern, since stability as well as walking speed is taken into account in the learning objective.

The proposed framework leaves room for future research in several respects. As the dimension of the pattern space increases, how to perform the discretization and the DOE remain open problems. More precise yet sufficiently efficient surrogate models will be a continuing topic. The gradient ascent algorithm employed in this research is quite simple, so recently developed machine learning algorithms with superior performance, such as Natural Actor-Critic [25] and PI2 [26], are worth investigating and may bring new insight. In addition, automatically setting the parameters of the presented fitness function, and exploring more suitable feedback or statistics to derive a more appropriate fitness function, are also part of our ongoing work.

ACKNOWLEDGMENT

The work was supported in part by the National Natural Science Foundation of China (No. 90920302), an HGJ Grant of China (No. 2011ZX01042-001-001), and a research program from Microsoft China. The authors greatly appreciate Professor Huisheng Chi for his helpful and beneficial suggestions, and the anonymous reviewers for their valuable comments.

REFERENCES
[1] M. Vukobratovic and D. Juricic, "Contribution to the synthesis of biped gait," IEEE Transactions on Biomedical Engineering, vol. 16, no. 1, pp. 1-6, 1969.
[2] M. Vukobratovic and B. Borovac, "Zero-moment point - thirty five years of its life," International Journal of Humanoid Robotics, vol. 1, no. 1, pp. 157-173, 2004.
[3] H. Hemami, F. Weimer, and S. Koozekanani, "Some aspects of the inverted pendulum problem for modeling of locomotion systems," IEEE Transactions on Automatic Control, vol. 18, no. 6, pp. 658-661, 1973.
[4] G. Taga, Y. Yamaguchi, and H. Shimizu, "Self-organized control of bipedal locomotion by neural oscillators in unpredictable environment," Biological Cybernetics, vol. 65, no. 3, pp. 147-159, 1991.
[5] G. Taga, "A model of the neuro-musculo-skeletal system for anticipatory adjustment of human locomotion during obstacle avoidance," Biological Cybernetics, vol. 78, no. 1, pp. 9-17, 1998.
[6] J. Pratt, P. Dilworth, and G. Pratt, "Virtual model control of a bipedal walking robot," IEEE International Conference on Robotics and Automation, vol. 1, pp. 193-198, 1997.
[7] B. Hengst, D. Ibbotson, S. B. Pham, and C. Sammut, "Omnidirectional locomotion for quadruped robots," RoboCup 2001: Robot Soccer World Cup V, Springer, pp. 368-373, 2002.
[8] M. S. Kim and W. Uther, "Automatic gait optimization for quadruped robots," Australasian Conference on Robotics and Automation, pp. 1-9, 2003.
[9] M. Hebbel, W. Nistico, and D. Fisseler, "Learning in a high dimensional space: fast omnidirectional quadrupedal locomotion," Lecture Notes in Computer Science, vol. 4434, pp. 314-321, 2007.
[10] M. J. Quinlan, S. K. Chalup, and R. H. Middleton, "Techniques for improving vision and locomotion on the Sony AIBO robot," 2003 Australasian Conference on Robotics and Automation, 2003.
[11] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," IEEE International Conference on Robotics and Automation, pp. 2619-2624, 2004.
[12] C. Niehaus, T. Röfer, and T. Laue, "Gait-optimization on a humanoid robot using particle swarm optimization," The 2nd Workshop on Humanoid Soccer Robots at the IEEE-RAS International Conference on Humanoid Robots, 2007.
[13] F. Faber and S. Behnke, "Stochastic optimization of bipedal walking using gyro feedback and phase resetting," IEEE-RAS International Conference on Humanoid Robots, pp. 203-209, 2007.
[14] N. Kohl and P. Stone, "Machine learning for fast quadrupedal locomotion," The Nineteenth National Conference on Artificial Intelligence, pp. 611-616, 2004.
[15] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans, "Automatic gait optimization with Gaussian process regression," International Joint Conference on Artificial Intelligence, pp. 944-949, 2007.
[16] T. Hemker, M. Stelzer, O. von Stryk, and H. Sakamoto, "Efficient walking speed optimization of a humanoid robot," International Journal of Robotics Research, vol. 28, no. 2, pp. 303-314, 2009.
[17] H. Wang, S. P. Wang, and M. M. Tomovic, "Modified sequential Kriging optimization for multidisciplinary complex product simulation," Chinese Journal of Aeronautics, vol. 23, no. 5, pp. 616-622, 2010.
[18] P. Gill, W. Murray, and M. Saunders, "User's guide for SNOPT 7.1: a Fortran package for large-scale nonlinear programming," Report NA 052, Department of Mathematics, University of California at San Diego, 2006.
[19] L. G. Fonseca, H. J. C. Barbosa, and A. C. C. Lemonge, "A similarity-based surrogate model for expensive evolutionary optimization with fixed budget of simulations," The 11th Congress on Evolutionary Computation, pp. 867-874, 2009.
[20] A. F. Hernandez and M. G. Gallivan, "An exploratory study of discrete time state-space models using kriging," American Control Conference, pp. 3993-3998, 2008.
[21] T. Goel, R. Vaidyanathan, R. T. Haftka, et al., "Response surface approximation of Pareto optimal front in multi-objective optimization," Computer Methods in Applied Mechanics and Engineering, vol. 196, no. 4, pp. 879-893, 2007.
[22] I. N. Vuchkov and L. N. Boyadjieva, Quality Improvement with Design of Experiments: A Response Surface Approach, Kluwer Academic Publishers, Dordrecht, 2001.
[23] D. C. Montgomery, Design and Analysis of Experiments, 6th ed., John Wiley & Sons, 2005.
[24] M. Friedmann, J. Kiener, S. Petters, et al., "Versatile, high-quality motions and behavior control of humanoid soccer robots," IEEE-RAS International Conference on Humanoid Robots, pp. 9-16, 2006.
[25] J. Peters and S. Schaal, "Natural Actor-Critic," Neurocomputing, vol. 71, no. 7-9, pp. 1180-1190, 2008.
[26] E. Theodorou, J. Buchli, and S. Schaal, "Learning policy improvements with path integrals," International Conference on Artificial Intelligence and Statistics, pp. 828-835, 2010.