International Journal of Digital Content Technology and its Applications Volume 4, Number 9, December 2010

Using Multivariate Regression Methods to Resolve Overlapped Electrochemical Signals

*1 ZHU Xin-feng, 2 WANG Jian-dong, 3 LI Bin
*1 College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, [email protected]
2 College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, [email protected]
3 College of Information Technology, Yangzhou University, [email protected]

doi:10.4156/jdcta.vol4.issue9.15

Abstract
This paper proposes Gaussian process regression (GPR) as an alternative regression model for resolving the severely overlapped electrochemical signals of the 2,4,6-trichlorophenol/2,6-dichlorophenol (TCP/DCP) system. Gaussian processes derive from Bayesian non-parametric regression; the parameterization of the covariance function allows them to perform well in developing calibration models for both linear and non-linear data sets. The multivariate regression model developed by GPR was compared with traditional regression methods, namely partial least squares regression (PLSR) and support vector regression (SVR). On this data set GPR gave the lowest prediction errors of the three methods, which suggests that it is an effective and promising tool for multivariate regression tasks.

Keywords: Multicomponent, Overlapped Electrochemical Signals, Multivariate Regression, PLSR, SVR, GPR

1. Introduction
Chlorophenols (CPs) are common effluents of the agrochemical, pulp and paper, pharmaceutical, and dyestuff industries. They are known to cause serious health problems in humans, and they are therefore included in the list of priority pollutants of the Environmental Protection Agency (EPA). Their detection is an important environmental issue. Many methods have been applied for this purpose, such as mass spectrometry [1] and chemiluminescence reactions [2]. However, most of these methods focus on the detection of a single kind of CP, possibly because different CPs interfere with one another. Complex products often contain sets of chlorophenols rather than a single substance, and each compound possesses unique physical and chemical properties. Consequently, the quantitative analysis of multicomponent mixtures of chlorophenols is a complicated analytical task of significant practical importance. The most common methods for determining mixtures of CPs are different types of chromatography [3], but they remain expensive and procedurally complicated.
Over the past decades, electrochemical methods have evolved tremendously and have become a major analytical technique in both industry and many research disciplines because of their simplicity, rapidity and sensitivity. Many electrochemical methods have been used for the detection of a single kind of CP [4][5]. One of the main limitations of electroanalytical techniques in quantitative analysis is their lack of selectivity when signals overlap severely. It is therefore important to develop a reliable, accurate and accessible tool for the determination of CPs in multicomponent mixtures using electrochemical methods. Modern chemometric approaches to the resolution of overlapping electrochemical signals of complex mixtures offer such a possibility.
Among the chemometric techniques used for the simultaneous evaluation of overlapped signals, multivariate calibration [6][7] is widely used. Multivariate calibration has been applied successfully with different regression methods: multiple linear regression (MLR) [8], principal component regression (PCR) [9][10], partial least squares regression (PLSR) [9][10][11] and support vector regression (SVR) [12][13]. MLR, one of the most commonly used chemometric methods, is of limited use for overlapping signals because it provides relatively poor accuracy. PLSR, which combines dimension reduction with linear regression, performs poorly on nonlinear relations. SVR can model nonlinear relations and has been applied successfully in various fields, but it lacks a fully satisfactory probabilistic interpretation. Gaussian process models have been widely applied because of their good practical performance and desirable analytical properties, but to date they have rarely been used in the chemometric community. Here, we demonstrate the feasibility and practical performance of a Gaussian process regression (GPR) method for the analysis of mixtures containing two kinds of CPs, 2,4,6-trichlorophenol (TCP) and 2,6-dichlorophenol (DCP), referred to as the TCP/DCP system. GPR is compared with two widely used regression methods, PLSR and SVR, with the aim of predicting the concentrations of TCP and DCP from different parameters of the voltammetric peaks: position, height, half width, derivative and area.

2. Fundamentals of Regression
2.1. Multilinear Regression (MLR)
Traditional regression techniques are based on multilinear regression. Consider the case where a regression model is to be developed from a set of n samples of analyte. Let $x_i$ denote the measurements of the i-th sample, a vector of the p measured parameters; the n samples give the input matrix $X = (x_1, \dots, x_n)^{T}$. Likewise, let $y_i$ be a q-dimensional vector, where q is the number of outputs, giving the output matrix $Y = (y_1, \dots, y_n)^{T}$. The regression task is to build a multivariate regression model of the form $Y = f(X)$. For linear regression, specifically,

$$Y = XB + E \tag{1}$$

where B is a matrix of regression coefficients and E is the residual matrix. This linear regression model satisfies the Beer-Lambert law [14], and the regression coefficients are calculated as

$$B = (X^{T}X)^{-1}X^{T}Y \tag{2}$$

MLR is a direct and simple algorithm, but if multicollinearity exists between the columns of X, the matrix $X^{T}X$ in equation (2) becomes singular or nearly singular and MLR leads to an ill-posed inverse problem, which makes it infeasible in practice. MLR also cannot handle nonlinear regression problems. Although MLR is of limited use in many domains, it helps in understanding enhanced linear regression methods such as PLSR.
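As an illustration of equation (2) and of the multicollinearity problem, the following minimal sketch solves the normal equations in plain Python. The helper names (`transpose`, `matmul`, `solve`, `mlr_fit`), the tolerance `tol` and the small data set are assumptions made for this example; they are not taken from the software used in this work.

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b, tol=1e-12):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            # A vanishing pivot means X^T X cannot be inverted.
            raise ValueError("X^T X is singular: columns of X are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def mlr_fit(X, Y):
    """Regression coefficients B = (X^T X)^{-1} X^T Y of equation (2)."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, Y))


# Illustrative data (not from Table 1): n = 4 samples, p = 2 inputs, q = 2 outputs.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
B_true = [[0.5, 0.1], [0.2, 0.3]]
Y = matmul(X, B_true)  # noise-free responses, so E = 0 in equation (1)
B = mlr_fit(X, Y)      # recovers B_true up to rounding
```

If the second column of X is an exact multiple of the first, `solve` raises `ValueError`, which is the ill-posed case described above.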

2.2. PLSR
PLSR is one of the most common ways of combining dimension reduction with linear regression. It generalizes and combines features of principal component analysis and multiple regression. It is useful when X is likely to be singular and ordinary regression is no longer feasible because of multicollinearity. PLSR decomposes both X and Y as a product of a common set of orthogonal factors and a set of specific loadings. An obvious question is how many latent variables are needed to obtain the best generalization for the prediction of new observations. This number is generally chosen by resampling techniques such as cross-validation or bootstrapping.

2.3. SVR
Support vector machines (SVMs) were initially developed by Vapnik [15][16]. For an in-depth theoretical background on SVMs for both classification and regression, the reader is referred to Refs [17][18][19]. The idea of SVR is to compute a linear regression function in a high-dimensional feature space into which the input data are mapped via a nonlinear function.

2.4. GPR
Equation (1) can be generalized to a regression model that is non-linear with respect to the predictors x but still linear with respect to M basis functions $\phi_m$:

$$y_i = \sum_{m=1}^{M} \phi_m(x_i)\, b_m + e_i \tag{3}$$

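The key assumption in GP modeling is that the data can be represented as a sample from a multivariate Gaussian distribution. Let y be the vector of training responses and $y_*$ the response at a new input, and let $K$, $K_*$ and $K_{**}$ be the corresponding covariance matrices (training-training, new-training and new-new). Then

$$\begin{bmatrix} y_* \\ y \end{bmatrix} \sim N\!\left(0, \begin{bmatrix} K_{**} & K_* \\ K_*^{T} & K \end{bmatrix}\right) \tag{4}$$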

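The conditional probability $p(y_* \mid y)$ follows a Gaussian distribution:

$$y_* \mid y \sim N\!\left(K_* K^{-1} y,\; K_{**} - K_* K^{-1} K_*^{T}\right) \tag{5}$$

The best estimate for $y_*$ is the mean of this distribution,

$$E(y_*) = K_* K^{-1} y \tag{6}$$

and the uncertainty in the estimate is captured by its variance,

$$\mathrm{var}(y_*) = K_{**} - K_* K^{-1} K_*^{T} \tag{7}$$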
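The predictive mean and variance above can be computed directly from the covariance matrices. The sketch below does so in plain Python for a single new input. It assumes a squared-exponential covariance function with hypothetical hyperparameters `length` and `signal`, and adds a small `noise` term to the diagonal of K for numerical stability; none of these choices is stated in the text, and the code is not the gpml-matlab implementation.

```python
import math


def rbf(a, b, length=1.0, signal=1.0):
    """Squared-exponential covariance between two input vectors (assumed kernel)."""
    d2 = sum((u - v) ** 2 for u, v in zip(a, b))
    return signal ** 2 * math.exp(-0.5 * d2 / length ** 2)


def cholesky(a):
    """Lower-triangular factor L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(L)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(X, y, x_star, noise=1e-6, length=1.0, signal=1.0):
    """Return the predictive mean K* K^-1 y and variance K** - K* K^-1 K*^T."""
    n = len(X)
    K = [[rbf(X[i], X[j], length, signal) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x_star, X[i], length, signal) for i in range(n)]
    k_ss = rbf(x_star, x_star, length, signal)
    L = cholesky(K)
    alpha = chol_solve(L, y)       # K^-1 y
    v = chol_solve(L, k_star)      # K^-1 K*^T
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = k_ss - sum(k * w for k, w in zip(k_star, v))
    return mean, var


# Illustrative one-dimensional data (not from Table 1).
X_train = [[0.0], [1.0], [2.0]]
y_train = [0.0, 1.0, 0.5]
mean_near, var_near = gp_predict(X_train, y_train, [1.0])   # at a training input
mean_far, var_far = gp_predict(X_train, y_train, [10.0])    # far from all data
```

At a training input the variance shrinks to roughly the noise level, while far from the data the prediction falls back to the zero prior mean with the full prior variance, which is the behaviour the uncertainty estimate is meant to express.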

3. Experimental
3.1. Materials and Instruments
2,4,6-Trichlorophenol and 2,6-dichlorophenol were purchased from Yancheng Huaye Pharmaceutical & Chemical Factory (Yancheng, China). All other chemicals were of analytical grade and were used without further purification. The phosphate buffer solution (PBS) was prepared from 0.1 M NaH2PO4-Na2HPO4 solution. Double-distilled water was used for the preparation of all solutions. All electrochemical experiments were performed on a CHI 760D electrochemical workstation (Chenhua Instruments, Shanghai, China) in pH 3.0 PBS. The three-electrode system consisted of a glassy carbon electrode (GCE; d = 3 mm, Shanghai Chenhua, China) as the working electrode, a saturated calomel electrode as the reference electrode and a platinum disk electrode (d = 1 mm, Tianjing Lanlike, China) as the counter electrode.

3.2. DataSet
The peak parameters selected for the TCP/DCP experimental system are as follows:
Peak potential (V): potential at which the peak maximum occurs;
Peak intensity (I): maximum current (intensity) with respect to the baseline;
Peak area (S): area of the peak corrected for the baseline (arbitrary units);
Half width (W): half width of the peak;
Derivative (D): the difference between the peak values of the maximum and the minimum in the first derivative of the peak (arbitrary units);
Half-peak potential (Eh): the half-peak electrode potential.
The performance of each calibration model was tested by the root mean square error (RMSE), estimated on two different test sets for all cases. It is defined by the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i} (y_i - y^{*}_i)^{2}}{n}} \tag{8}$$

where $y_i$ is the measured concentration of the analyte, $y^{*}_i$ is the predicted concentration, and n is the number of samples.
Figure 1 shows the overlapping voltammograms of the two compounds and their mixture. The peak potentials (Ep) of the individual analytes lie within a few mV of each other, which implies very severe overlapping of the individual signals, as the single peak in the voltammogram of the mixture shows. The voltammetric peak parameters V, I, S, W, D and Eh, together with the concentrations of TCP and DCP in all the mixtures studied in this work, are shown in Table 1.
A common stochastic strategy is adopted in which the random partitioning into training and test sets is repeated several times. The calibration methods are evaluated on each partition, and the average RMSE is used for comparison. We selected 42 rows for training and 6 rows for prediction. Figures 3, 4 and 5 show the results of the PLSR, SVR and GPR methods on the prediction data.

Table 1. Values of the voltammetric peak parameters used in the present work
Sample  TCP  DCP  D        I       S       W      V      Eh
T1      0.1  0    8.606    3.184   3.447   0.048  0.744  0.696
T2      0.2  0    18.793   6.85    7.559   0.048  0.736  0.688
T3      0.3  0    28.45    10.29   11.39   0.048  0.736  0.688
T4      0.4  0    36.38    13.07   14.43   0.048  0.736  0.688
T5      0.5  0    44.44    16.33   18.01   0.044  0.732  0.688
T6      0.6  0    51.52    18.56   21.55   0.044  0.732  0.688
T7      0.7  0    62.42    22.74   25.4    0.048  0.732  0.684
T8      0.8  0    70.88    25.76   28.63   0.048  0.732  0.684
T9      0.9  0    79.82    29.46   32.02   0.048  0.732  0.684
T10     1    0    88.8     32.59   36.48   0.048  0.732  0.684
D1      0    0.1  13.47    3.37    3.008   0.04   0.752  0.712
D2      0    0.2  18.369   5.594   5.448   0.04   0.752  0.712
D3      0    0.3  24.29    8.06    7.929   0.04   0.752  0.712
D4      0    0.4  28.71    9.954   9.647   0.044  0.756  0.712
D5      0    0.5  35.44    12.28   12.57   0.04   0.752  0.712
D6      0    0.6  39.8     14.05   15.14   0.044  0.756  0.712
D7      0    0.7  45.72    16.21   17.8    0.044  0.756  0.712
D8      0    0.8  49.68    18.77   20.2    0.044  0.756  0.712
D9      0    0.9  55.03    21.14   22.46   0.044  0.756  0.712
D10     0    1    59.17    23.37   25.51   0.052  0.764  0.712
T1D1    0.1  0.1  19.314   5.551   6.186   0.044  0.748  0.704
T6D1    0.6  0.1  45.12    17.35   20.95   0.052  0.728  0.676
T2D2    0.2  0.2  34.14    12.17   13.36   0.044  0.744  0.7
T7D2    0.7  0.2  62.81    23.63   27.54   0.048  0.732  0.684
T3D3    0.3  0.3  51.98    18.72   20.54   0.044  0.744  0.7
T8D3    0.8  0.3  80.52    29.8    34.14   0.048  0.736  0.688
T4D4    0.4  0.4  67.3     24.3    26.76   0.048  0.744  0.696
T9D4    0.9  0.4  101.51   37.66   43.15   0.048  0.736  0.688
T5D5    0.5  0.5  80.5     29.09   32.15   0.048  0.744  0.696
T10D5   1    0.5  119.96   44.63   51.18   0.052  0.74   0.688
T1D6    0.1  0.6  54.37    18.45   18.85   0.044  0.744  0.7
T6D6    0.6  0.6  92.87    34.76   40.67   0.048  0.744  0.696
T2D7    0.2  0.7  65.19    22.73   23.77   0.048  0.748  0.7
T7D7    0.7  0.7  109.37   40.52   45.67   0.048  0.744  0.696
T3D8    0.3  0.8  75.67    26.77   28.44   0.048  0.748  0.7
T8D8    0.8  0.8  122.55   45.72   52.17   0.048  0.744  0.696
T4D9    0.4  0.9  88.04    31.77   34.39   0.048  0.748  0.7
T9D9    0.9  0.9  136.8    52.68   59.72   0.048  0.744  0.696
T5D10   0.5  1    99.95    36.77   40.66   0.048  0.748  0.7
T10D10  1    1    151.3    58.23   68.52   0.052  0.744  0.692
T1D9    0.1  0.9  74.73    25.92   27.97   0.044  0.752  0.708
T9D1    0.9  0.1  60.17    23.97   30.51   0.052  0.708  0.656
T3D5    0.3  0.5  50.75    19.48   23.24   0.048  0.744  0.696
T5D3    0.5  0.3  32.2     12.18   14.39   0.048  0.72   0.672
T6D8    0.6  0.8  84.85    32.57   38.25   0.052  0.744  0.692
T8D6    0.8  0.6  86.12    32.23   37.12   0.048  0.74   0.692
T10D2   1    0.2  67.18    26.04   32      0.048  0.732  0.684
T2D10   0.2  1    78.13    28.34   31.6    0.048  0.736  0.688
V = peak potential (V); I = peak intensity (10^-7 A); S = area of the peaks (×10^-8 cm2); W = half width (V); D = derivative (×10^-6 cm2); in sample names, T: 2,4,6-TCP; D: 2,6-DCP; 1, 2, …, 10 = 0.1, 0.2, …, 1.0 μM
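The evaluation protocol of this section, that is the RMSE of equation (8) averaged over repeated random partitions into 42 training rows and 6 test rows, can be sketched as follows. The function names, the number of repeats, the seed and the `fit_predict` callback are assumptions made for illustration; the text says only that the partitioning is repeated several times.

```python
import math
import random


def rmse(measured, predicted):
    """Root mean square error of equation (8)."""
    n = len(measured)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(measured, predicted)) / n)


def random_partition(n_samples, n_test, rng):
    """Shuffle the row indices and split them into (train, test) index lists."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return idx[n_test:], idx[:n_test]


def average_rmse(X, y, fit_predict, n_test=6, repeats=10, seed=0):
    """Average test RMSE over repeated random train/test partitions.

    fit_predict(X_train, y_train, X_test) is a hypothetical callback that
    trains any calibration model and returns predictions for X_test.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = random_partition(len(X), n_test, rng)
        pred = fit_predict([X[i] for i in train],
                           [y[i] for i in train],
                           [X[i] for i in test])
        scores.append(rmse([y[i] for i in test], pred))
    return sum(scores) / len(scores)


def mean_model(X_train, y_train, X_test):
    """Trivial baseline: always predict the mean training concentration."""
    mu = sum(y_train) / len(y_train)
    return [mu for _ in X_test]
```

With the 48 rows of Table 1, `random_partition(48, 6, rng)` yields the 42/6 split used here, and a baseline such as `mean_model` gives a reference value against which the RMSE of a real calibration model can be judged.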


Figure 1. Differential pulse voltammograms of 2.0 μM DCP (a), 2.0 μM TCP (b) and the mixture of 2.0 μM DCP and 2.0 μM TCP in pH 2.0 solution.

3.3. PLSR
PLSR calculations were performed using iToolbox, which provides interval PLS (iPLS), backward interval PLS, moving window PLS and related methods. The PLS model was built using the iPLS Toolbox, and iPLS was used to train on and predict our experimental data. Figure 2 shows that there is little difference between 4 and 5 PLS components, so the number of significant components (latent variables, LVs) chosen for our model was 4. The predicted versus measured plot for the test data using the PLSR method is shown in Figure 3.

Figure 2. RMSECV of different latent variables

Figure 3. Predicted vs. measured plot for PLSR (RMSE-TCP = 0.0913, RMSE-DCP = 0.1695)


3.4. SVR
SVR calculations were performed using libSVM. Scaling the data before applying SVM is very important: it prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges, and it avoids numerical difficulties during the calculation. We linearly scaled each attribute of the training data to the range [-1, +1] and used the same scaling factors to scale the test data. In general, the RBF kernel is a reasonable first choice, and we chose it as the kernel function. Two parameters must be set when using an RBF kernel: C and γ. To find good values of C and γ, we used cross-validation. In v-fold cross-validation, the training set is divided into v subsets of equal size, and each subset in turn is tested using the model trained on the remaining v-1 subsets. In our work, we used 5-fold cross-validation to select the hyper-parameters. Fine selection of the kernel and hyper-parameters is a key issue in SVR and remains a worthwhile research topic. The predicted versus measured plot for the test data using the SVR method is shown in Figure 4.
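The two preprocessing steps described in this section, scaling with factors learned from the training data only and v-fold partitioning, can be sketched in plain Python as follows. The function names are hypothetical and the code does not call libSVM; it only illustrates the data handling around it.

```python
def fit_scaler(train_rows):
    """Record the (min, max) of every attribute on the training data only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]


def apply_scaler(rows, ranges):
    """Map each attribute linearly so that the training range becomes [-1, +1]."""
    scaled = []
    for row in rows:
        out = []
        for v, (lo, hi) in zip(row, ranges):
            # A constant attribute carries no information; map it to 0.
            out.append(0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0)
        scaled.append(out)
    return scaled


def k_fold_indices(n_samples, k):
    """Yield (train, held_out) index lists; each sample is held out exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in folds:
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, held_out


# Peak intensity I and peak potential V of samples T1, T10 and T5 in Table 1.
train_rows = [[3.184, 0.744], [32.59, 0.732], [16.33, 0.732]]
ranges = fit_scaler(train_rows)
train_scaled = apply_scaler(train_rows, ranges)
# A test row (sample T10D10) is scaled with the training ranges, so its
# values may fall outside [-1, +1].
test_scaled = apply_scaler([[58.23, 0.744]], ranges)
```

Note that 42 training rows cannot be divided into 5 subsets of exactly equal size; `k_fold_indices(42, 5)` produces folds of 9, 9, 8, 8 and 8 rows, which is the usual compromise.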

Figure 4. Predicted vs. measured plot for SVR (RMSE-TCP = 0.0445, RMSE-DCP = 0.0934)

3.5. GPR
The software used in this paper was gpml-matlab. First, the response variables were transformed to have zero mean before the data were used to train a Gaussian process [20]. The rationale is that if the mean of the responses is far from zero, the constant bias (offset) term in the covariance function becomes relatively large, so the resulting covariance matrix has a large condition number [20] and the precision of its numerical inversion degrades significantly. The predicted versus measured plot for the test data using the GPR method is shown in Figure 5.
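The zero-mean transformation is simple to apply around any regression routine. The sketch below, with hypothetical helper names `center` and `uncenter`, subtracts the training mean before fitting and adds it back to the predictions; it assumes the mean is estimated from the training responses only.

```python
def center(y_train):
    """Subtract the training mean so that the responses have zero mean."""
    mean = sum(y_train) / len(y_train)
    return [v - mean for v in y_train], mean


def uncenter(predictions, mean):
    """Add the training mean back to predictions made on the centered scale."""
    return [p + mean for p in predictions]


# TCP concentrations (μM) of samples T1, T5 and T10 in Table 1.
y_train = [0.1, 0.5, 1.0]
y_centered, y_mean = center(y_train)
# A model trained on y_centered predicts on the centered scale;
# uncenter() converts such predictions back to concentrations.
restored = uncenter(y_centered, y_mean)
```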

Figure 5. Predicted vs. measured plot for GPR (RMSE-TCP = 0.0415, RMSE-DCP = 0.091)


4. Discussions
4.1. Regression Effect Comparison
Figure 6 compares the results of the three regression methods. The goal of this paper was to present GPR as an attractive regression method for resolving overlapped electrochemical signals. PLSR performed reasonably well on our data set, implying a strong linear relationship between the measured parameters and the concentrations of TCP and DCP. SVR and GPR give considerably lower RMSE than PLSR (0.0445 and 0.0415 versus 0.0913 for TCP; 0.0934 and 0.091 versus 0.1695 for DCP), which implies that there may be weak non-linearity in this data set. In theory, SVR and GPR are advantageous because they can perform nonlinear regression efficiently on high-dimensional data sets, so that nonlinear interferents are modeled implicitly. Motivated by the results in kernel-based learning and support vector machines, a nonlinear kernel-based PLS methodology was proposed by Rosipal and Trejo [21].

Figure 6. RMSE comparison between the three methods

Figure 6 shows that the Gaussian process achieves a further improvement over SVR. GPR has some advantages over other kernel methods for regression because it is a fully statistical model. The important advantage of GPR over non-Bayesian models is its explicit probabilistic formulation, which not only makes it possible to infer hyper-parameters in a Bayesian framework but also provides confidence intervals for predictions.

4.2. Sparse approaches
GPR and SVR both have a complexity of $O(n^3)$, but SVR is usually faster since it uses a sparse scheme; in practice, SVR with cross-validation is slow too. For GPR, the covariance matrix, whose size equals the number of training samples, must be inverted each time the hyperparameters are adapted. The computational cost of this approach is very high for large data sets, which makes it difficult for GPR models to deal with more than one thousand training samples. Sparse approaches developed to reduce the computation time of GPR [22] could be applied for our purpose. Combined with parameterized kernels, sparse versions of GPR could provide a very powerful general regression system.

4.3. Multi-output GPR
The data set involves a multi-component calibration problem, that is, the need to determine multiple properties (response variables) of the analyte. The GPR study reported in this paper adopts a simplified solution that models each response independently. It is possible to define a Gaussian process with multiple response variables, but modeling multiple outputs is a challenge because the cross covariance between the different outputs must be computed. Boyle and Frean [23] applied the convolution process formalism to establish dependencies between output variables, where each latent function is represented as a Gaussian process. Dependent Gaussian processes should be particularly valuable in cases where one output is expensive to sample but covaries strongly with a second that is cheap. By inferring both the coupling and the independent aspects of the data, the cheap observations can be used as a proxy for the expensive ones. This is an important direction for future research.

5. Conclusions and Future Work
GPR has been used to obtain quantitative information on mixtures containing TCP and DCP, and PLSR and SVR have been employed for comparison. Our results show that GPR is comparable to, and in some cases outperforms, the PLSR and SVR methods. It proved capable of providing reliable results, within a well-defined accuracy range, for a set of CPs in a single analytical run. The proposed method is a promising analytical tool for improving the reliability of the determination of different kinds of mixtures in environmental monitoring and analytical laboratories. In future work, we will study more effective methods to improve the regression results, focusing mainly on sparse approaches to multi-output Gaussian process regression for multi-component analysis.

6. Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 61070133), the Natural Science and Technology Foundation of Jiangsu Province of China (No. BK210311) and the Natural Science Foundation of Jiangsu Province of China (No. BK2009697).

7. References
[1] Eberlin MN, da Silva RC, Faster and simpler determination of chlorophenols in water by fiber introduction mass spectrometry [J], Analytica Chimica Acta, 2008, 620(1-2):97-102
[2] Zhang JL, Song QJ, Hu X, Zhang EH, Gao H, Dye-sensitized phototransformation of chlorophenols and their subsequent chemiluminescence reactions [J], Journal of Luminescence, 2008, 128(12):1880-1885
[3] Higashi Y, Fujii Y, HPLC-UV Analysis of Phenol and Chlorophenols in Water After Precolumn Derivatization with 4-Fluoro-7-nitro-2,1,3-benzoxadiazole [J], Journal of Liquid Chromatography & Related Technologies, 2009, 32(16):2372-2383
[4] Diaz-Diaz G, Blanco-Lopez MC, Lobo-Castanon MJ, Miranda-Ordieres AJ, Tunon-Blanco P, Chloroperoxidase Modified Electrode for Amperometric Determination of 2,4,6-Trichlorophenol [J], Electroanalysis, 2009, 21(12):1348-1353
[5] El Ichi S, Marzouki MN, Korri-Youssoufi H, Direct monitoring of pollutants based on an electrochemical biosensor with novel peroxidase (POX1B) [J], Biosens. Bioelectron., 2009, 24(10):3084-3090
[6] Brereton R.G., Introduction to multivariate calibration in analytical chemistry [J], Analyst, 2000, 125(11):2125-2154
[7] Geladi P., Some recent trends in the calibration literature [J], Chemom. Intell. Lab. Syst., 2002, 60(1-2):211-224
[8] K. Bessant, S. Saini, A chemometric analysis of dual pulse staircase voltammograms obtained in mixtures of ethanol, fructose and glucose, Journal of Electroanalytical Chemistry, 2000, 489(12):76-83
[9] J. Saurina, S.H. Cassou, E. Fabregas, S. Alegret, Cyclic voltammetric simultaneous determination of oxidizable amino acids using multivariate calibration methods [J], Analytica Chimica Acta, 2000, 405(1-2):153-160
[10] Y. Ni, L. Wang, S. Kokot, Simultaneous determination of nitrobenzene and nitro-substituted phenols by differential pulse voltammetry and chemometrics [J], Analytica Chimica Acta, 2001, 431(1):101-113
[11] A. Herrero, M.C. Ortiz, Modelling the background current with partial least squares regression and transference of the calibration models in the simultaneous determination of Tl and Pb by stripping voltammetry [J], Talanta, 1998, 46(1):129-138
[12] B. Scholkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, USA, 2001
[13] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines [J], Advances in Computational Mathematics, 2000, 13(1):1-50
[14] Peter Lykos, The Beer-Lambert law revisited: A development without calculus [J], Journal of Chemical Education, 1992, 69(9):730
[15] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, USA, 1995
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, USA, 1998
[17] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge Univ. Press, Cambridge, UK, 2000
[18] B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, UK, 2002
[19] J.A.K. Suykens, T. van Gestel, J. de Brabanter, B. de Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002
[20] R.M. Neal, Monte Carlo implementation of Gaussian process models for Bayesian regression and classification, Technical Report No. 9702, Department of Statistics, University of Toronto, Canada, 1997
[21] R. Rosipal and L.J. Trejo, Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space [J], Journal of Machine Learning Research, 2001, 2:97-123
[22] Csato L, Opper M, Sparse representation for Gaussian process models [J], Advances in Neural Information Processing Systems, 2001, 13:444-450
[23] P. Boyle and M. Frean, Dependent Gaussian processes, Advances in Neural Information Processing Systems, The MIT Press, 2005, 17:217-224

- 131 -

Using Multivariate Regression Methods to Resolve Overlapped Electrochemical Signals 1

ZHU Xin-feng, 2 WANG Jian-dong, 3 LI Bin College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, [email protected] 2 College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, [email protected] 3 College of Information Technology, Yangzhou University, [email protected]

*1

doi:10.4156/jdcta.vol4. issue9.15

Abstract This paper proposes the application of Gaussian process regression (GPR) as an alternative regression model to resolve the hard overlapped electrochemical signals belonging to the 2,4,6trichlorophenol/2,6-dichlorophenol (TCP/DCP) system. Gaussian process derives from the perspective of Bayesian non-parametric regression methods, in terms of the parameterization of the covariance function, results in its good performance for the development of a calibration model for both linear and non-linear data sets. The multivariate regression model developed by GPR was compared with some traditional regression methods such as partial least squares regression (PLSR), and support vector regression (SVR). The comparative results were satisfied. The satisfactory results obtained through GPR method suggest that it can be used as a more effective and promising tool for multivariate regression tasks than the others.

Keywords: Multicomponent, Overlapped Electrochemical Signals, Multivariate Regression, PLSR, SVR, GPR

1. Introduction Chlorophenols (CPs) are common effluents in agrochemical, pulp and paper, pharmaceutical, and dyestuff industries. They are known to cause serious health problems to humans; hence these chemicals are found in the list of priority pollutants of the Environmental Protection Agency (EPA). The detection of them is an important environmental issue. Many methods have been applied for this purpose, such as mass spectroscopy [1], chemiluminescence reactions method [2] and so on. However, most of the methods were focused on the detection of only one kind of CPs, which may be due to the interference of each other. Complex products often contain sets of chlorophenols rather than a single substance and each compound possesses unique physical and chemical properties. Consequently, the quantitative analysis of multicomponent mixtures containing bundles of chlorophenols is a complicated analytical task of significant practical importance. The most common method for the determination of the mixtures of CPs is different types of chromatography [3]. However some difficulties still exist, such as the high price and the complicated procedure. Over the past decades, electrochemical method has seen tremendous evolution and been a major analytical technique with application in both industry and many research disciplines because of its simplicity, rapidity and sensitivity. Many electrochemical methods have been used for the detection of one kind of CPs [4][5].One of the main limitations to the application of electroanalytical techniques in the field of quantitative analysis is often due to the lack of selectivity for serious overlapping signals. It is important to develop reliable, accurate and available tool for the determination of CPs in multicomponent mixtures using electrochemical methods. Such possibility can be found in application of modern chemometrics approaches to resolution of overlapping electrochemical signals of complex mixtures. 
Amongst the most used chemometric techniques for simultaneous evaluation of overlapped signals, multivariate calibration [6][7] is widely used. Many successful applications of multivariate calibration based on different regression methods have been used: multiple linear regression (MLR) [8], principal component regression (PCR) [9][10], partial least squares regression (PLSR) [9][10][11] and support

- 123 -

Using Multivariate Regression Methods to Resolve Overlapped Electrochemical Signals ZHU Xin-feng, WANG Jian-dong, LI Bin

vector regression(SVR)[12][13]. MLR, as a most commonly used chemometric method, has the limitation for overlapping signals studies because it provides relatively poor accuracy. PLSR, a method to use dimension reduction methods together with linear regression, performs poorly in nonlinear relations. SVR, a possible huge advantage of it is its ability to model nonlinear relations, has been applied in various fields successfully. But SVR has not fully satisfactory probabilistic interpretation. As a result of its good performance in practice and desirable analytical properties, Gaussian process models have been widely applied, but to date Gaussian process has been rarely used in the chemometric community. Here, we demonstrated the feasibility and practical performance of a Gaussian process regression (GPR) method for the analysis of the mixture containing different kinds of CPs (TCP: 2,4,6-trichlorophenol, DCP: 2,6-Dichlorophenol) which is named as TCP/DCP system. Gaussian process as a chemometric methodology is compared with several most used regression methods such as PLSR and SVR, with the aim of predicting the concentration of TCP and DCP, using different peak parameters as input data: position, height, half width, derivative and area of the voltammetric peaks.

2. Fundamentals of Regression

2.1. Multilinear Regression (MLR)

Traditional regression techniques are based on multilinear regression. Consider the case where a regression model is to be developed from a set of n samples of analyte. Let $x_i$ denote the measurements of the i-th sample, a vector of the p measured parameters; the n samples give the input matrix $X = (x_1, \ldots, x_n)^{T}$. Likewise, let $y_i$ be a q-dimensional vector, where q is the number of outputs, giving the output matrix $Y = (y_1, \ldots, y_n)^{T}$. The regression task is to build a multivariate regression model of the form $Y = f(X)$. For linear regression specifically:

$$Y = XB + E \qquad (1)$$

where B is a matrix of regression coefficients and E is the residual matrix. This linear regression model is consistent with the Beer-Lambert law [14], and the regression coefficients are calculated as follows:

$$B = (X^{T}X)^{-1}X^{T}Y \qquad (2)$$

MLR is a direct and simple algorithm, but if the columns of X are multicollinear, the matrix $X^{T}X$ in Eq. (2) is singular or nearly singular and MLR leads to an ill-posed inverse problem, which makes it infeasible in practice. MLR also cannot handle nonlinear regression problems. Although MLR is of limited use in many domains, it helps in understanding enhanced linear regression methods such as PLSR.
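As an illustration of Eq. (2) and of the multicollinearity problem, the following NumPy sketch fits B on synthetic data (not the TCP/DCP measurements) and then compares the condition number of $X^{T}X$ for independent and nearly collinear columns. The variable names, the noise level and the synthetic coefficients are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic calibration data: n samples, p predictors, q responses (assumed sizes).
n, p, q = 48, 3, 2
X = rng.normal(size=(n, p))
B_true = np.array([[1.0, 0.5], [0.0, 2.0], [-1.0, 0.3]])
Y = X @ B_true + 0.01 * rng.normal(size=(n, q))

# Eq. (2): B = (X^T X)^{-1} X^T Y, solved without forming the inverse explicitly.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Nearly collinear design: the third column almost duplicates the first one.
X_col = X.copy()
X_col[:, 2] = X_col[:, 0] + 1e-6 * rng.normal(size=n)

cond_ok = np.linalg.cond(X.T @ X)
cond_bad = np.linalg.cond(X_col.T @ X_col)
print(B_hat.round(2))
print(f"cond(X^T X): {cond_ok:.1e} (independent) vs {cond_bad:.1e} (collinear)")
```

A very large condition number means that small measurement errors in X or Y are strongly amplified in the estimated coefficients, which is the practical meaning of the ill-posed inverse problem mentioned above.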

2.2. PLSR

PLSR is one of the most common ways of combining dimension reduction with linear regression. It generalizes and combines features of principal component analysis and multiple regression. It is useful when X is likely to be singular and the MLR approach is no longer feasible because of multicollinearity. PLSR decomposes both X and Y into a product of a common set of orthogonal factors and a set of specific loadings. An obvious question is how many latent variables are needed to obtain the best generalization for the prediction of new observations. This is generally decided by resampling techniques such as cross-validation or bootstrapping.
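The choice of the number of latent variables by cross-validation can be sketched as follows. This is a minimal NumPy implementation of single-response PLS (PLS1, NIPALS variant) on synthetic collinear data; it is not the iToolbox code used in this paper, and the function names, the number of folds and the data are assumptions for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 (single response) with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, c = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                # weight vector: direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                   # scores of the current latent variable
        tt = t @ t
        p = Xk.T @ t / tt            # X loadings
        ck = yk @ t / tt             # y loading
        Xk = Xk - np.outer(t, p)     # deflate X
        yk = yk - ck * t             # deflate y
        W.append(w)
        P.append(p)
        c.append(ck)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    coef = W @ np.linalg.solve(P.T @ W, c)   # regression vector in the original X space
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

def rmsecv(X, y, n_components, n_folds=6, seed=0):
    """Root mean square error of k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_err = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        model = pls1_fit(X[train], y[train], n_components)
        sq_err += np.sum((y[fold] - pls1_predict(model, X[fold])) ** 2)
    return np.sqrt(sq_err / len(y))

# Synthetic, strongly collinear predictors standing in for peak parameters.
rng = np.random.default_rng(1)
latent = rng.normal(size=(48, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(48, 5))
y = latent @ np.array([1.0, -0.5]) + 0.05 * rng.normal(size=48)

scores = {k: rmsecv(X, y, k) for k in range(1, 6)}
best = min(scores, key=scores.get)
print(scores, "-> chosen number of latent variables:", best)
```

In practice the smallest number of latent variables whose RMSECV is close to the minimum is often preferred, since additional components mostly fit noise.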

2.3. SVR

Support vector machines (SVMs) were initially developed by Vapnik [15][16]. For the detailed theoretical background on SVMs for both classification and regression, the reader is referred to Refs. [17][18][19]. The idea of SVR is to compute a linear regression function in a high-dimensional feature space into which the input data are mapped by a nonlinear function.
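In the standard ε-insensitive formulation, this idea can be written as below. The symbols $\phi$, $w$, $b$, $\varepsilon$, $\xi_i$ and $\xi_i^{*}$ are not defined in this paper and are introduced here only for illustration; C is the penalty parameter and γ the RBF kernel width.

```latex
f(x) = w^{T}\phi(x) + b, \qquad
\min_{w,\,b,\,\xi,\,\xi^{*}} \; \tfrac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right)
\quad \text{s.t.} \quad
\begin{cases}
y_i - f(x_i) \le \varepsilon + \xi_i \\
f(x_i) - y_i \le \varepsilon + \xi_i^{*} \\
\xi_i,\ \xi_i^{*} \ge 0
\end{cases}

k(x, x') = \phi(x)^{T}\phi(x'), \qquad
k_{\mathrm{RBF}}(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right)
```

The mapping $\phi$ never has to be evaluated explicitly, because the solution depends on the data only through the kernel $k(x, x')$.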

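2.4. GPR

Equation (1) can be generalized to a regression model that is non-linear with respect to the predictors x but still linear with respect to the basis functions $\phi_m$:

$$y_i = \sum_{m=1}^{M} \phi_m(x_i)\, b_m + e_i \qquad (3)$$

The key assumption in GP modeling is that the data can be represented as a sample from a multivariate Gaussian distribution. Let $K_{**}$, $K_{*}$ and $K$ be the covariance matrices among the outputs to be predicted $y_{*}$, between $y_{*}$ and the training outputs $y$, and among the training outputs, respectively. We then have

$$\begin{bmatrix} y_{*} \\ y \end{bmatrix} \sim N\left(0,\; \begin{bmatrix} K_{**} & K_{*} \\ K_{*}^{T} & K \end{bmatrix}\right) \qquad (4)$$

The conditional probability $p(y_{*} \mid y)$ follows a Gaussian distribution

$$y_{*} \mid y \sim N\left(K_{*}K^{-1}y,\; K_{**} - K_{*}K^{-1}K_{*}^{T}\right) \qquad (5)$$

The best estimate for $y_{*}$ is the mean of this distribution

$$E(y_{*}) = K_{*}K^{-1}y \qquad (6)$$

and the uncertainty in the estimate is captured by its variance

$$\mathrm{var}(y_{*}) = K_{**} - K_{*}K^{-1}K_{*}^{T} \qquad (7)$$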
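The predictive mean and variance given above can be sketched in a few lines of NumPy. The squared-exponential covariance function, its hyper-parameter values and the small noise term added to the diagonal of K are assumptions for this example; the paper does not state which covariance function was used, and this is not the gpml-matlab implementation.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X, y, X_star, noise_var=1e-4):
    """Mean and variance of y* given y, via a Cholesky factorization of K."""
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))   # K (with assumed noise term)
    K_s = sq_exp_kernel(X_star, X)                          # K*
    K_ss = sq_exp_kernel(X_star, X_star)                    # K**
    L = np.linalg.cholesky(K)                               # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # K^{-1} y
    V = np.linalg.solve(L, K_s.T)                           # L^{-1} K*^T
    mean = K_s @ alpha                                      # K* K^{-1} y
    var = np.diag(K_ss) - np.sum(V**2, axis=0)              # diag(K** - K* K^{-1} K*^T)
    return mean, var

# Toy one-dimensional example: one test point inside the data, one far outside.
X = np.linspace(0.0, 5.0, 12)[:, None]
y = np.sin(X).ravel()
X_star = np.array([[2.5], [8.0]])
mean, var = gp_predict(X, y, X_star)
print("mean:", mean)
print("variance:", var)
```

The variance is small where training data are dense and grows for the test point far from the data, which is how the model expresses its own uncertainty.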

3. Experimental

3.1. Materials and Instruments

2,4,6-Trichlorophenol and 2,6-dichlorophenol were purchased from Yancheng Huaye Pharmaceutical & Chemical Factory (Yancheng, China). All other chemicals were of analytical grade and were used without further purification. The phosphate buffer solution (PBS) was prepared from 0.1 M NaH2PO4-Na2HPO4. Double-distilled water was used to prepare all solutions. All electrochemical experiments were performed on a CHI 760D electrochemical workstation (Chenhua Instruments, Shanghai, China) in pH 3.0 PBS. The three-electrode system consisted of a glassy carbon electrode (GCE, d=3 mm, Shanghai Chenhua, China) as the working electrode, a saturated calomel electrode as the reference electrode and a platinum disk electrode (d=1 mm, Tianjing Lanlike, China) as the counter electrode.

3.2. DataSet

The peak parameters selected for our TCP/DCP experimental system are as follows: Peak potential (V): potential at which the current reaches its maximum; Peak intensity (I): maximum current (intensity) with respect to the baseline; Peak area (S): area of the peak corrected for the baseline (arbitrary units); Half width (W): the difference between the peak values of the maximum and the minimum in the first derivative of the peak (arbitrary units); Half-peak potential (Eh): the half-peak electrode potential. The performance of each calibration model was tested by the root mean square error (RMSE), estimated on two different test sets for all cases. It is defined by the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i}\left(y_i - y^{*}_{i}\right)^{2}}{n}} \qquad (8)$$

where $y_i$ is the measured concentration of the analyte, $y^{*}_{i}$ is the predicted concentration, and n is the number of samples. Figure 1 shows the overlapping voltammograms of the two compounds and their mixture. The peak potentials (Ep) of the individual analytes lie within a few mV of each other, which implies very severe overlapping of the individual signals, as shown by the single peak in the voltammogram of the mixture. The voltammetric peak parameters (V, I, S, W, D and Eh) and the concentrations of TCP and DCP for all the mixtures studied in our work are shown in Table 1. A common stochastic strategy is adopted in which the random partitioning into training and test sets is repeated several times. The calibration methods are evaluated on each partition, and the average RMSE is used for comparison. In each partition, 42 rows were selected for training and 6 rows for prediction. Figures 3 to 5 show the results of the PLSR, SVR and GPR methods on the prediction data.

Table 1. Values of the voltammetric peak parameters used in the present work

| Sample | TCP (μM) | DCP (μM) | D | I | S | W | V | Eh |
|---|---|---|---|---|---|---|---|---|
| T1 | 0.1 | 0 | 8.606 | 3.184 | 3.447 | 0.048 | 0.744 | 0.696 |
| T2 | 0.2 | 0 | 18.793 | 6.85 | 7.559 | 0.048 | 0.736 | 0.688 |
| T3 | 0.3 | 0 | 28.45 | 10.29 | 11.39 | 0.048 | 0.736 | 0.688 |
| T4 | 0.4 | 0 | 36.38 | 13.07 | 14.43 | 0.048 | 0.736 | 0.688 |
| T5 | 0.5 | 0 | 44.44 | 16.33 | 18.01 | 0.044 | 0.732 | 0.688 |
| T6 | 0.6 | 0 | 51.52 | 18.56 | 21.55 | 0.044 | 0.732 | 0.688 |
| T7 | 0.7 | 0 | 62.42 | 22.74 | 25.4 | 0.048 | 0.732 | 0.684 |
| T8 | 0.8 | 0 | 70.88 | 25.76 | 28.63 | 0.048 | 0.732 | 0.684 |
| T9 | 0.9 | 0 | 79.82 | 29.46 | 32.02 | 0.048 | 0.732 | 0.684 |
| T10 | 1 | 0 | 88.8 | 32.59 | 36.48 | 0.048 | 0.732 | 0.684 |
| D1 | 0 | 0.1 | 13.47 | 3.37 | 3.008 | 0.04 | 0.752 | 0.712 |
| D2 | 0 | 0.2 | 18.369 | 5.594 | 5.448 | 0.04 | 0.752 | 0.712 |
| D3 | 0 | 0.3 | 24.29 | 8.06 | 7.929 | 0.04 | 0.752 | 0.712 |
| D4 | 0 | 0.4 | 28.71 | 9.954 | 9.647 | 0.044 | 0.756 | 0.712 |
| D5 | 0 | 0.5 | 35.44 | 12.28 | 12.57 | 0.04 | 0.752 | 0.712 |
| D6 | 0 | 0.6 | 39.8 | 14.05 | 15.14 | 0.044 | 0.756 | 0.712 |
| D7 | 0 | 0.7 | 45.72 | 16.21 | 17.8 | 0.044 | 0.756 | 0.712 |
| D8 | 0 | 0.8 | 49.68 | 18.77 | 20.2 | 0.044 | 0.756 | 0.712 |
| D9 | 0 | 0.9 | 55.03 | 21.14 | 22.46 | 0.044 | 0.756 | 0.712 |
| D10 | 0 | 1 | 59.17 | 23.37 | 25.51 | 0.052 | 0.764 | 0.712 |
| T1D1 | 0.1 | 0.1 | 19.314 | 5.551 | 6.186 | 0.044 | 0.748 | 0.704 |
| T6D1 | 0.6 | 0.1 | 45.12 | 17.35 | 20.95 | 0.052 | 0.728 | 0.676 |
| T2D2 | 0.2 | 0.2 | 34.14 | 12.17 | 13.36 | 0.044 | 0.744 | 0.7 |
| T7D2 | 0.7 | 0.2 | 62.81 | 23.63 | 27.54 | 0.048 | 0.732 | 0.684 |
| T3D3 | 0.3 | 0.3 | 51.98 | 18.72 | 20.54 | 0.044 | 0.744 | 0.7 |
| T8D3 | 0.8 | 0.3 | 80.52 | 29.8 | 34.14 | 0.048 | 0.736 | 0.688 |
| T4D4 | 0.4 | 0.4 | 67.3 | 24.3 | 26.76 | 0.048 | 0.744 | 0.696 |
| T9D4 | 0.9 | 0.4 | 101.51 | 37.66 | 43.15 | 0.048 | 0.736 | 0.688 |
| T5D5 | 0.5 | 0.5 | 80.5 | 29.09 | 32.15 | 0.048 | 0.744 | 0.696 |
| T10D5 | 1 | 0.5 | 119.96 | 44.63 | 51.18 | 0.052 | 0.74 | 0.688 |
| T1D6 | 0.1 | 0.6 | 54.37 | 18.45 | 18.85 | 0.044 | 0.744 | 0.7 |
| T6D6 | 0.6 | 0.6 | 92.87 | 34.76 | 40.67 | 0.048 | 0.744 | 0.696 |
| T2D7 | 0.2 | 0.7 | 65.19 | 22.73 | 23.77 | 0.048 | 0.748 | 0.7 |
| T7D7 | 0.7 | 0.7 | 109.37 | 40.52 | 45.67 | 0.048 | 0.744 | 0.696 |
| T3D8 | 0.3 | 0.8 | 75.67 | 26.77 | 28.44 | 0.048 | 0.748 | 0.7 |
| T8D8 | 0.8 | 0.8 | 122.55 | 45.72 | 52.17 | 0.048 | 0.744 | 0.696 |
| T4D9 | 0.4 | 0.9 | 88.04 | 31.77 | 34.39 | 0.048 | 0.748 | 0.7 |
| T9D9 | 0.9 | 0.9 | 136.8 | 52.68 | 59.72 | 0.048 | 0.744 | 0.696 |
| T5D10 | 0.5 | 1 | 99.95 | 36.77 | 40.66 | 0.048 | 0.748 | 0.7 |
| T10D10 | 1 | 1 | 151.3 | 58.23 | 68.52 | 0.052 | 0.744 | 0.692 |
| T1D9 | 0.1 | 0.9 | 74.73 | 25.92 | 27.97 | 0.044 | 0.752 | 0.708 |
| T9D1 | 0.9 | 0.1 | 60.17 | 23.97 | 30.51 | 0.052 | 0.708 | 0.656 |
| T3D5 | 0.3 | 0.5 | 50.75 | 19.48 | 23.24 | 0.048 | 0.744 | 0.696 |
| T5D3 | 0.5 | 0.3 | 32.2 | 12.18 | 14.39 | 0.048 | 0.72 | 0.672 |
| T6D8 | 0.6 | 0.8 | 84.85 | 32.57 | 38.25 | 0.052 | 0.744 | 0.692 |
| T8D6 | 0.8 | 0.6 | 86.12 | 32.23 | 37.12 | 0.048 | 0.74 | 0.692 |
| T10D2 | 1 | 0.2 | 67.18 | 26.04 | 32 | 0.048 | 0.732 | 0.684 |
| T2D10 | 0.2 | 1 | 78.13 | 28.34 | 31.6 | 0.048 | 0.736 | 0.688 |

V = peak potential (V); I = peak intensity (×10^-7 A); S = area of the peak (×10^-8 cm2); W = half width (V); D = derivative (×10^-6 cm2); Eh = half-peak potential (V). In the sample names, T: 2,4,6-TCP; D: 2,6-DCP; 1, 2, …, 10 = 0.1, 0.2, …, 1.0 μM.
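The repeated random partitioning into 42 training rows and 6 test rows, with the RMSE of Eq. (8) averaged over the partitions, can be sketched as follows. The number of repetitions, the placeholder least-squares model and the synthetic data are assumptions for this example; in the paper the models are PLSR, SVR and GPR and the data are those of Table 1.

```python
import numpy as np

def rmse(y_measured, y_predicted):
    """Root mean square error, Eq. (8)."""
    return np.sqrt(np.mean((y_measured - y_predicted) ** 2))

def repeated_holdout(X, y, fit, predict, n_test=6, n_repeats=20, seed=0):
    """Average test RMSE over repeated random train/test partitions."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(y))
        test, train = perm[:n_test], perm[n_test:]
        model = fit(X[train], y[train])
        errors.append(rmse(y[test], predict(model, X[test])))
    return float(np.mean(errors))

# Placeholder model (assumption): ordinary least squares with an intercept.
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]

# Synthetic stand-in for a 48-row data set with three peak parameters.
rng = np.random.default_rng(42)
X = rng.uniform(size=(48, 3))
y = X @ np.array([0.5, 0.3, -0.2]) + 0.02 * rng.normal(size=48)

avg_rmse = repeated_holdout(X, y, fit_ols, predict_ols)
print(f"average RMSE over partitions: {avg_rmse:.4f}")
```

Averaging over several partitions reduces the dependence of the comparison on one particular choice of the 6 test rows, which matters with only 48 samples.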


Figure 1. Differential pulse voltammograms of 2.0 μM DCP (a), 2.0 μM TCP (b) and the mixture of 2.0 μM DCP and 2.0 μM TCP in pH 2.0 solution.

3.3. PLSR

PLSR calculations were performed using iToolbox, which provides interval PLS (iPLS), backward interval PLS, moving window PLS, etc. We used iPLS to build the PLS model, train it and predict our experimental data. Figure 2 shows that there is little difference between 4 and 5 PLS components, so the number of significant components (latent variables, LVs) chosen in our model was 4. The predicted versus measured plot for the test data using the PLSR method is shown in Figure 3.

Figure 2. RMSECV for different numbers of latent variables

Figure 3. Predicted vs. measured plot for PLSR (RMSE-TCP = 0.0913, RMSE-DCP = 0.1695)


3.4. SVR

SVR calculations were performed using libSVM. Scaling the data before applying SVM is very important, because it prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges and avoids numerical difficulties during the calculation. We linearly scaled each attribute of the training data to the range [-1, +1] and used the same scaling factors to scale the test data. The RBF kernel is in general a reasonable first choice, so we chose it as the kernel function. There are two parameters to set with an RBF kernel: C and γ. To find good values of C and γ, we used cross-validation. In v-fold cross-validation, the training set is divided into v subsets of equal size, and each subset in turn is tested using the model trained on the remaining v-1 subsets. In our work, we used 5-fold cross-validation to select the best hyper-parameters. Fine selection of the kernel and hyper-parameters is a key issue in SVR and remains a valuable research topic. The predicted versus measured plot for the test data using the SVR method is shown in Figure 4.
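The two preprocessing steps described above, scaling with factors taken from the training data only and splitting the training set into v folds, can be sketched in NumPy as follows. The helper names are hypothetical and this is not the libSVM scaling tool; the example rows are the D and I values of samples T1, T5, T10 (training) and T3 (test) from Table 1.

```python
import numpy as np

def fit_scaler(X_train, lo=-1.0, hi=1.0):
    """Store the per-attribute minimum and range of the training data."""
    x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard against constant columns
    return x_min, span, lo, hi

def apply_scaler(scaler, X):
    """Map X linearly so that the training range becomes [lo, hi]."""
    x_min, span, lo, hi = scaler
    return lo + (hi - lo) * (X - x_min) / span

def v_fold_indices(n_samples, v=5, seed=0):
    """Yield (train, test) index arrays for v-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, v)      # folds of (nearly) equal size
    for k in range(v):
        train = np.concatenate([folds[j] for j in range(v) if j != k])
        yield train, folds[k]

X_train = np.array([[8.606, 3.184], [44.44, 16.33], [88.8, 32.59]])
X_test = np.array([[28.45, 10.29]])

scaler = fit_scaler(X_train)
print(apply_scaler(scaler, X_train))
print(apply_scaler(scaler, X_test))   # scaled with the training factors, not its own

for train, test in v_fold_indices(42, v=5):
    print(len(train), "training rows,", len(test), "validation rows")
```

Each candidate pair (C, γ) would then be scored by its average error over the v validation folds, and the pair with the lowest error retained.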

Figure 4. Predicted vs. measured plot for SVR (RMSE-TCP = 0.0445, RMSE-DCP = 0.0934)

3.5. GPR

The software used in this paper was gpml-matlab. First, the response variables were transformed to have zero mean before the data were used for training a Gaussian process [20]. The rationale is that if the mean of the responses is far from zero, the constant bias (offset) term in the covariance function becomes relatively large, so the resulting covariance matrix has a large condition number [20] and the precision of the numerical inversion of the covariance matrix degrades significantly. The predicted versus measured plot for the test data using the GPR method is shown in Figure 5.
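The effect of a large constant bias term on the condition number can be checked numerically. The sketch below assumes a squared-exponential covariance plus a noise term, with the constant bias modelled as a value added to every entry of the covariance matrix; the kernel, the hyper-parameter values and the synthetic inputs are assumptions, not the settings used with gpml-matlab.

```python
import numpy as np

def sq_exp(A, B, length_scale=0.1):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(42, 2))
K = sq_exp(X, X) + 1e-3 * np.eye(len(X))   # covariance without a bias term

# A constant bias term with variance b adds b to every entry of K.
bias_values = [0.0, 1.0, 1e4]
conds = [np.linalg.cond(K + b) for b in bias_values]
for b, c in zip(bias_values, conds):
    print(f"bias variance {b:g}: condition number {c:.2e}")

# Centering the responses removes the need for a large bias term.
y = 100.0 + np.sin(X[:, 0])
y_mean = y.mean()
y_centered = y - y_mean   # train on y_centered, then add y_mean back to the predictions
```

With centered responses the bias term can stay small, so the covariance matrix remains well conditioned and the mean is simply added back after prediction.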

Figure 5. Predicted vs. measured plot for GPR (RMSE-TCP = 0.0415, RMSE-DCP = 0.091)


4. Discussions

4.1. Regression Effect Comparison

Figure 6 compares the results of the three regression methods. The goal of this paper was to present GPR as an attractive regression method for resolving overlapped electrochemical signals. PLSR performed reasonably well on our data set, implying a strong linear relationship between the measured parameters and the concentrations of TCP and DCP. SVR and GPR give considerably lower RMSE than PLSR (0.0445 and 0.0415 versus 0.0913 for TCP; 0.0934 and 0.091 versus 0.1695 for DCP), which implies that there may be weak non-linearity in this data set. In theory, SVR and GPR are advantageous because they can perform nonlinear regression efficiently on high-dimensional data sets, so nonlinear interferents are modeled implicitly. Motivated by results in kernel-based learning and support vector machines, a nonlinear kernel-based PLS methodology was proposed by Rosipal and Trejo [21].

Figure 6. RMSE comparison between the three methods

Figure 6 shows that the Gaussian process achieves a further improvement over SVR. GPR has some advantages over other kernel methods for regression because it is a fully statistical model. The important advantage of GPR over non-Bayesian models is its explicit probabilistic formulation, which not only makes it possible to infer hyper-parameters in a Bayesian framework but also provides confidence intervals for predictions.
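As a small illustration of the last point, an interval for each prediction follows directly from the Gaussian predictive mean and variance. The numbers below are hypothetical values, not results from this study, and the 1.96 factor assumes a 95% interval.

```python
import numpy as np

def gp_interval(mean, var, z=1.96):
    """Interval mean +/- z * standard deviation for Gaussian predictions."""
    std = np.sqrt(np.maximum(var, 0.0))   # clip tiny negative values from round-off
    return mean - z * std, mean + z * std

mean = np.array([0.30, 0.80])       # hypothetical predicted concentrations (μM)
var = np.array([0.0004, 0.0025])    # hypothetical predictive variances
lower, upper = gp_interval(mean, var)
for m, lo, hi in zip(mean, lower, upper):
    print(f"{m:.2f} μM, 95% interval [{lo:.4f}, {hi:.4f}]")
```

Neither PLSR nor SVR provides such an interval as part of the model itself.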

4.2. Sparse approaches

GPR and SVR both have a complexity of $O(n^{3})$, but SVR is usually faster since it uses a sparse scheme, although in practice SVR with cross-validation is slow too. For GPR, the covariance matrix, whose size equals the number of training samples, must be inverted each time the hyperparameters are adapted. This is computationally very expensive for large data sets and makes it difficult for GPR models to deal with more than about one thousand training samples. Sparse approaches developed to reduce the computation time of GPR [22] could be applied for our purpose. Combined with parameterized kernels, sparse versions of GPR could provide a very powerful general regression system.
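To give an idea of how a sparse scheme reduces the cost, the sketch below compares the full GP predictive mean with a subset-of-regressors approximation that uses m inducing inputs. This is a generic sparse approximation chosen for brevity; it is not the method of Ref. [22], and the kernel, the inducing inputs and the data are assumptions for illustration.

```python
import numpy as np

def sq_exp(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length_scale**2)

def full_gp_mean(X, y, X_star, noise_var):
    """Exact predictive mean: solves an n x n system, O(n^3)."""
    K = sq_exp(X, X) + noise_var * np.eye(len(X))
    return sq_exp(X_star, X) @ np.linalg.solve(K, y)

def sor_gp_mean(X, y, X_star, X_u, noise_var):
    """Subset-of-regressors mean: solves an m x m system, O(n m^2)."""
    K_uu = sq_exp(X_u, X_u)
    K_uf = sq_exp(X_u, X)
    A = noise_var * K_uu + K_uf @ K_uf.T
    return sq_exp(X_star, X_u) @ np.linalg.solve(A, K_uf @ y)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 10.0, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + 0.05 * rng.normal(size=200)
X_u = np.linspace(0.0, 10.0, 15)[:, None]    # 15 inducing inputs instead of 200 points
X_star = np.array([[2.0], [5.0], [7.5]])

full = full_gp_mean(X, y, X_star, noise_var=0.05**2)
sparse = sor_gp_mean(X, y, X_star, X_u, noise_var=0.05**2)
print("full GP mean:  ", full)
print("sparse GP mean:", sparse)
```

When the inducing inputs coincide with all training inputs, the approximation reduces to the exact predictive mean; the saving comes from choosing m much smaller than n.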

4.3. Multi-output GPR

The dataset involves a multi-component calibration problem, that is, the need to determine multiple properties (response variables) of the analyte. The GPR study reported in this paper adopts a simplified solution which models each response independently. It is possible to define a Gaussian process with multiple response variables, but modeling multiple output variables is a challenge because the cross covariance between the different outputs must be computed. Boyle and Frean [23] applied the convolution process formalism to establish dependencies between output variables, where each latent function is represented as a Gaussian process. Dependent Gaussian processes should be particularly valuable in cases where one output is expensive to sample but covaries strongly with a second that is cheap. By inferring both the coupling and the independent aspects of the data, the cheap observations can be used as a proxy for the expensive ones. This is an important aspect for future research.

5. Conclusions and Future Work

Here, GPR has been used to provide quantitative information on mixtures containing TCP and DCP, and PLSR and SVR have been employed for comparison. Our results show that GPR is comparable to, and in some cases outperforms, the PLSR and SVR methods. It has been shown to be capable of providing reliable results, within a well-defined accuracy range, for a bundle of CPs in a single analytical run. The proposed method is a promising analytical tool for improving the reliability of the determination of different kinds of mixtures in environmental monitoring and analytical laboratories. In future work, we will study more effective methods to improve the regression results, focusing mainly on sparse approaches to multi-output Gaussian process regression for multi-component analysis.

6. Acknowledgement

This work is supported by the National Natural Science Foundation of China (No.61070133), the Natural Science and Technology Foundation of Jiangsu Province of China (No.BK210311) and the Natural Science Foundation of Jiangsu Province of China (No. BK2009697).

7. References

[1] Eberlin MN, da Silva RC, Faster and simpler determination of chlorophenols in water by fiber introduction mass spectrometry[J], Analytica Chimica Acta, 2008, 620(1-2):97-102
[2] Zhang JL, Song QJ, Hu X, Zhang EH, Gao H, Dye-sensitized phototransformation of chlorophenols and their subsequent chemiluminescence reactions[J], Journal of Luminescence, 2008, 128(12):1880-1885
[3] Higashi Y, Fujii Y, HPLC-UV Analysis of Phenol and Chlorophenols in Water After Precolumn Derivatization with 4-Fluoro-7-nitro-2,1,3-benzoxadiazole[J], Journal of Liquid Chromatography & Related Technologies, 2009, 32(16):2372-2383
[4] Diaz-Diaz G, Blanco-Lopez MC, Lobo-Castanon MJ, Miranda-Ordieres AJ, Tunon-Blanco P, Chloroperoxidase Modified Electrode for Amperometric Determination of 2,4,6-Trichlorophenol[J], Electroanalysis, 2009, 21(12):1348-1353
[5] El Ichi S, Marzouki MN, Korri-Youssoufi H, Direct monitoring of pollutants based on an electrochemical biosensor with novel peroxidase (POX1B)[J], Biosens. Bioelectron., 2009, 24(10):3084-3090
[6] Brereton R.G., Introduction to multivariate calibration in analytical chemistry[J], Analyst, 2000, 125(11):2125-2154
[7] Geladi P., Some recent trends in the calibration literature[J], Chemom. Intell. Lab. Syst., 2002, 60(1-2):211-224
[8] K. Bessant, S. Saini, A chemometric analysis of dual pulse staircase voltammograms obtained in mixtures of ethanol, fructose and glucose, Journal of Electroanalytical Chemistry, 2000, 489(12):76-83
[9] J. Saurina, S.H. Cassou, E. Fabregas, S. Alegret, Cyclic voltammetric simultaneous determination of oxidizable amino acids using multivariate calibration methods[J], Analytica Chimica Acta, 2000, 405(1-2):153-160
[10] Y. Ni, L. Wang, S. Kokot, Simultaneous determination of nitrobenzene and nitro-substituted phenols by differential pulse voltammetry and chemometrics[J], Analytica Chimica Acta, 2001, 431(1):101-113
[11] A. Herrero, M.C. Ortiz, Modelling the background current with partial least squares regression and transference of the calibration models in the simultaneous determination of Tl and Pb by stripping voltammetry[J], Talanta, 1998, 46(1):129-138
[12] B. Scholkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, USA, 2001
[13] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines[J], Advances in Computational Mathematics, 2000, 13(1):1-50
[14] Peter Lykos, The Beer-Lambert law revisited: A development without calculus[J], Journal of Chemical Education, 1992, 69(9):730
[15] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, USA, 1995
[16] V. Vapnik, Statistical Learning Theory, Wiley, New York, USA, 1998
[17] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge Univ. Press, Cambridge, UK, 2000
[18] B. Schölkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, UK, 2002
[19] J.A.K. Suykens, T. van Gestel, J. de Brabanter, B. de Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002
[20] R.M. Neal, Monte Carlo implementation of Gaussian process models for Bayesian regression and classification, Technical Report No. 9702, Department of Statistics, University of Toronto, Canada, 1997
[21] R. Rosipal, L.J. Trejo, Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space[J], Journal of Machine Learning Research, 2001, 2:97-123
[22] Csato L, Opper M, Sparse representation for Gaussian process models[J], Advances in Neural Information Processing Systems, 2001, 13:444-450
[23] P. Boyle, M. Frean, Dependent Gaussian processes, Advances in Neural Information Processing Systems, The MIT Press, 2005, 17:217-224
