Application of Canonical Correlation Analysis in Student Score Analysis Based on Data Analysis

Lu Dai, Jie Chen, Sanping Li, and Shixun Dai

Shaanxi Normal University, Xi’an, China
Université de Technologie de Troyes, Troyes, France
{dailu,lisanping,sxdai}@snnu.edu.cn, [email protected]
Abstract. Student score analysis is an important aspect of educational research. Multivariate methods are essential to teachers and administrators who want to extract more information from available score data. Canonical correlation analysis is well suited to research problems that involve two sets of multiple variables. In this paper, we discuss the principle and application of canonical correlation analysis in the context of student score analysis. We also apply this approach to collected score data and report analysis results that may be useful to teachers and administrators.

Keywords: Canonical correlation analysis, score analysis, educational research.
S. Lin and X. Huang (Eds.): CSEE 2011, Part IV, CCIS 217, pp. 481–485, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

It has long been recognized that questions about education require statistical methods able to analyze many interrelated variables simultaneously, because human behavior can be understood only by examining many variables at the same time, not by dealing with variables one by one. The use of multivariate methods in the evaluation of educational effectiveness is essential to administrators and faculty members who view the educational process as a multifaceted and complex system of variables. This is also the case in student score analysis. For example, an administrator may have five records of basic course scores in one variable set and four records of major course scores in the other. The research question of interest would be whether there is a relationship between the two classes of courses represented by these variable sets. Because each set contains more than one variable, traditional correlation analysis or multiple regression cannot handle the problem, and canonical correlation analysis may be needed [1].

Canonical correlation analysis (CCA) is most appropriate when the research problem has multiple predictors and multiple criterion (outcome) variables [1], which is usually the case in the real world of education. Several researchers have explored applications of CCA to educational problems. For example, in [2], CCA was used to investigate the effects of school inputs, environmental inputs and gender on a joint educational production function in mathematics and science for eighth-grade students in Malaysia. In [3], the authors examined how a set of three classroom community variables was related to a set of two student learning variables in a predominantly White sample of 108 online African American and Caucasian graduate students. In [4], CCA was used to investigate the association between how students evaluate the course and how they evaluate the teacher. CCA thus has a number of applications, as it is a general method for analyzing relationships between two multidimensional variables.

In this paper, we focus on the application of CCA to student score analysis. First, we review and discuss the principle of CCA. We then give the operational steps of CCA for score analysis. Next, the analysis is carried out on collected college student scores, which are divided into two classes. Finally, we give some suggestions for teaching and studying based on the analysis.
2 Theoretical Foundations of Canonical Correlation Analysis

Canonical correlation analysis can be seen as the problem of finding two sets of basis vectors, one for $x = (x_1, \ldots, x_p)^T$ and the other for $y = (y_1, \ldots, y_q)^T$, such that the correlations between the projections of the variables onto these basis vectors are mutually maximized. Considering the linear combinations $u = \alpha^T x$ and $v = \beta^T y$ of the two variables, this means maximizing the correlation between $u$ and $v$:

\[
\rho = \max_{\alpha,\beta} \frac{\alpha^T C_{xy} \beta}{\sqrt{\alpha^T C_{xx} \alpha \;\, \beta^T C_{yy} \beta}}
\]

where $C_{xx}$ and $C_{yy}$ denote the covariance matrices of $x$ and $y$, and $C_{xy}$ denotes the covariance matrix between $x$ and $y$. The maximum of $\rho$ with respect to $\alpha$ and $\beta$ is the maximum canonical correlation. The random variables $u = \alpha^T x$ and $v = \beta^T y$ are the first pair of canonical variables. We then seek vectors that maximize the same correlation subject to the constraint that the new variables are uncorrelated with the first pair of canonical variables, which gives the second pair of canonical variables. This procedure may be continued to obtain further pairs.
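The link between this maximization and the eigenvalue computation used in Section 3 can be sketched as follows. This is a standard derivation given only as a reading aid: it assumes that $C_{xx}$ and $C_{yy}$ are invertible and fixes the scale of $\alpha$ and $\beta$ by unit-variance constraints. The multipliers $\mu$ and $\nu$ are notation introduced here and do not appear elsewhere in the paper.

```latex
\begin{align*}
&\max_{\alpha,\beta}\; \alpha^T C_{xy}\beta
  \quad \text{subject to} \quad
  \alpha^T C_{xx}\alpha = 1, \;\; \beta^T C_{yy}\beta = 1, \\[4pt]
&L(\alpha,\beta,\mu,\nu) = \alpha^T C_{xy}\beta
  - \tfrac{\mu}{2}\left(\alpha^T C_{xx}\alpha - 1\right)
  - \tfrac{\nu}{2}\left(\beta^T C_{yy}\beta - 1\right), \\[4pt]
&\frac{\partial L}{\partial \alpha} = 0 \;\Rightarrow\; C_{xy}\beta = \mu\, C_{xx}\alpha,
  \qquad
  \frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; C_{yx}\alpha = \nu\, C_{yy}\beta, \\[4pt]
&\alpha^T C_{xy}\beta = \mu = \nu = \rho
  \quad \text{(premultiply by } \alpha^T \text{ and } \beta^T \text{ and use the constraints)}, \\[4pt]
&C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx}\,\alpha = \rho^2 \alpha,
  \qquad
  C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy}\,\beta = \rho^2 \beta .
\end{align*}
```

Under these assumptions, the squared canonical correlations are the eigenvalues of the two matrix products in the last line, and the weight vectors $\alpha$ and $\beta$ are the corresponding eigenvectors.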
3 Applications of CCA in Student Score Analysis

Numerical scores are commonly obtained after the administration of teacher-made tests. With the development of science and education, however, universities now teach a large number of courses, and the corresponding scores have complex inherent relationships. Given a sufficient store of score data, teachers should not only study these data with simple statistical operations but also investigate the information hidden in them. Canonical correlation analysis is a powerful tool for studying the relationship between the scores of different classes of courses, for example, the relation between basic courses and major courses. Such analysis helps teachers evaluate the effectiveness of their teaching, report the results of measurements to students and administrators, and determine future teaching plans based on the interpretation of the results.

Based on the theoretical principle of canonical correlation analysis, we give the steps for student score analysis below:

Step 1: Determine the two classes of courses to be analyzed (e.g. basic courses and major courses). Suppose that the scores of the first class are recorded in a table $X$ of dimension $n \times p$, meaning that the data consist of the scores of $n$ students in $p$ courses. The scores of the second class are recorded in a table $Y$ of dimension $n \times q$, meaning that the data consist of the scores of the same $n$ students in $q$ courses.

Step 2: Calculate the covariance matrix of the scores in the first class by $C_{xx} = X^T X / n$ and that of the second class by $C_{yy} = Y^T Y / n$, where the columns of $X$ and $Y$ are taken to be centered (the course mean subtracted from each score). Calculate the covariance matrix between the two classes by $C_{xy} = X^T Y / n$. Note that $C_{yx} = C_{xy}^T$. To prevent mean values and variances from affecting the analysis results, these matrices are usually transformed to their normalized forms, that is, the correlation matrices of the data, denoted by $R_{xx}$, $R_{yy}$ and $R_{xy}$ respectively.

Step 3: Calculate the matrix $A = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}$ and the matrix $B = R_{yy}^{-1} R_{yx} R_{xx}^{-1} R_{xy}$.

Step 4: Calculate the eigenvalues and eigenvectors of $A$ and $B$. These two matrices have the same non-zero eigenvalues, which equal the squared correlation coefficients of the canonical variables. The associated eigenvectors give the coefficients of the linear combinations.

Step 5: Interpret the relationship between the scores of the different courses using the analysis results obtained.

One can refer to [5] for more detailed mathematical steps. With mathematical and statistical software tools (e.g. Matlab and SPSS), the calculations of CCA can be carried out with little effort.
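Steps 1 to 4 can also be sketched in a few lines of Python with NumPy. The following is a minimal illustration, not the implementation used in this paper: the function name `cca_steps` is hypothetical, and the score tables are synthetic data generated only to exercise the code (they are not the scores analyzed in Section 4).

```python
import numpy as np

def cca_steps(X, Y):
    """Canonical correlations and weights for score tables X (n x p) and Y (n x q)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    p, q = X.shape[1], Y.shape[1]
    k = min(p, q)  # number of canonical pairs

    # Step 2: correlation matrix of the stacked table [X | Y], split into blocks.
    R = np.corrcoef(np.hstack([X, Y]), rowvar=False)
    Rxx, Rxy = R[:p, :p], R[:p, p:]
    Ryx, Ryy = R[p:, :p], R[p:, p:]

    # Step 3: A = Rxx^{-1} Rxy Ryy^{-1} Ryx and B = Ryy^{-1} Ryx Rxx^{-1} Rxy.
    H = np.linalg.solve(Rxx, Rxy)  # Rxx^{-1} Rxy
    G = np.linalg.solve(Ryy, Ryx)  # Ryy^{-1} Ryx
    A, B = H @ G, G @ H

    # Step 4: eigen-decomposition; keep the k largest eigenvalues of each matrix.
    eval_a, vec_a = np.linalg.eig(A)
    eval_b, vec_b = np.linalg.eig(B)
    order_a = np.argsort(-eval_a.real)[:k]
    order_b = np.argsort(-eval_b.real)[:k]
    rho = np.sqrt(np.clip(eval_a.real[order_a], 0.0, 1.0))
    alpha = vec_a.real[:, order_a]  # weights for the standardized X variables
    beta = vec_b.real[:, order_b]   # weights for the standardized Y variables
    return rho, alpha, beta

# Synthetic example (assumption): 76 students, 6 basic and 5 major courses,
# all scores driven by one shared latent "ability" plus independent noise.
rng = np.random.default_rng(0)
n = 76
ability = rng.normal(size=(n, 1))
X = 70 + 8 * ability + rng.normal(scale=6, size=(n, 6))
Y = 72 + 6 * ability + rng.normal(scale=7, size=(n, 5))

rho, alpha, beta = cca_steps(X, Y)

# Sanity check: the first pair of canonical variables, built from standardized
# scores, should have a correlation equal to rho[0] up to sign.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
u1, v1 = Xs @ alpha[:, 0], Ys @ beta[:, 0]
check = abs(np.corrcoef(u1, v1)[0, 1])

print("canonical correlations:", np.round(rho, 4))
print("corr(u1, v1):", round(check, 4))
```

The sketch uses `np.linalg.solve` instead of forming explicit inverses, which is usually the numerically safer choice. Eigenvectors are determined only up to sign and scale, so the weights should be compared by their relative sizes within each column.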
4 Case Study

In this section, we illustrate the analysis with real data: the scores of 76 college students in a number of courses taken during their first and second academic years. First, we divide these courses into two classes. The first class consists of (1) X1: mathematical analysis I, (2) X2: mathematical analysis II, (3) X3: linear algebra, (4) X4: discrete mathematics, (5) X5: probability theory, (6) X6: complex functions, which are basic mathematical courses. The second class consists of (1) Y1: physical experiments, (2) Y2: circuit analysis, (3) Y3: program design, (4) Y4: analog circuits, (5) Y5: data structure, which are the students' major courses. Following Steps 1 and 2 of the previous section, the correlation matrices calculated from the collected scores are as follows:
\[
R_{11} = \begin{pmatrix}
1.0000 & 0.8390 & 0.7461 & 0.7174 & 0.7597 & 0.6738 \\
0.8390 & 1.0000 & 0.7361 & 0.7548 & 0.7442 & 0.6833 \\
0.7461 & 0.7361 & 1.0000 & 0.7244 & 0.7577 & 0.6454 \\
0.7174 & 0.7548 & 0.7244 & 1.0000 & 0.7493 & 0.7449 \\
0.7597 & 0.7442 & 0.7577 & 0.7493 & 1.0000 & 0.7890 \\
0.6738 & 0.6833 & 0.6454 & 0.7449 & 0.7890 & 1.0000
\end{pmatrix},
\]

\[
R_{22} = \begin{pmatrix}
1.0000 & 0.5823 & 0.5191 & 0.6097 & 0.5989 \\
0.5823 & 1.0000 & 0.7326 & 0.6421 & 0.7682 \\
0.5191 & 0.7326 & 1.0000 & 0.6515 & 0.7902 \\
0.6097 & 0.6421 & 0.6515 & 1.0000 & 0.8132 \\
0.5989 & 0.7682 & 0.7902 & 0.8132 & 1.0000
\end{pmatrix},
\]

\[
R_{12} = \begin{pmatrix}
0.5162 & 0.6559 & 0.6461 & 0.6371 & 0.6298 \\
0.6006 & 0.7295 & 0.7189 & 0.5597 & 0.6415 \\
0.5047 & 0.5993 & 0.6821 & 0.6493 & 0.6692 \\
0.5655 & 0.7408 & 0.7703 & 0.7396 & 0.7829 \\
0.6509 & 0.7709 & 0.7240 & 0.7740 & 0.7858 \\
0.6810 & 0.7708 & 0.7038 & 0.7438 & 0.8210
\end{pmatrix},
\]

and $R_{21} = R_{12}^T$. Here $R_{11}$, $R_{22}$ and $R_{12}$ play the roles of $R_{xx}$, $R_{yy}$ and $R_{xy}$ in Section 3.
Following Step 3, the matrices $A$ and $B$ are, respectively,

\[
A = \begin{pmatrix}
-0.0214 & -0.1306 & -0.0330 & -0.0387 & -0.0359 & -0.0844 \\
0.0790 & 0.3103 & 0.0273 & 0.0461 & 0.0511 & 0.0798 \\
-0.0064 & -0.0382 & 0.0543 & 0.0282 & -0.0122 & -0.0177 \\
0.2353 & 0.1829 & 0.2642 & 0.2999 & 0.2598 & 0.2300 \\
0.2620 & 0.2323 & 0.2287 & 0.2780 & 0.3236 & 0.3090 \\
0.1959 & 0.2282 & 0.2040 & 0.2520 & 0.2973 & 0.3623
\end{pmatrix},
\]

\[
B = \begin{pmatrix}
0.1922 & 0.1780 & 0.1268 & 0.0579 & 0.1231 \\
0.2565 & 0.3100 & 0.2312 & 0.1775 & 0.2145 \\
0.1020 & 0.1659 & 0.2505 & 0.0997 & 0.1268 \\
0.0615 & 0.1310 & 0.1584 & 0.3353 & 0.2287 \\
0.1550 & 0.1289 & 0.1072 & 0.1979 & 0.2412
\end{pmatrix}.
\]
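As a check on Steps 3 and 4, the matrices $A$ and $B$ can be recomputed from the correlation matrices reported in this section. The sketch below is only an illustration: the entries are transcribed from the published matrices, which are rounded to four decimals, so the recomputed values should be expected to agree with the reported ones only approximately. The variable names are chosen here for convenience.

```python
import numpy as np

# Correlation matrices as reported in this section (rounded to 4 decimals).
R11 = np.array([
    [1.0000, 0.8390, 0.7461, 0.7174, 0.7597, 0.6738],
    [0.8390, 1.0000, 0.7361, 0.7548, 0.7442, 0.6833],
    [0.7461, 0.7361, 1.0000, 0.7244, 0.7577, 0.6454],
    [0.7174, 0.7548, 0.7244, 1.0000, 0.7493, 0.7449],
    [0.7597, 0.7442, 0.7577, 0.7493, 1.0000, 0.7890],
    [0.6738, 0.6833, 0.6454, 0.7449, 0.7890, 1.0000],
])
R22 = np.array([
    [1.0000, 0.5823, 0.5191, 0.6097, 0.5989],
    [0.5823, 1.0000, 0.7326, 0.6421, 0.7682],
    [0.5191, 0.7326, 1.0000, 0.6515, 0.7902],
    [0.6097, 0.6421, 0.6515, 1.0000, 0.8132],
    [0.5989, 0.7682, 0.7902, 0.8132, 1.0000],
])
R12 = np.array([
    [0.5162, 0.6559, 0.6461, 0.6371, 0.6298],
    [0.6006, 0.7295, 0.7189, 0.5597, 0.6415],
    [0.5047, 0.5993, 0.6821, 0.6493, 0.6692],
    [0.5655, 0.7408, 0.7703, 0.7396, 0.7829],
    [0.6509, 0.7709, 0.7240, 0.7740, 0.7858],
    [0.6810, 0.7708, 0.7038, 0.7438, 0.8210],
])
R21 = R12.T

# Step 3: A = R11^{-1} R12 R22^{-1} R21 and B = R22^{-1} R21 R11^{-1} R12.
H = np.linalg.solve(R11, R12)  # R11^{-1} R12, shape (6, 5)
G = np.linalg.solve(R22, R21)  # R22^{-1} R21, shape (5, 6)
A = H @ G
B = G @ H

# Step 4: eigenvalues sorted in decreasing order. A is 6 x 6 with rank at
# most 5, so its sixth eigenvalue is zero up to rounding error.
eig_A = np.sort(np.linalg.eigvals(A).real)[::-1]
eig_B = np.sort(np.linalg.eigvals(B).real)[::-1]

np.set_printoptions(precision=4, suppress=True)
print("A =\n", A)
print("B =\n", B)
print("eigenvalues of A:", eig_A)
print("eigenvalues of B:", eig_B)
```

The five largest eigenvalues of $A$ and the five eigenvalues of $B$ coincide, as stated in Step 4, and their sum equals the trace of either matrix.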
The non-zero eigenvalues of $A$ and $B$ are given by $(\lambda_1, \ldots, \lambda_5) = (0.8786, 0.2567, 0.1228, 0.0598, 0.0112)$.

We now analyze these results. First, we notice that the two largest eigenvalues together account for more than 85% of the sum of all eigenvalues. They therefore represent most of the information in this problem, and we concentrate on interpreting them. The associated eigenvectors are shown in Fig. 1.

Fig. 1. CCA analysis result

The first eigenvalue is $\lambda_1 = 0.8786$. Its square root $\rho_1 \approx 0.937$ is the correlation coefficient between the first pair of canonical variables $u_1$ and $v_1$. The weights of X4 (discrete mathematics), X5 (probability theory) and X6 (complex functions) are the most important in $u_1$, while the weights in $v_1$ do not differ greatly. This means that these three mathematical courses are closely related to the major courses.

The second eigenvalue is $\lambda_2 = 0.2567$, whose square root is $\rho_2 = 0.5067$. The second pair of canonical variables is less correlated than the first pair. The weights of the variables tell us that X2 (mathematical analysis II) is highly correlated with Y4 (analog circuits). It is also interesting to notice that X2 has negative effects on Y1, Y2 and Y3. Referring to the course schedule, we found that these courses were arranged during the same semester, which means that when students spent more time on one course they might have had less time to revise the others. This suggests that teachers should avoid scheduling several examinations too close together and should leave students adequate time to review, so as to obtain fairer results.
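The arithmetic behind this discussion can be reproduced from the reported eigenvalues alone. The short sketch below is an illustration with variable names chosen here; it computes the share of each eigenvalue in the total, the cumulative share, and the canonical correlations as square roots of the eigenvalues.

```python
import numpy as np

# Eigenvalues reported in the case study.
eigenvalues = np.array([0.8786, 0.2567, 0.1228, 0.0598, 0.0112])

share = eigenvalues / eigenvalues.sum()   # proportion of the eigenvalue sum
cumulative = np.cumsum(share)             # cumulative proportion
canonical_corr = np.sqrt(eigenvalues)     # rho_i = sqrt(lambda_i)

for i, (lam, s, c, r) in enumerate(
        zip(eigenvalues, share, cumulative, canonical_corr), start=1):
    print(f"pair {i}: lambda={lam:.4f}  share={s:.3f}  "
          f"cumulative={c:.3f}  rho={r:.4f}")
```

The cumulative share after the second pair is slightly above 0.85, which is the basis for restricting the interpretation to the first two pairs of canonical variables.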
5 Conclusion

Canonical correlation analysis is a statistical tool for analyzing the relationship between two sets of variables. It has found applications in educational research. In this paper, we discussed its principle and explored its application to student score analysis. An example based on collected data was given to illustrate the effectiveness of CCA in this application.
References

1. Sherry, A., Henson, R.K.: Conducting and interpreting canonical correlation analysis in personality research: A user-friendly primer. Journal of Personality Assessment 84(1), 37–48 (2005)
2. Ismail, N.A., Cheng, A.G.: Analyzing education production in Malaysia using canonical correlation analysis. International Education Journal 6(3), 308–315 (2005)
3. Rovai, A.P., Ponton, M.K.: An examination of sense of classroom community and learning among African American and Caucasian graduate students. Journal of Asynchronous Learning Networks 9(3), 77–92 (2005)
4. Sliusarenko, T., Clemmensen, L.K.H.: Canonical correlation analysis in education: associations between student evaluations of courses and instructors. In: Proceedings of the XXVth International Biometric Conference (IBS) (2010)
5. Szedmak, S., Hardoon, D.R., Shawe-Taylor, J.: Canonical correlation analysis: An overview with application to learning methods. Tech. Rep., Department of Computer Science, Royal Holloway, University of London (2003)