Applied Mathematics and Computation 223 (2013) 476–482


Applied Mathematics and Computation journal homepage: www.elsevier.com/locate/amc

Functional data analysis as a tool to correlate textural and geochemical data

C. Ordóñez a,⇑, C. Sierra a, T. Albuquerque b, J.R. Gallego a

a Department of Mining Exploitation and Prospecting, Polytechnic School of Mieres, University of Oviedo, Campus de Mieres, Gonzalo Gutiérrez, s/n, 33600 Mieres, Asturias, Spain
b IPCB-CVRM, Polytechnic Institute of Castelo Branco, IPCB, Geo-Systems Centre, ISTUTL, Lisbon, Portugal

Article info

Keywords: Functional data analysis; Granulometric curve; Element concentration

Abstract

This paper discusses the use of functional data analysis to determine the interactions between chemical composition and grain size. For a case study, the proposed method is compared with conventional approaches that determine the coefficient of correlation between the concentration of each chemical element in a bulk sample and representative points of the granulometric curve. The results obtained by the two approaches are consistent and indicate an increasing concentration of certain metals in the fine particle-size fractions. Moreover, functional data analysis provides useful additional information on the relationship between the concentration of the chemical elements in bulk samples and the particle size distribution. © 2013 Elsevier Inc. All rights reserved.

1. Introduction

Grain-size distribution can offer information about several features of a sediment, thus providing evidence of sedimentary origin (allochthonous, autochthonous, anthropogenic), transport conditions (rolling, creeping, saltation, flotation, dissolution, suspension, etc.), or even chemical composition [1,2]. Sediment texture can be correlated with its chemical composition and genesis, since the physical and chemical weathering of rocks, combined with selective segregation during transport and deposition, causes particular materials to accumulate in certain grain-size fractions [3,4]. Thus, during grain-size fractionation, coarser fractions commonly have a higher concentration of SiO2 or calcium oxide, while the finest fraction may be enriched in alkalis such as Na and K, as well as Al2O3, MgO and TiO. Similarly, there is a close relationship between decreasing grain size and pollutant enrichment [5]. This occurs because the surface area per unit mass increases as grain size decreases; sediments retain metals and other contaminants through surface-type chemical reactions, so smaller grains are more chemically active [6,7]. As a result, some recovery and remediation techniques exploit heavy metal enrichment within the fine fractions [8].

Various parameters, such as particle size and the concentration of certain chemical elements like Al, Fe, and Ti (which are used as proxies for grain size because they tend to co-vary with it), are commonly used in geochemistry as normalizers in order to compensate for the variability generated by mineralogical composition and to facilitate comparisons between areas with different grain sizes [9,10]. These elements are characterized by presenting a notable linear correlation

⇑ Corresponding author. E-mail addresses: [email protected] (C. Ordóñez), [email protected] (C. Sierra), [email protected] (T. Albuquerque), [email protected] (J.R. Gallego). 0096-3003/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.amc.2013.08.032

C. Ordóñez et al. / Applied Mathematics and Computation 223 (2013) 476–482


between grain size and concentration. If normalization is not carried out, erroneous conclusions can be reached, attributing high concentrations of pollutants to an anthropogenic source when they are in fact due to grain size. In the absence of grain-size variations, geochemical data of siliciclastic sediments may be used to characterize size-independent processes such as sediment provenance, weathering, mixing or diagenesis [11].

For all these reasons, it is of interest to determine the chemical composition of distinct grain-size fractions. The standard procedure consists of sieving (size fractionation of the soil) and subsequent chemical analysis of each fraction, but this is laborious, favors contamination during manipulation and multiplies the analytical costs. It is therefore pertinent to establish correlations between particle size and the chemical composition of the bulk sample. This procedure is based on descriptive parameters of the particle size distribution, such as the mean or median grain size [12], the quartile diameters (d75, d50 and d25), the percentile diameters d10, d50 and d90 (the grain sizes below which 10%, 50% and 90% of the sediment lies), and even the sand, silt and clay content. Such studies are almost always performed using Pearson's correlation coefficient [13,14], and can be supported by other techniques such as principal component analysis and other methods of factor analysis [15]. However, all these procedures share the disadvantage of treating granulometric parameters as discrete variables, so a great deal of the information held in the granulometric curve is wasted, particularly when the data have been obtained by modern laser diffraction methods and the curve is defined by dozens of points.
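The conventional discrete approach described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function names `percentile_diameter` and `pearson_r`, the linear interpolation on the cumulative curve, and the three toy samples are all hypothetical and are not taken from the paper.

```python
import math

def percentile_diameter(sizes, volume_pct, q):
    """Grain size below which q percent of the sample volume lies.

    Assumption of this sketch: each volume fraction is treated as lying
    below its listed size, and the cumulative curve is interpolated linearly.
    """
    total = sum(volume_pct)
    prev_size, prev_cum = sizes[0], 0.0
    cumulative = 0.0
    for size, vol in zip(sizes, volume_pct):
        cumulative += 100.0 * vol / total
        if cumulative >= q:
            if cumulative == prev_cum:
                return size
            frac = (q - prev_cum) / (cumulative - prev_cum)
            return prev_size + frac * (size - prev_size)
        prev_size, prev_cum = size, cumulative
    return sizes[-1]

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: three samples on a common size grid (um), with their
# volume percentages and a made-up bulk concentration (mg/kg) for each.
sizes = [100.0, 200.0, 300.0, 400.0]
samples = [[40.0, 30.0, 20.0, 10.0],
           [25.0, 25.0, 25.0, 25.0],
           [10.0, 20.0, 30.0, 40.0]]
concentration = [60.0, 35.0, 20.0]

d50 = [percentile_diameter(sizes, volumes, 50.0) for volumes in samples]
r_d50 = pearson_r(d50, concentration)
```

Each sample is reduced to a single number (here d50) before the correlation is computed, which is exactly the loss of information that motivates the functional approach.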
Here we discuss and compare these methods of correlation between the grain size and the chemical composition of the bulk sample with other approaches based on functional data analysis. Functional data analysis has recently been applied to other kinds of problems, such as environmental pollution [17], GPS measurement in forest environments [16], and the analysis of multispectral data [18], among others. The usefulness of the proposed method is tested for alluvial detrital sediment sampled in an area notably affected by heavy metal pollution. We consider its applicability for selecting normalizer elements, as well as its efficacy for identifying and describing the behavior of contaminants and natural allochthonous inputs.

2. Materials and methods

2.1. Data collection and chemical analyses

Samples were collected in the Avilés Estuary (North of Spain), where heavy industrial activity has contaminated the sediments for several decades [15,19]. A total of 54 bed-deposited sediment samples from the upper 0–20 cm beneath a water depth of about 20 cm were collected in the Llodero Cove area in January 2012. Samples were collected using a stratified systematic sampling method at short, medium and long distances in order to obtain a representative picture of the total variability of the variables. Fig. 1 shows the spatial distribution of the sample points. The samples were packed and sealed in pre-washed polyethylene bags and dried at room temperature. They were then divided into two sub-samples in order to determine their composition and grain-size distribution. Chemical analyses were performed by Inductively Coupled Plasma-Optical Emission Spectroscopy after sample leaching by means of an 'Aqua regia' digestion (HCl + HNO3). Grain-size distribution was obtained using a Laser Dispersion Particle Analyser (Beckman Coulter Inc.) after subjecting samples to a disaggregation treatment with two dispersants, namely sodium hexametaphosphate and sodium carbonate, and eliminating the organic matter with hydrogen peroxide and acetic acid.

Fig. 1. Geographic location of the study area (left). Spatial distribution of the samples (right).


2.2. Mathematical model

The aim of this study is to determine the relationship between the concentration of pollutants in a soil and the grain size of each sample. This relationship can be approached as a linear regression problem:

$$E(Y \mid X) = \alpha + \sum_{i=1}^{p} \beta_i X_i = \alpha + \langle X, \beta \rangle \tag{1}$$

where $Y$ is the response variable, $\alpha$ and $\beta = (\beta_1, \ldots, \beta_p)^T$ are the regression coefficients and $X = (X_1, \ldots, X_p)^T$ is the $p$-dimensional input variable. In the case analyzed, the dependent variable is a vector of real numbers, representing the concentration of a particular pollutant in the different samples, while the independent variable for each sample is given by a discrete set of values for an increasing sequence of grain sizes (the so-called grain-size curve of the sample). A functional linear regression model [20] for the particular case of scalar response and functional predictors is an extension of the multivariate linear regression model to the case of infinite-dimensional data. Then Eq. (1) is written as follows:

$$E(Y \mid X) = \alpha + \int_T X(s)\,\beta(s)\,ds = \alpha + \langle X, \beta \rangle \tag{2}$$

where the coefficient $\alpha$ is a scalar, and the input variable $X : T \to \mathbb{R}$ and the regression coefficient $\beta : T \to \mathbb{R}$ are represented as functions. Several approaches can be used to estimate the functions $X(s)$ and $\beta(s)$ from discrete points, smoothing by means of the decomposition in basis functions being one of the most frequent in functional data analysis [21]:

$$X(s) = \sum_{k=1}^{n} c_k \phi_k(s) = c^T \phi(s), \qquad \beta(s) = \sum_{p=1}^{m} b_p \psi_p(s) = b^T \psi(s) \tag{3}$$

where $c$ and $b$ are vectors of coefficients, and $\phi(s) = (\phi_1(s), \ldots, \phi_n(s))^T$ and $\psi(s) = (\psi_1(s), \ldots, \psi_m(s))^T$ are the basis functions. These functions can be of different types, such as polynomial, exponential, B-splines, Fourier functions, etc. By substituting the expressions in (3) into Eq. (2), an estimation of $Y$ can be expressed as follows:

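$$\hat{Y} = \hat{\alpha} + c^T J_{\phi\psi}\, \hat{b} \tag{4}$$

where the $n \times m$ matrix $J_{\phi\psi}$ is given by:

$$J_{\phi\psi} = \int_T \phi(s)\, \psi(s)^T\, ds = \langle \phi, \psi \rangle \tag{5}$$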
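The step from Eqs. (2) and (3) to Eq. (4) is a single substitution. Writing $\phi$ and $\psi$ for the two vectors of basis functions and $c$, $b$ for their coefficient vectors, the integral term can be worked out as follows; because $c$ and $b$ do not depend on $s$, they can be taken outside the integral:

```latex
\begin{aligned}
\int_T X(s)\,\beta(s)\,ds
  &= \int_T \bigl(c^T \phi(s)\bigr)\bigl(\psi(s)^T b\bigr)\,ds \\
  &= c^T \left( \int_T \phi(s)\,\psi(s)^T\,ds \right) b \\
  &= c^T J_{\phi\psi}\, b ,
\qquad\text{with } (J_{\phi\psi})_{kp} = \int_T \phi_k(s)\,\psi_p(s)\,ds .
\end{aligned}
```

The functional model therefore reduces to an ordinary linear model in the unknowns $\alpha$ and $b$, with the products $c^T J_{\phi\psi}$ playing the role of the covariates.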

When both expansions use the same orthonormal basis, such as Fourier functions, the matrix $J_{\phi\psi}$ reduces to the identity. Introducing the notation $\zeta = (\alpha, b_1, \ldots, b_m)^T$, $Z = [1, c^T J_{\phi\psi}]$, the prediction $\hat{Y}$ becomes simply

$$\hat{Y} = Z\zeta \tag{6}$$

The estimation of the vector $\zeta$ is obtained by minimizing the residual sum of squares

$$\mathrm{LMMSE}(\alpha, \beta) = \left\| Y - \alpha - \langle X, \beta \rangle \right\|^2_{L_2(T)} \tag{7}$$

using the least squares method, as follows:

$$\hat{\zeta} = (Z^T Z)^{-1} Z^T Y \tag{8}$$
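The estimation procedure of Eqs. (3) to (8) can be sketched numerically as follows. This is a minimal illustration, not the implementation used in the study: it assumes the domain has been rescaled to [0, 1], it swaps in an orthonormal cosine basis in place of B-splines to keep the code short, it uses trapezoidal quadrature for the integrals, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def cosine_basis(s, n_basis):
    """Orthonormal cosine basis on [0, 1]: 1, sqrt(2)cos(pi s), sqrt(2)cos(2 pi s), ..."""
    s = np.asarray(s, dtype=float)
    basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(s, np.arange(n_basis)))
    basis[:, 0] = 1.0
    return basis  # shape (len(s), n_basis)

def trapezoid_weights(s):
    """Weights w such that w @ f(s) approximates the integral of f over the grid."""
    s = np.asarray(s, dtype=float)
    w = np.zeros_like(s)
    steps = np.diff(s)
    w[:-1] += steps / 2.0
    w[1:] += steps / 2.0
    return w

def fit_functional_linear_model(curves, y, s, n_x=8, n_beta=4):
    """Least-squares estimate of alpha and beta(s); curves has one row per sample."""
    phi = cosine_basis(s, n_x)      # basis for the curves X(s)
    psi = cosine_basis(s, n_beta)   # basis for beta(s)
    # Eq. (3): coefficients c of every curve, one row per sample
    C = np.linalg.lstsq(phi, curves.T, rcond=None)[0].T
    # Eq. (5): J[k, p] approximates the integral of phi_k(s) * psi_p(s)
    J = phi.T @ (trapezoid_weights(s)[:, None] * psi)
    # Eq. (6): design matrix Z = [1, c^T J]
    Z = np.column_stack([np.ones(len(y)), C @ J])
    # Eq. (8): least-squares solution, computed without forming the inverse
    zeta = np.linalg.lstsq(Z, y, rcond=None)[0]
    alpha_hat, b_hat = zeta[0], zeta[1:]
    return alpha_hat, psi @ b_hat, J

# Synthetic check with a known alpha and beta(s) (all values are made up)
rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, 201)
curves = rng.normal(size=(40, 8)) @ cosine_basis(s, 8).T
alpha_true = 2.0
beta_true = cosine_basis(s, 4) @ np.array([-1.0, 0.5, 0.0, -0.3])
y = alpha_true + curves @ (trapezoid_weights(s) * beta_true)

alpha_hat, beta_hat, J = fit_functional_linear_model(curves, y, s)
```

Because the same orthonormal family is used for both expansions here, the leading block of `J` is numerically the identity, which illustrates the remark made before Eq. (6). With B-splines, `J` would be a general banded matrix and would have to be computed in full.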

However, minimizing the sum of squared residuals is not the only aim of the estimation. It is also often necessary to avoid excessive local fluctuation in the estimated function. Regularization with a roughness penalty can be used for this purpose. To this end, given any twice differentiable function $\beta$, it is possible to define the penalized residual sum of squares:

$$\mathrm{PENSSE}_\lambda(\alpha, \beta) = \left\| Y - \alpha - \int_T X(s)\,\beta(s)\,ds \right\|^2 + \lambda \int_T \left[ D^2 \beta(s) \right]^2 ds \tag{9}$$

where the operator $D^2$ represents the second derivative, and the smoothing parameter $\lambda > 0$ controls the trade-off between roughness and infidelity to the observed data [20]. The solution to the regularization problem with roughness penalty is analogous to that of Eq. (8):

$$\hat{\zeta} = (Z^T Z + \lambda R)^{-1} Z^T Y \tag{10}$$

$R$ being an $m \times m$ matrix with elements $R_{jk} = \int_T D^2\psi_j(s)\, D^2\psi_k(s)\, ds = \langle D^2\psi_j, D^2\psi_k \rangle_{L_2(T)}$.

The smoothing parameter $\lambda$ can be chosen either subjectively or by means of cross-validation. When the latter approach is used, a cross-validation score can be defined as:


Fig. 2. Differential distribution curves for the 53 samples collected. Smoothing was carried out using 15 cubic B-splines.

$$\mathrm{CV}(\lambda) = \sum_{j=1}^{N} \left( y_j - \alpha_\lambda^{(-j)} - \int_T X_j(s)\, \beta_\lambda^{(-j)}(s)\, ds \right)^2 \tag{11}$$

$N$ being the sample size and $\alpha_\lambda^{(-j)}$ and $\beta_\lambda^{(-j)}$ the estimates of $\alpha$ and $\beta$ obtained by minimizing the penalized residual sum of squares based on all the data except $(x_j, y_j)$. The smoothing parameter $\lambda$ is obtained by minimizing $\mathrm{CV}(\lambda)$.

3. Results and discussion

3.1. Sample descriptive statistics and Pearson's correlation

The functional regression model discussed in the previous section was applied to the data set collected in the field. Each sample has a concentration value for each chemical element and the corresponding grain-size curve (differential curve) on a set of discrete grain sizes. The whole grain-size curves were obtained by smoothing using 15 cubic B-splines, as shown in Fig. 2. The Y axis represents the percentage volume of the soil sample for each grain size. The parameters studied were the concentrations of the elements Al, Ti, Fe, Cd, Ca, and Na, the grain-size parameters d10, d50 and d90, and the mean grain size. The main statistical descriptors related to the chemical characterization and particle size are shown in Table 1. Textural characterization obtained by means of GRADISTAT v 10.0 classified the sediments as sands and muddy sands [22].

Table 1. Descriptive statistics for element concentration and grain size distribution.

Parameter   Minimum (mg/kg)   Maximum (mg/kg)   Mean (mg/kg)   Stand. deviation (mg/kg)
Al               20.00             75.00            32.06            11.89
Ti                2.00             16.00             4.09             2.70
Fe               65.00            775.00           131.63           126.16
Cd                5.00            137.00            35.91            28.28
Ca               20.00            320.00            81.13            53.24
Na                1.00            103.00            29.02            17.46
Mean            147.36            316.50           231.58            30.55
d10              40.25            220.87           141.69            46.67
d50             142.53            311.23           230.82            27.35
d90             221.30            424.06           319.89            32.10

Table 2. Pearson's correlation coefficient between pollutant concentration and representative parameters of the granulometric curve.

        Al         Ti        Fe        Cd         Ca         Na
d10   −0.796**   −0.380    −0.289*   −0.781**   −0.405**   −0.540**
d50   −0.725**   −0.268    −0.145    −0.558**   −0.416**   −0.583**
d90   −0.562**   −0.188    −0.090    −0.409**   −0.278*    −0.475**

* The correlation is significant at 0.05 (two-tailed).
** The correlation is significant at 0.01 (two-tailed).
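The smoothing of each grain-size curve with 15 cubic B-splines mentioned in Section 3.1 can be sketched as below. This is only an illustration under assumptions that the paper does not state: clamped knots with equally spaced interior knots, an unpenalized least-squares fit, and a synthetic differential curve. The function names `bspline_basis` and `smooth_curve` are hypothetical.

```python
import numpy as np

def bspline_basis(x, lower, upper, n_basis=15, degree=3):
    """Evaluate n_basis clamped B-splines on [lower, upper] at the points x.

    Cox-de Boor recursion with equally spaced interior knots (an assumption
    of this sketch). Returns an array of shape (len(x), n_basis).
    """
    x = np.asarray(x, dtype=float)
    interior = np.linspace(lower, upper, n_basis - degree + 1)[1:-1]
    t = np.concatenate([np.full(degree + 1, lower), interior,
                        np.full(degree + 1, upper)])
    # Degree 0: indicator function of each knot span [t_i, t_{i+1})
    B = np.zeros((x.size, t.size - 1))
    for i in range(t.size - 1):
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    B[x == upper, n_basis - 1] = 1.0  # close the last non-empty span on the right
    # Raise the degree one step at a time
    for d in range(1, degree + 1):
        B_next = np.zeros((x.size, t.size - 1 - d))
        for i in range(t.size - 1 - d):
            left = t[i + d] - t[i]
            right = t[i + d + 1] - t[i + 1]
            if left > 0:
                B_next[:, i] += (x - t[i]) / left * B[:, i]
            if right > 0:
                B_next[:, i] += (t[i + d + 1] - x) / right * B[:, i + 1]
        B = B_next
    return B

def smooth_curve(sizes, volume_pct, n_basis=15, degree=3):
    """Least-squares B-spline fit of one differential grain-size curve."""
    basis = bspline_basis(sizes, sizes[0], sizes[-1], n_basis, degree)
    coef = np.linalg.lstsq(basis, volume_pct, rcond=None)[0]
    return coef, basis @ coef

# Synthetic differential curve: a noisy bump centred at 230 um (made-up values)
rng = np.random.default_rng(0)
sizes = np.linspace(50.0, 400.0, 60)
raw = 8.0 * np.exp(-0.5 * ((sizes - 230.0) / 45.0) ** 2) + rng.normal(0.0, 0.2, sizes.size)
coef, smoothed = smooth_curve(sizes, raw)
```

The 15 entries of `coef` are the coefficients $c$ of Eq. (3) for one sample; stacking them for all samples gives the matrix that enters the design matrix $Z$.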

Fig. 3. Regression coefficient functions for the 6 chemical elements studied. Panels: Al, Ti, Fe, Cd, Ca, Na; x-axis: grain size (µm); y-axis: regression coefficient.


The relationships between element concentration and grain sizes were studied by means of Pearson's correlation coefficient. Table 2 shows the correlation coefficients and their significance levels for parameters d10, d50 and d90. The correlation coefficients are negative, which is consistent with the idea that the smaller the grain size, the greater the concentration of certain elements, particularly contaminants. Parameter d10 (10% of the sediment is smaller than this size), which represents the smaller grain sizes, was generally better correlated with the chemical composition than d50 (median grain size) and d90, except for Ca and Na, for which the absolute value of the correlation coefficient increased from d10 to d50 and then decreased again for d90.

3.2. Functional data analysis

In contrast, functional analysis offers a continuous representation of the regression coefficients for the different particle sizes, together with the 95% confidence lines (Fig. 3). In all cases the regression coefficients $\beta(s)$ were negative, and the larger the grain size, the smaller the coefficient in absolute value (except for Ca). This observation indicates that the influence of the granulometric curve, that is, the effect of the volume percentage of a given grain size on the concentrations of the elements, is higher in the finer fractions. Table 3 lists the coefficients of determination obtained for each chemical element. The optimum penalty factors for the 6 models were obtained by minimizing the cross-validation score, following Eq. (11). Fig. 4 shows the variation of this score for different values of the penalty coefficient corresponding to Cd. The minimum value corresponds to $\lambda = 10^{10}$. For the 5 remaining elements, similar optimum penalty factors were obtained. This result is consistent with the notion that a higher coefficient of determination would require lower penalty factors, which produce wiggly regression coefficient functions with no physical interpretation.

For the penalty factors determined, the highest coefficient of determination (RSQ) between grain size and composition was observed for Al (RSQ = 0.72), which is consistent with the common use of this metal as a normalizer [23,24]. Cd also showed a high correlation between composition and grain size (RSQ = 0.59), thus corroborating a clear relationship between increasing Cd concentrations and smaller grain sizes. The natural presence of Cd in the geologic substrate of the Avilés area is very low; however, there is a notable anthropogenic input of this metal promoted by the emissions of small grain-size particles from the metallurgical industry [15]. Ti, which represents silicate minerals, showed a less clear correlation with the grain size (RSQ = 0.36) than the former metals, thus suggesting that its use as a normalizer is less desirable than that of Al. In the same way, Fe, which is also frequently used as a normalizer [24,25], correlated poorly with the particle size (RSQ = 0.19). This finding is consistent with the fact that the use of this element as a normalizer can be compromised in anoxic environments, such as that of these sediments [26], and with the weak correlations obtained using Pearson's correlation coefficient (significant only for d10, at the 0.05 level).

Table 3. Coefficients of determination for the 6 elements studied.

Element   Cd     Al     Ca     Fe     Ti     Na
RSQ       0.59   0.72   0.48   0.19   0.36   0.41

Fig. 4. Variation of the cross-validation score with the penalty factor (element: Cd; x-axis: log10 of the smoothing parameter λ; y-axis: cross-validation score).
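The selection of the penalty factor by leave-one-out cross-validation, as in Eqs. (10) and (11), can be sketched as follows. This is a hedged illustration rather than the procedure actually coded for the study. Since the coefficient vector of Eq. (6) also contains the intercept, the sketch assumes that the penalty matrix is padded with a zero row and column so that the intercept is not penalized. The design matrix, the diagonal penalty (the form $R$ takes for an orthonormal cosine basis on [0, 1]), the grid of candidate values and all function names are hypothetical.

```python
import numpy as np

def penalized_fit(Z, y, lam, P):
    """Eq. (10): solve (Z^T Z + lam * P) zeta = Z^T y for zeta."""
    return np.linalg.solve(Z.T @ Z + lam * P, Z.T @ y)

def loo_cv_score(Z, y, lam, P):
    """Eq. (11): sum of squared leave-one-out prediction errors."""
    n = len(y)
    score = 0.0
    for j in range(n):
        keep = np.arange(n) != j                 # drop sample j
        zeta = penalized_fit(Z[keep], y[keep], lam, P)
        score += (y[j] - Z[j] @ zeta) ** 2       # predict the held-out sample
    return float(score)

def choose_lambda(Z, y, P, lambdas):
    """Return the candidate lambda with the smallest CV score, plus all scores."""
    scores = [loo_cv_score(Z, y, lam, P) for lam in lambdas]
    return lambdas[int(np.argmin(scores))], scores

# Synthetic example (all numbers are made up)
rng = np.random.default_rng(1)
n_samples, m = 30, 6
Z = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, m))])
# Penalty: zero for the intercept, (k * pi)^4 for the k-th cosine basis function
P = np.diag(np.concatenate([[0.0], (np.pi * np.arange(m)) ** 4]))
zeta_true = np.array([2.0, -1.0, 0.5, 0.2, 0.0, 0.0, 0.0])
y = Z @ zeta_true + 0.1 * rng.normal(size=n_samples)

lambdas = 10.0 ** np.arange(-8.0, 3.0)
best_lam, scores = choose_lambda(Z, y, P, lambdas)
```

Plotting `scores` against `np.log10(lambdas)` gives a curve of the kind shown in Fig. 4. The explicit loop refits the model N times, which is affordable for a few dozen samples.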


Elements such as Ca and Na showed a regression coefficient $\beta(s)$ with a different shape. Ca presented a non-significant coefficient for the extreme grain-size values and negative values for the intermediate ones. This irregularity can be attributed to the dismantling of the limestone bedrock in the vicinity, as well as to the multiple inputs produced by the remains of bivalves in the area (biogenic origin). The different behavior of Na, shown in the plot as a straight line, may be the consequence of the sea salt contribution to the sediment samples. This anomalous behavior of Ca and Na was also detected using Pearson's correlation coefficient, which did not increase with decreasing grain size.

4. Conclusions

We propose functional data analysis as an alternative approach to study the relationship between the concentrations of chemical elements and grain size, since it offers a continuous representation of the correlation between these two variables, unlike conventional procedures that focus only on discrete points of the granulometric curve. The procedure, tested on a marine detrital sediment, showed similar behavior for the 6 elements studied in terms of the correlation between concentration and grain size distribution, except for Ca, for which the relationship between concentration and the particle size distribution curve was significant only at intermediate grain sizes and not at the extremes. Al and Cd concentrations showed the highest correspondence with the granulometric curve, while Fe showed the lowest association. These results are consistent with those obtained using discrete parameters representing the particle size.

Acknowledgement

Carlos Sierra obtained a grant from the "Severo Ochoa" programme (Ficyt, Asturias, Spain).

References

[1] E.N. Bui, J. Mazullo, L.P. Wilding, Using quartz grain size and shape analysis to distinguish between aeolian and fluvial deposits in the Dallol Bosso of Niger (West Africa), Earth Surf. Processes Landforms 14 (1990) 157–166.
[2] T. Ohta, Geochemistry of Jurassic to earliest Cretaceous deposits in the Nagato Basin, SW Japan: implication of factor analysis to sorting effects and provenance signatures, Sediment. Geol. 171 (1–4) (2004) 159–180.
[3] P. Van De Kamp, B.E. Leake, Petrology, geochemistry, and alteration of Pennsylvanian-Permian arkose, Colorado and Utah, Geol. Soc. Am. Bull. 105 (1993) 1571–1582.
[4] C. Zhang, L. Wang, G. Li, S. Dong, J. Yang, X. Wang, Grain size effect on multi-element concentrations in sediments from the intertidal flats of Bohai Bay, China, Appl. Geochem. 17 (2002) 59–68.
[5] A.J. Horowitz, K. Elrick, The relation of stream sediment surface area, grain size, and composition to trace element chemistry, Appl. Geochem. 2 (1987) 437–451.
[6] A.J. Horowitz, A Primer on Sediment-trace Element Chemistry, second ed., Lewis Publishers, Chelsea, 1991.
[7] D. Secrieru, A. Secrieru, Heavy metal enrichment of man-made origin of superficial sediment on the continental shelf of the north-western Black Sea, Estuarine Coastal Shelf Sci. 54 (3) (2002) 513–526.
[8] C. Sierra, J.R. Gallego, E. Afif, J.M. Menéndez-Aguado, F. González-Coto, Analysis of soil washing effectiveness to remediate a brownfield polluted with pyrite ashes, J. Hazard. Mater. 180 (2010) 602–608.
[9] W.X. Liu, X.D. Li, Z.G. Shen, D.C. Wang, O.W. Wai, Y.S. Li, Multivariate statistical study of heavy metal enrichment in sediments of the Pearl River Estuary, Environ. Pollut. 121 (3) (2003) 377–388.
[10] M. Sierra, F.J. Martínez, J. Aguilar, Baselines for trace elements and evaluation of environmental risk in soils of Almería (SE Spain), Geoderma 139 (2007) 209–219.
[11] M.R. Bloemsma, M. Zabel, J.B.W. Stuut, R. Tjallingii, J.A. Collins, G.J. Weltje, Modelling the joint variability of grain size and chemical composition in sediments, Sediment. Geol. 280 (2012) 135–148.
[12] C. Viscosi-Shirley, K. Mammone, N. Pisias, J. Dymond, Clay mineralogy and multi element chemistry of surface sediments on the Siberian-Arctic shelf: implications for sediment provenance and grain size sorting, Cont. Shelf Res. 23 (2003) 1175–1200.
[13] B. Rubio, M.A. Nombela, F. Vilas, Geochemistry of major and trace elements in sediments of the Ria de Vigo (NW Spain): an assessment of metal pollution, Mar. Pollut. Bull. 40 (11) (2000) 968–980.
[14] C. Zhang, L. Wang, G. Li, S. Dong, J. Yang, X. Wang, Grain size effect on multi-element concentrations in sediments from the intertidal flats of Bohai Bay, China, Appl. Geochem. 17 (1) (2002) 59–68.
[15] J.R. Gallego, A. Ordóñez, J. Loredo, Investigation of trace element sources from an industrialized area (Avilés, northern Spain) using multivariate statistical methods, Environ. Int. 27 (2002) 589–596.
[16] C. Ordóñez, J. Martínez, J.F. de Cos Juez, F. Sánchez Lasheras, Comparison of GPS observations made in a forestry setting using functional data analysis, Int. J. Comput. Math. 89 (3) (2012) 402–408.
[17] C. Ordóñez, J. Martínez, A. Saavedra, A. Mourelle, Intercomparison exercise for gases emitted by a cement industry in Spain: a functional data approach, J. Air Waste Manage. Assoc. 61 (2) (2011) 135–141.
[18] L. Sugianto, L. Shawn, Functional data analysis of multi-angular hyperspectral data on vegetation, Int. J. Sci. Technol. 1 (1) (2012) 30–39.
[19] J.M. Leal, Guía didáctica del puerto de Avilés, Tórculo Ediciones, Santiago de Compostela, 2007.
[20] J.O. Ramsay, B.W. Silverman, Applied Functional Data Analysis, Springer-Verlag, New York, 2005.
[21] J.S. Simonoff, Smoothing Methods in Statistics, Springer, New York, 1996.
[22] S.J. Blott, K. Pye, GRADISTAT: a grain size distribution and statistics package for the analysis of unconsolidated sediments, Earth Surf. Processes Landforms 26 (11) (2001) 1237–1248.
[23] H.L. Windom, S.J. Schropp, F.D. Calder, J.D. Ryan, R.G. Smith, L.C. Burney, F.G. Lewis, C.H. Rawlinson, Natural trace metal concentrations in estuarine and coastal marine sediments of the southeastern United States, Environ. Sci. Technol. 23 (1989) 314–320.
[24] S. Covelli, G. Fontolan, Application of a normalization procedure in determining regional geochemical baselines, Environ. Geol. 30 (1997) 34–45.
[25] K.D. Daskalakis, T.P. O'Connor, Normalization and elemental sediment contamination in the coastal United States, Environ. Sci. Technol. 29 (1995) 470–477.
[26] K. Schiff, S.B. Weisberg, Iron as a reference element for determining trace metal enrichment in California coastal shelf sediments, Mar. Environ. Res. 48 (1999) 68–77.