Environmental Modelling & Software 63 (2015) 32–44


Streamflow rating uncertainty: Characterisation and impacts on model calibration and performance

Jorge L. Peña-Arancibia a, *, Yongqiang Zhang a, Daniel E. Pagendam b, Neil R. Viney a, Julien Lerat d, Albert I.J.M. van Dijk e, a, Jai Vaze a, Andrew J. Frost c

a Land and Water Flagship, CSIRO, Black Mountain, GPO Box 1666, Canberra, ACT 2601, Australia
b Digital Productivity and Services Flagship, CSIRO, EcoSciences Precinct, PO Box 2583, Brisbane, QLD, Australia
c Bureau of Meteorology, PO Box 413, Darlinghurst, NSW 1300, Australia
d Bureau of Meteorology, GPO Box 2334, Canberra, ACT 2601, Australia
e Fenner School of Environment & Society, Australian National University, Canberra, ACT 0200, Australia

Article history: Received 23 June 2014. Received in revised form 2 September 2014. Accepted 9 September 2014. Available online.

Abstract

Common streamflow gauging procedures require assumptions about the stage-discharge relationship (the ‘rating curve’) that can introduce considerable uncertainties in streamflow records. These rating uncertainties are not usually considered fully in hydrological model calibration and evaluation yet can have potentially important impacts. We analysed streamflow gauge data and conducted two modelling experiments to assess rating uncertainty in operational rating curves, its impacts on modelling and possible ways to reduce those impacts. We found clear evidence of variance heterogeneity (heteroscedasticity) in streamflow estimates, with higher residual values at higher stage values. In addition, we confirmed the occurrence of streamflow extrapolation beyond the highest or lowest stage measurement in many operational rating curves, even when these were previously flagged as not extrapolated. The first experiment investigated the impact on regional calibration/evaluation of: (i) using two streamflow data transformations (logarithmic and square-root), compared to using non-transformed streamflow data, in an attempt to reduce heteroscedasticity; and (ii) censoring the extrapolated flows, compared to no censoring. Results of calibration/evaluation showed that using square-root transformed streamflow (thus compromising between weight on high and low streamflow) performed better than using non-transformed and log-transformed streamflow. Also, surprisingly, censoring extrapolated streamflow reduced rather than improved model performance. The second experiment investigated the impact of rating curve uncertainty on catchment calibration/evaluation and parameter estimation. A Monte-Carlo approach and the nonparametric Weighted Nadaraya-Watson (WNW) estimator were used to derive streamflow uncertainty bounds. These were later used in calibration/evaluation using a standard Nash-Sutcliffe Efficiency (NSE) objective function (OBJ) and a modified NSE OBJ that penalised uncertain flows.
Using square-root transformed flows and the modified NSE OBJ considerably improved calibration and predictions, particularly for mid and low flows, and there was an overall reduction in parameter uncertainty. Crown Copyright © 2014 Published by Elsevier Ltd. All rights reserved.

Keywords: Modelling; Calibration; Ensemble; Stage-discharge; Rating curve; Nadaraya-Watson; Heteroscedasticity; Uncertainty

1. Introduction

Streamflow data are generally estimated from stage measurements through a stage–discharge relationship (the ‘rating curve’), developed through measurement of flow using manual methods (estimation of flow velocity combined with estimates of river width

* Corresponding author. CSIRO Land and Water, GPO Box 1666, Canberra ACT 2601, Australia. Tel.: +61 (2) 6246 5711; fax: +61 (2) 6246 5800. E-mail address: [email protected] (J.L. Peña-Arancibia).
http://dx.doi.org/10.1016/j.envsoft.2014.09.011
1364-8152/Crown Copyright © 2014 Published by Elsevier Ltd. All rights reserved.

and height for subsections of the river) and relating that to measured flow height at various points in time; then interpolation/extrapolation of that relationship across all height-flow levels using regression techniques to produce a curve. Several sources of uncertainty can be accounted for in this procedure, including measurements of flow height, width and shape of the river cross-section and inaccuracies in the measurement of the velocity–area relationship (Domeneghetti et al., 2012). Another source of uncertainty arises from the regression techniques used to derive the stage–discharge relationship. The classical approach for deriving


the stage–discharge (rating) relationship involves fitting a curve to (log-transformed) discrete rating measurements using (non)linear least squares. This implicitly assumes that the measurement residuals have a normal distribution and are unrelated to the expected discharge (Petersen-Øverleir, 2004). Residuals for existing curves often show non-normal distributions (e.g. Tomkins and Davidson, 2011) with higher residual values at higher stage values (heteroscedasticity). Scarce sampling and heteroscedasticity observed in streamflow residuals may introduce large uncertainty in streamflow estimates based on extrapolation of the rating curve (Westerberg et al., 2011).

These streamflow observations are the core data used to calibrate hydrological models. Objective functions (OBJs) are used in calibration to minimise the differences between observed and modelled streamflow and also to assess model performance under prediction. Traditionally, the minimisation is performed against the sum of squared residuals under the assumption that these residuals are homoscedastic in nature (i.e. there is no variance heterogeneity in the streamflow data). This assumption is often not valid for streamflow data (Petersen-Øverleir, 2004) and its violation may overestimate goodness-of-fit metrics used in simulations (McMillan et al., 2010). Moreover, OBJs routinely used in calibration, for example the Nash–Sutcliffe Efficiency (NSE; Nash and Sutcliffe, 1970), place high weights on high flows, which may be extrapolated, thus potentially biasing predictions (Croke, 2007).

In this paper, we investigate the impact of streamflow rating uncertainty on hydrological model calibration and performance (i.e. ‘prediction’ using streamflow data from catchments not used for model calibration, or split-sample ‘evaluation’ using streamflow data from a period not used for model calibration).
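The weighting behaviour of NSE under the flow transformations discussed above can be sketched as follows. This is an illustrative sketch, not the authors' code; the function names and the small `eps` guard against zero flows in the log variant are our assumptions.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 - sum of squared residuals
    divided by the variance of the observations about their mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def log_nse(obs, sim, eps=1e-6):
    """NSE on log-transformed flows: places more weight on low flows."""
    return nse(np.log(np.asarray(obs) + eps), np.log(np.asarray(sim) + eps))

def sqrt_nse(obs, sim):
    """NSE on square-root-transformed flows: compromises between
    weight on high and low flows."""
    return nse(np.sqrt(obs), np.sqrt(sim))
```

Because the residuals enter squared, a single mistimed flood peak can dominate the untransformed NSE, which is why the transformed variants are considered in the experiments.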
Firstly, we use a comprehensive hydrometric dataset of 65 streamflow gauges (described in Section 2) to assess the occurrence of heteroscedasticity and extrapolation in rating curves (Section 2.1). Secondly, we conduct two types of experiments:

(i) The first experiment makes use of the entire streamflow dataset (65 streamflow time series) to assess the impacts of including uncertain extrapolated streamflow data in a regional calibration/prediction experiment (Section 2.2). Several methods were trialled to address this problem, from censoring all extrapolated high flows to using streamflow Box-Cox transformations (Box and Cox, 1964; Bennett et al., 2013) in an attempt to reduce heteroscedasticity. For this experiment, we calibrated a single parameter set (n = 28) of the process-based landscape water balance model, the Australian Water Resources Assessment system Landscape model (AWRA-L) (van Dijk, 2010; van Dijk and Renzullo, 2011; Vaze et al., 2013), in 33 of the 65 stations, and performance was independently evaluated for the remaining 32 stations. We assessed the impact of censoring high flows on the NSE compared to no censoring. We repeated the experiment using two streamflow transformations (logarithmic and square-root). The regional calibration (i.e. a single set of parameters to predict flows in a large geographical domain) was chosen for methodological and practical reasons. Firstly, predictions of streamflow and other fluxes and stores (e.g. evapotranspiration and soil water) are required in many ungauged basins with dissimilar climate and biophysical characteristics; a regional calibration using a large amount of catchment streamflow data might yield better results than parameter regionalisation techniques (Parajka et al., 2007; Vaze and Teng, 2011). Secondly, AWRA-L has been regionally calibrated (using a similar approach as in this paper) against Australian streamflow and evapotranspiration data, producing results that markedly improved compared to a


previous non-calibrated version (version 1.0 vs. 0.5; Viney et al., 2011). The calibration results were also similar to results from locally calibrated conceptual models, showing that AWRA-L can capture the different climatic and biophysical characteristics that affect streamflow (Viney et al., 2011). Thirdly and finally, AWRA-L is currently used operationally to provide information on water fluxes and stores across Australia and it is being continuously refined.

(ii) The second experiment investigates the impact of rating curve uncertainty on the NSE and parameter estimation in a local calibration/evaluation (Section 2.3) using a Monte-Carlo approach and the nonparametric Weighted Nadaraya-Watson (WNW) estimator. We use these methods for quantifying the error in the rating curve because they capture changes in the rating curve with time, they are nonparametric, and they make minimal assumptions about the probabilistic distribution of the data. We employed them to derive rating curve uncertainty bounds for 100 streamflow realisations. To interpret impacts on parameter space, we calibrated 4 parameters of the simpler conceptual rainfall-runoff model GR4J (compared to the 28-parameter AWRA-L) (Perrin et al., 2003). These were later used in split-sample calibration/evaluation in a single station using a standard NSE OBJ and a modified NSE OBJ, which used the uncertainty bounds to penalise uncertain flows. Again, we repeated the experiment using logarithmic and square-root streamflow transformations.

The data and methods are described in Section 2. The results of the experiments are presented and analysed in Section 3, the findings are discussed in Section 4 and finally conclusions are drawn (Section 5).

2. Data and methods

The New South Wales (NSW) Office of Water (NoW) in Australia regularly republishes the ‘Pinneena’ water database on DVD (http://waterinfo.nsw.gov.au/pinneena/gw.shtml).
The version used here (December 2009) includes 127,000 years of daily streamflow information from 1400 stations. The database includes records of hydraulic control type (including concrete structures, rocky river bed not reinforced with concrete, gravel or sand river bed), stage height, rating tables, interpolation method, gauging measurements and percentage deviations of gaugings from the rating curve. This detailed database can be used to infer uncertainty due to extrapolation, the occurrence of heteroscedasticity and, with the use of statistical techniques, uncertainty in streamflow data.

The catchments used in the experiments performed here (Fig. 1) were chosen from a subset of the ‘Pinneena’ database that was selected for previous modelling studies (Zhang et al., 2013) because they are headwater catchments without significant influence from river regulation, urbanisation and irrigation. From this subset, catchments with >15 years of daily streamflow data (all in ML d−1) during 1980–2008 and >70 rating measurements were selected, resulting in a total of 65 stations. Hourly streamflow data were extracted from the ‘Pinneena’ database and averaged from 9:00 a.m. to 9:00 a.m. the following day, to coincide with the time of daily rainfall recording. The streamflow volumes were converted to areal average streamflow (mm d−1).

Climate data used as model forcing included rainfall, Priestley–Taylor potential evapotranspiration, minimum and maximum temperature and incoming shortwave radiation. These were sourced or derived from the Specialised Information for Land Owners (SILO) dataset (Jeffrey et al., 2001) available from the Queensland Department of Environment and Resource


Fig. 1. Geographical location of the 65 catchments used in this study and runoff coefficient for 1980–2008. Coloured in red is the catchment drainage area of station 206001 (runoff coefficient 0.42, see Section 2.3). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Management (QDERM) and the Queensland Climate Change Centre of Excellence (QCCCE) (http://www.longpaddock.qld.gov.au/silo/). The selected catchments are located east and west of the Great Dividing Range, which acts as a natural barrier to oceanic moist air, resulting in a marked rainfall gradient from east to west and an associated gradient in runoff coefficients (Fig. 1). The rainfall range is 590–1770 mm y−1, with a mean of 955 ± 265 mm y−1 (standard deviation) for 1980–2008. The Priestley–Taylor potential evapotranspiration range is 1085–1561 mm y−1, with a mean of 1320 ± 100 mm y−1. Catchment areas vary from 55 to 6815 km2 and include fully forested and fully cleared catchments, as well as catchments with a mixture. Cleared areas include a variable mix of cropping and grazing.
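The 9 a.m.-to-9 a.m. averaging and the conversion from volumetric (ML d−1) to areal (mm d−1) streamflow described above can be sketched as follows. This is an illustrative sketch assuming a pandas hourly series; the function names are ours, not from the paper or the ‘Pinneena’ tooling.

```python
import pandas as pd

def daily_9am_mean(hourly: pd.Series) -> pd.Series:
    """Average hourly streamflow over 9:00 a.m. to 9:00 a.m. windows,
    matching the daily rainfall observation time."""
    return hourly.resample('24h', offset='9h').mean()

def ml_per_day_to_mm(q_ml_per_day: float, area_km2: float) -> float:
    """Convert streamflow volume (ML/d) to areal depth (mm/d).
    1 ML = 1000 m^3, 1 km^2 = 1e6 m^2, 1 m = 1000 mm."""
    return q_ml_per_day * 1000.0 / (area_km2 * 1e6) * 1000.0
```

A convenient consequence of the units: 1 ML spread over 1 km2 is exactly 1 mm, so the conversion reduces to dividing ML d−1 by the catchment area in km2.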

2.1. Heteroscedasticity and extrapolation in rating curves

Heteroscedastic residuals in the observed dataset were detected visually by inspecting plots of stage height against maximum-normalised residuals (Petersen-Øverleir, 2004). Residuals were computed from the deviations between the measured ratings and rating-curve-derived streamflow. A trumpet shape typically characterises heteroscedastic residuals (see Fig. 1 in Petersen-Øverleir, 2004). Stations were classified through visual inspection as follows: moderate to strong evidence of heteroscedasticity as type A; none to slight evidence of heteroscedasticity as type B; and inconclusive evidence as type C. Stations were classified as type C when there were not enough measurements across the stage range to make a strong inference.

Streamflow data are archived with numerical quality flags. Only the more reliable data, flagged with quality code ≤ 150, were considered in the averaging. If one hourly data point had a quality code > 150, the whole day was censored (NoW, 2011, pp. 15–36). Other quality control tests censored spikes and other suspicious data (see Zhang et al., 2013 for details). However, the accuracy of quality flags can sometimes be questionable and extrapolated or

gap-filled data may remain in the dataset (van Dijk and Warren, 2010). For each station considered, ‘Pinneena’ provided the date of rating measurements, deviations from the rating curve and the dates when each rating curve was used (a numeric code identifies each rating curve). Typically, different curves may be used for different parts of the record; however, there is no information on which ratings were used to derive each rating curve. The scarcity of high-flow rating measurements (presumably because of infrequent and short-lived peak flow events) was notable for all 65 stations.

We took the highest and lowest rating measurements to define the range of the rating curve, as would have been done operationally (Kerry Tomkins, pers. comm., CSIRO Land and Water Flagship). We censored hourly streamflow data outside of this range, and by implication the whole day. An example illustrating this procedure for station 213200 is shown in Fig. 2. The plot shows stage vs. all the ratings (black crosses) in the record (04/09/1929–05/06/2003) and the whole streamflow time series (black points). The dashed vertical line indicates the upper limit of the rating curve range. The streamflow data to the right of the line are considered extrapolated and censored. Note that the censored data were flagged as ‘reliable’ quality (quality code ≤ 150) in the database. This procedure was repeated for the 65 stations, and the resulting ‘censored’ dataset is used in conjunction with the original ‘uncensored’ dataset in a regional model calibration/prediction experiment (Section 2.2).
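The censoring rule above amounts to masking any flow whose stage lies outside the range spanned by the discrete rating measurements. A minimal sketch, with the function name and the NaN-masking convention our own assumptions:

```python
import numpy as np

def censor_extrapolated(stage, flow, rating_stages):
    """Mask (set to NaN) flows whose stage lies outside the range
    spanned by the discrete rating measurements, i.e. flows derived
    by extrapolating the operational rating curve."""
    lo, hi = np.min(rating_stages), np.max(rating_stages)
    stage = np.asarray(stage, float)
    flow = np.asarray(flow, float).copy()
    flow[(stage < lo) | (stage > hi)] = np.nan  # censored (extrapolated)
    return flow
```

In the paper's procedure a censored hourly value censors the whole day; that daily roll-up is omitted here for brevity.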

2.2. Impacts of censoring streamflow data in a regional calibration experiment We followed the regional calibration procedure of Zhang et al. (2011) using AWRA-L version 0.5. AWRA-L can be considered a hybrid between a simplified grid-based land surface model and a non-spatial catchment model applied to individual cells. Many of its equations are process-based and were tested against observations as part of model development (van Dijk and Warren, 2010;


van Dijk and Renzullo, 2011). Full technical details of the version of the model can be found elsewhere (van Dijk, 2010). A single set of AWRA-L parameters was optimised for 33 catchments for 1980–2008. The resulting optimal parameter set was used for the remaining 32 catchments to evaluate the simulations on censored and uncensored streamflow. Catchments for calibration and prediction were randomly picked to cover the whole geographical domain. The global optimisation was performed through a genetic algorithm optimiser. AWRA-L was calibrated globally according to minimisation of a ‘grand’ objective function reflecting performance of the model across all sites. Firstly, the following objective function (F), comparing observed and modelled flow for each catchment, is computed as

F = 1 − NSE,   (1)

where NSE is the Nash-Sutcliffe Efficiency of daily streamflow. Then, a composite summary or ‘grand’ objective function (Fg) is calculated by considering the 1st, 25th, median (50th) and 75th percentiles (F1%, F25%, F50% and F75% respectively) of the 33 objective function values:

Fg = F1% + F25% + F50% + F75%.   (2)

The calibration process then optimises the model parameters to minimise Fg. The 99th percentile was not used because F values for catchments above the 90th percentile are one to two orders of magnitude higher, placing too much weight on poorly calibrated catchments. In this way, streamflow (and/or climate) data deemed of poor quality are removed from the calibration. For more technical details about the calibration procedure please refer to Zhang et al. (2011). The calibration was repeated six times, using three NSE-based metrics for both the original and the censored streamflow records. In addition to the standard NSE metric (stdNSE), which places more weight on high flows, we repeated the exercise calculating NSE for log-transformed data (logNSE, placing more weight on low flows) and square-root-transformed data (sqrtNSE, compromising between weight on high and low flows). In each of the six calibrations, the optimised parameter set was used to predict streamflow in the remaining 32 catchments and evaluated against both censored and uncensored streamflow data, creating a total of 12 evaluation variants. Differences between the three calibration approaches were examined using probability of non-exceedance plots for the 32 evaluation catchments.

Fig. 2. Scatter plot of stage height (m) vs. rating measurements (black crosses) and streamflow derived from the rating curve (black points). The upper range of the rating curve is determined by the highest rating on record. The data to the right (and to the left when a lower limit exists) are considered extrapolated and censored.

2.3. Impacts of streamflow uncertainty on calibration and evaluation and parameter estimation

There have been several published approaches to dealing with rating curve uncertainty, including frequentist (Petersen-Øverleir, 2004) and Bayesian methods (Petersen-Øverleir et al., 2009; Reitan and Petersen-Øverleir, 2011; Hrafnkelsson et al., 2012). We used the procedure developed by Pagendam and Welsh (2011) to estimate rating curve uncertainty and generate an ensemble of feasible streamflow realisations based on the gauge height. The procedure uses the non-parametric Weighted Nadaraya-Watson (WNW) estimator (Hall et al., 1999; Cai, 2001) to estimate the conditional cumulative distribution function of streamflow given an observation of stage and a time in the station's history. This estimator has a number of desirable properties, reproducing the superior bias properties of locally linear estimators, like the estimator proposed by Yu and Jones (1998), but additionally, unlike the Yu-Jones estimator, it always returns a valid distribution function. For our purposes herein, we define the WNW estimator of the distribution function as

\hat{F}(y_t \mid D_t = d, T = t) = \frac{\sum_{i=1}^{r} p_i(d,t,Y,D,T)\, K_{h_1}(d - D_i)\, K_{h_2}(t - T_i)\, I(Y_i \le y_t)}{\sum_{i=1}^{r} p_i(d,t,Y,D,T)\, K_{h_1}(d - D_i)\, K_{h_2}(t - T_i)},   (3)

where y, d and t are the log-discharge, stream stage and observation time respectively. Y, D and T are r-dimensional vectors containing the respective observations of log-discharge, stage and the observation times that are used to build the rating curves for the gauging station (note that these data are distinct from the time series data of stream stage used to model flow). The functions K_{h_1}(\cdot) and K_{h_2}(\cdot) are Gaussian kernel functions with bandwidths h_1 and h_2 respectively, and I(\cdot) is the indicator function, which takes the value of one if the expression in brackets is true and zero otherwise. Finally, the p_i(d,t,Y,D,T) are a set of additional weights that are a function of the rating curve data as well as the time (t) and stage (d) at which the distribution is being computed. These weights have the property that each p_i \ge 0 and \sum_{i=1}^{r} p_i = 1, in addition to

\sum_{i=1}^{r} (D_i - d)\, p_i(d,t,D,T)\, K_{h_1}(d - D_i)\, K_{h_2}(t - T_i) = 0   (4)

and

\sum_{i=1}^{r} (T_i - t)\, p_i(d,t,D,T)\, K_{h_1}(d - D_i)\, K_{h_2}(t - T_i) = 0.   (5)

The weights satisfying these constraints are equal to p_i = M_i / \sum_{i=1}^{r} M_i, where

M_i = \{1 + a_1 K_{h_1}(d - D_i) K_{h_2}(t - T_i)(d - D_i) + a_2 K_{h_1}(d - D_i) K_{h_2}(t - T_i)(t - T_i)\}^{-1},   (6)

and a_1 and a_2 are Lagrange multipliers. The \{p_i\} satisfying the above constraints are not uniquely defined, and the accepted approach is to choose a_1 and a_2 in order to maximise \sum_{i=1}^{r} \log(p_i).
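The core of Eq. (3) can be sketched in a few lines. This illustrative sketch (not the authors' implementation) takes the weights p_i as uniform, which reduces the WNW form to the plain Nadaraya-Watson estimator; solving Eqs. (4)-(6) for the empirical-likelihood weights is omitted.

```python
import numpy as np

def gauss_kernel(u, h):
    """Unnormalised Gaussian kernel; the normalising constant cancels
    in the ratio of Eq. (3)."""
    return np.exp(-0.5 * (u / h) ** 2)

def nw_conditional_cdf(y, d, t, Y, D, T, h1, h2):
    """Kernel estimate of F(y | stage = d, time = t) from rating data:
    Y log-discharges, D stages, T observation times. With uniform
    weights p_i this is the plain Nadaraya-Watson form; the full WNW
    estimator additionally reweights each term by p_i."""
    w = gauss_kernel(d - D, h1) * gauss_kernel(t - T, h2)
    return np.sum(w * (Y <= y)) / np.sum(w)
```

Rating measurements close to the query stage and time dominate the weighted indicator average, which is how the estimator tracks gradual shifts in the rating curve.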


The procedure outlined above makes it possible to estimate the conditional distribution function (CDF) of streamflow. The incorporation of time into the estimator produces a CDF that varies smoothly through time to incorporate observed changes in the rating curve. The bandwidths h_1 and h_2 were chosen by fitting a kernel regression between log-flow, stage and time (Hayfield and Racine, 2008).

Firstly, in order to estimate a streamflow uncertainty ‘envelope’ for a streamflow time series, we conducted 10,000 Monte-Carlo simulations as follows:

1. Set N = 10,000 (number of simulations) and set i = 1.
2. Generate n uniformly distributed random variables (U_{i,1}, ..., U_{i,n}) over the interval [0,1].
3. Denote the stage at time t_j as D_j. Then for each element D_j in the time series of stages (D_1, ..., D_n), calculate \hat{F}(\cdot \mid D = D_j, T = t_j) and obtain samples of log-flow Y_{i,j} by inverting the estimated conditional distribution function: Y_{i,j} = \hat{F}^{-1}(U_{i,j} \mid D = D_j, T = t_j). Denote the ensemble member as Y_i = (Y_{i,1}, ..., Y_{i,n}).
4. If i < N, increment i according to i = i + 1 and go to step 2; otherwise return the ensembles Y_1, ..., Y_{10,000}.

It is assumed here that the method provides a reliable estimate of the distribution of flow given stage with 10,000 ensemble members. The approach can be computationally demanding because the Lagrange multipliers present in Equation (6) must be computed for every combination of stage height and time on some appropriately fine grid. Techniques to improve this approach and its speed of computation are currently under investigation.

Secondly, to derive realisations of streamflow uncertainty, the 10,000 streamflow ensemble simulations are split into batches of 100. This is done arbitrarily in order to create an equal number of plausible streamflow realisations and to illustrate how uncertainty in the hydrograph can impact on calibration.
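Steps 1-4 above can be sketched as follows. Here `inv_cdf` is a stand-in for the inverted WNW conditional CDF, and the function name and fixed seed are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def sample_ensembles(inv_cdf, stages, times, n_sims=1000):
    """Inverse-CDF Monte-Carlo sampling of flow realisations:
    for each simulation i, draw uniforms U_{i,j} (step 2) and invert
    the conditional CDF at each (stage, time) pair (step 3), giving
    an ensemble member Y_i = (Y_{i,1}, ..., Y_{i,n})."""
    n = len(stages)
    ensembles = np.empty((n_sims, n))
    for i in range(n_sims):
        u = rng.uniform(size=n)                    # step 2
        for j, (d, t) in enumerate(zip(stages, times)):
            ensembles[i, j] = inv_cdf(u[j], d, t)  # step 3
    return ensembles                               # step 4
```

Because each day's draw is an independent inversion of the same conditional CDF, the ensemble spread at a given stage directly reflects the rating uncertainty at that stage.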
Thirdly, to introduce this uncertainty in a modified NSE, upper and lower uncertainty bounds were set by ranking each streamflow realisation and choosing the 5th percentile (Q5), 50th percentile (Q50) and 95th percentile (Q95) values (e.g. McMillan et al., 2010). We use the 90% inter-quantile range as an estimate of uncertainty (ΔQo) for each day in the 100 simulated streamflow time series as

ΔQ_o = Q_{95} − Q_{5}.   (7)

We chose Q5 and Q95 to represent a high level of uncertainty in the weighting factors and assess the effect on model calibration/evaluation. Depending on the hydrological application, some quantiles might be more appropriate than others. For testing purposes, we picked eight stations that had rigid structures for control (V-notch, other concrete-reinforced structures or a rocky bed) and also showed a relatively stable rating curve (usually associated with a quasi-stationary river bed) to compare streamflow estimated using the observed rating curve against the 100 streamflow realisations derived from the Pagendam and Welsh (2011) method. The estimated uncertainty was incorporated in a modified NSE through an optimal weighted average approach (Aitken, 1935). The modified NSE (MNSE) is defined based on the work of Croke (2007) as

MNSE = 1 − \frac{\sum_{i=1}^{n} g_i (Q_{o,i} − Q_{m,i})^2}{\sum_{i=1}^{n} g_i (Q_{o,i} − \bar{Q}_o)^2},   (8)

where Q_o is the observed streamflow (defined here as Q50), Q_m is the modelled streamflow, and the weights g_i are obtained from the streamflow uncertainty as

g_i = \left(\frac{1}{ΔQ_{o,i}}\right)^2.   (9)

We chose station 206001 (Fig. 1, catchment drainage area coloured in red) to conduct a local split-sample calibration/evaluation experiment (Klemeš, 1986), for the following reasons: it used a single rating curve during the calibration period (1980–1997) and two during the evaluation period (1998–2008), and all three were similar, suggesting a relatively stable stage–discharge relationship throughout both periods. Streamflow was almost perennial and the station showed moderate residual heteroscedasticity (Fig. 3a).

A key objective of this experiment is to test the impacts of streamflow uncertainty on parameter estimation. We selected a parsimonious rainfall-runoff model, GR4J, which has only four calibration parameters (Le Moine et al., 2007; Vaze et al., 2011; Lerat et al., 2012). The use of GR4J therefore makes results more interpretable when assessing parameter space, compared to the 28-parameter AWRA-L model. The four calibrated parameters are: (1) x1, the capacity of the production store (mm); (2) x2, the inter-catchment groundwater exchange coefficient (mm); (3) x3, the capacity of the nonlinear store (mm); and (4) x4, the unit hydrograph time (days). A detailed description of the model can be found in Perrin et al. (2003). The local optimisation was also performed using the genetic algorithm optimiser.

For station 206001, we generated 10,000 ensembles; splitting the ensembles into batches of 100, we obtained 100 ‘observed’ streamflow realisations and corresponding uncertainty envelopes (ΔQo). The 50th percentile, used in this experiment as the ‘observed’ streamflow (Qo = Q50), was compared to ‘Pinneena’ streamflow to ascertain its robustness and accuracy. These streamflow realisations were used in split-sample calibration (1990–1998) and evaluation (1999–2008) using the three NSE variants (stdNSE, stdMNSE and sqrtMNSE) as OBJs. We thus obtained 300 (3 × 100) optimal parameter sets. Again, we repeated the experiment using two streamflow transformations (logarithmic and square-root). Differences between the three calibration approaches were examined using probability of non-exceedance plots. Further, we compared the impact of including the computed rating curve uncertainty on model parameters using frequency distribution plots.
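Eqs. (8) and (9) combine into a short function; this is an illustrative sketch (function name ours), not the authors' code.

```python
import numpy as np

def mnse(q_obs, q_mod, dq):
    """Modified NSE (Eqs. 8-9): residuals are weighted by
    g_i = (1 / DQ_{o,i})^2, so days with a wide uncertainty
    envelope DQ_o = Q95 - Q5 contribute less to the objective."""
    q_obs, q_mod, dq = (np.asarray(a, float) for a in (q_obs, q_mod, dq))
    g = (1.0 / dq) ** 2
    num = np.sum(g * (q_obs - q_mod) ** 2)
    den = np.sum(g * (q_obs - q_obs.mean()) ** 2)
    return 1.0 - num / den
```

With a constant uncertainty envelope the weights cancel and MNSE reduces to the standard NSE; the weighting only changes the objective when uncertainty varies between days.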

3. Results

3.1. Heteroscedasticity and extrapolation in rating curves

Plots of stage vs. maximum-normalised residuals between rating measurements and the rating curve were visually examined for evidence of heteroscedasticity. The classification resulted in 29 stations (45%) showing the characteristic trumpet shape; these were classified as type A (moderate to strong evidence of heteroscedasticity). Another 14 (21%) were classified as type B (none to slight evidence of heteroscedasticity), and finally 22 (34%) stations were classified as type C (inconclusive evidence). Fig. 3 illustrates an example of each type. There was no apparent relation between geographic location and heteroscedasticity type.

The procedure of censoring streamflow data in the 65 stations eliminated an average of 0.65% of the record due to high-flow extrapolation and 4.8% of the record due to low-flow extrapolation. As a result, the mean daily streamflow declined by 14% (even with the exclusion of the low-flow values), from 0.50 ± 0.46 mm d−1 (standard deviation) to 0.43 ± 0.39 mm d−1. The scatter plot in Fig. 4a shows that most of the reduction occurred generally in


Fig. 3. Example of stations showing (a) moderate to strong evidence of heteroscedasticity as type A; (b) none to slight evidence of heteroscedasticity as type B and (c) inconclusive evidence as type C.

stations with higher mean daily streamflow. Censoring can drastically eliminate very high flows, as can be seen in the flow duration curve (FDC) for station 204034 (Fig. 4b).

3.2. Impacts of censoring streamflow data in a regional calibration experiment

Plots of the probability of non-exceedance for each of the three NSE OBJs in the regional calibration/evaluation experiment are shown in Fig. 5. Calibration results using uncensored flows generally outperformed, or were close to, those using censored flows

(Fig. 5a, c and e). Table 1 summarises the mean NSE of the regional calibration results, grouped into the 11 best calibrated catchments (‘top’), the next best 11 (‘middle’) and the 11 worst (‘bottom’), together with the calibration values of the ‘grand’ objective functions (Fg). Overall (according to Fg), the uncensored data calibrations were superior, with the greatest impact on ‘top’ and ‘middle’ catchments in terms of 1 − NSE. The censored calibrations performed better for the ‘bottom’ catchments. The least degradation (that is, decrease of NSE for censored calibrations relative to uncensored calibrations) in the ‘top’ and ‘middle’ calibrated catchments was for sqrtNSE, whereas

Fig. 4. (a) Scatter plot of uncensored vs. censored mean annual streamflow (Q) and (b) example of the effects of censoring on the flow duration curve (FDC) of station 204034.


Fig. 5. Summary of regional calibration results. Probability of non-exceedance plots for calibration and prediction of (a–b) stdNSE, (c–d) logNSE and (e–f) sqrtNSE, respectively. Abbreviations ‘un–un’ and ‘cen–cen’ correspond to uncensored parameters evaluated on uncensored flows and censored parameters evaluated on censored flows respectively, and ‘un–cen’ and ‘cen–un’ correspond to uncensored parameters evaluated on censored flows and censored parameters evaluated on uncensored flows respectively.

the most degradation was for logNSE. The NSE differences between censored and uncensored streamflow in the three calibration variants (Table 1) were not statistically significant at the 95% confidence level (two-tailed t-test). On the other hand, the differences between calibration variants were significant (at the 95% confidence level) between sqrtNSE and the other two variants (for both uncensored and censored flows), whereas there were no significant differences between stdNSE and logNSE (again, for both uncensored and censored flows).

In evaluation, the overall best results were obtained for predictions using parameter sets based on uncensored calibration data (‘uncensored calibration’), evaluated against uncensored data (‘uncensored evaluation’). This was followed by the combination uncensored calibration/censored evaluation (Fig. 5b, d and f and Table 2). The degradation in performance from calibration to evaluation was variable, but the least degradation was obtained using sqrtNSE in calibration (whether with uncensored or censored data).
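The three calibration variants compared above differ only in the pointwise transformation applied to observed and simulated flows before computing the NSE. A minimal sketch, with hypothetical flow series:

```python
import math

def nse(obs, sim, transform=lambda q: q):
    """Nash-Sutcliffe efficiency after applying a flow transformation."""
    o = [transform(q) for q in obs]
    s = [transform(q) for q in sim]
    mean_o = sum(o) / len(o)
    ss_res = sum((oi - si) ** 2 for oi, si in zip(o, s))
    ss_tot = sum((oi - mean_o) ** 2 for oi in o)
    return 1.0 - ss_res / ss_tot

obs = [0.2, 0.5, 3.1, 12.0, 4.2, 0.8]   # mm/d, hypothetical
sim = [0.3, 0.4, 2.5, 10.5, 4.8, 0.6]

std_nse  = nse(obs, sim)             # stdNSE: weight on high flows
sqrt_nse = nse(obs, sim, math.sqrt)  # sqrtNSE: compromise between regimes
log_nse  = nse(obs, sim, math.log)   # logNSE: weight on low flows
                                     # (non-zero flows only)
```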

Table 1
‘Grand’ objective function (Fg) values and means of the three NSE OBJs (stdNSE for flows, logNSE for log-transformed flows and sqrtNSE for square-root-transformed flows) for calibration using the uncensored and censored streamflow data, with their differences. ‘Top’ corresponds to the mean of the 11 best calibrated catchments, ‘middle’ to the next best 11 and ‘bottom’ to the 11 worst. Fg is calculated from the 1st, 25th, 50th and 75th percentiles (Fg = F1% + F25% + F50% + F75%) of F = 1 − NSE across all catchments. As opposed to NSE, a lower Fg denotes better calibration/validation.

                      Streamflow data used
             Uncensored   Censored   Difference
stdNSE
  Top          0.614        0.588      0.026
  Middle       0.431        0.394      0.038
  Bottom       0.175        0.239      0.064
  Fg           0.562        0.585      0.023
sqrtNSE
  Top          0.602        0.580      0.022
  Middle       0.421        0.390      0.032
  Bottom       0.185        0.201      0.016
  Fg           0.439        0.442      0.003
logNSE
  Top          0.586        0.552      0.034
  Middle       0.375        0.342      0.033
  Bottom       0.048        0.136      0.088
  Fg           0.549        0.549      0.000

3.3. Impacts of streamflow uncertainty on calibration, prediction and parameter estimation

Not surprisingly, the WNW-generated streamflow realisations for the eight stations with a quasi-stationary rating curve showed good agreement with the streamflow record from the database, in terms of both correlation coefficient and mean (Table 3). Fig. 6 shows example time series of observed streamflow and the 100 simulated ensemble members for the year 2003 at three stations. That year had higher than average annual streamflow (1980–2008). The ensembles emulate the observed hydrograph closely, without noticeable differences, and the resulting envelope provides effective upper and lower uncertainty bounds. The 100 streamflow realisations were used to set those bounds by choosing the 5th percentile (Q5) and the 95th percentile (Q95). The uncertainty in the observed flows, ΔQo (the difference between Q95 and Q5), is shown for three stations in Fig. 7, plotted against Q50. Uncertainty appears to increase linearly with Q50 in most of the cases examined, although some very low flows also show rather high uncertainty estimates. Another way of visualising the uncertainty envelope is in the FDC of Q50, Q95 and Q5 for the 100 simulations at station 206001 (Fig. 8).

The uncertainty in the estimated parameter space was substantially reduced when MNSE was used in calibration instead of stdNSE (Fig. 9), implying that overall estimation uncertainty is reduced when the same data transformation is used. Parameter values for sqrtMNSE (except for x4, which did not exhibit a clear pattern for either stdNSE or sqrtMNSE) were closer to the more stable parameters of MNSE and appeared to have less uncertainty than those of stdNSE. Table 4 shows the means and standard deviations of the parameter values; although the means of sqrtMNSE were clearly closer to those of MNSE, the parameter uncertainty results (in terms of standard deviation) were mixed.

The impacts of the different parameter sets on evaluation (1998–2008), in terms of the different versions of NSE, are shown in Fig. 10. When stdNSE is the goodness-of-fit metric (i.e. giving more weight to the correct simulation of high flows), the parameter sets obtained using stdNSE and sqrtMNSE perform similarly, whereas the poorer performance of the parameter sets obtained using MNSE is clear across all 100 predictions. When sqrtNSE is used for model performance evaluation (i.e. compromising between high and low flows), MNSE performs best, followed by sqrtMNSE; the predictions of stdNSE are substantially degraded compared to the other two. This is also the case when logNSE is the goodness-of-fit metric, with MNSE and sqrtMNSE performing similarly.

4. Discussion

4.1. Heteroscedasticity and extrapolation in rating curves

This paper ascertained the occurrence of residual heteroscedasticity at 65 streamflow stations located in southeast Australia. For the stations studied here, rating curves are derived using piecewise regressions of a logarithmic or linear form. Upon visual inspection, residuals between measured streamflow and values estimated with these curves showed evidence of heteroscedasticity at 29 stations, with higher residual values at higher stage values. Results were inconclusive at 22 stations, mainly due to the scarcity of rating measurements at higher stages. This type of heteroscedastic residual underlines the high uncertainty potentially caused by extrapolation of rating curves. There is scant published literature on how common residual heteroscedasticity is

Table 2
‘Grand’ objective function (Fg) values and means of the three NSE OBJs (stdNSE for flows, logNSE for log-transformed flows and sqrtNSE for square-root-transformed flows) for prediction using the uncensored and censored streamflow data, with their differences. ‘Top’ corresponds to the mean of the 10 best calibrated catchments, ‘middle’ to the next best 11 and ‘bottom’ to the 11 worst. As in Table 1, lower Fg denotes better performance; Top/Middle/Bottom values are mean NSE.

                                 Streamflow data used for prediction
                      Uncensored                       Censored
Cal. parameters     Fg     Top    Middle  Bottom     Fg     Top    Middle  Bottom
stdNSE uncensored   0.530  0.650  0.476   0.303      0.553  0.621  0.446   0.227
stdNSE censored     0.599  0.609  0.412   0.218      0.592  0.619  0.403   0.208
Difference          0.069  0.041  0.064   0.085      0.039  0.002  0.043   0.019
sqrtNSE uncensored  0.402  0.737  0.627   0.462      0.424  0.710  0.608   0.449
sqrtNSE censored    0.439  0.717  0.574   0.394      0.456  0.693  0.569   0.386
Difference          0.037  0.020  0.053   0.067      0.032  0.017  0.039   0.063
logNSE uncensored   0.578  0.630  0.474   0.210      0.514  0.615  0.469   0.176
logNSE censored     0.621  0.614  0.460   0.112      0.523  0.613  0.438   0.072
Difference          0.043  0.016  0.013   0.098      0.009  0.002  0.030   0.104


Table 3
Correlation coefficient of observed and simulated daily streamflow (Qobs,mean vs. Q50,mean) and mean (1980–2008) observed and simulated streamflow (mm d−1) for Qobs and Q50, Q95 and Q5.

Station ID    r      Qobs,mean   Q50,mean   Q95,mean   Q5,mean
203005       0.97      1.09        1.05       1.17       0.95
204041       1.00      1.06        1.09       1.38       0.85
204906       0.99      1.61        1.56       1.79       1.36
206001       0.99      1.51        1.48       1.60       1.37
208019       0.99      0.84        0.85       1.02       0.72
213200       0.80      1.05        0.92       0.99       0.87
215002       0.98      0.38        0.37       0.44       0.31
410057       1.00      1.17        1.14       1.31       0.99

in other parts of the world (but see, e.g., Petersen-Øverleir (2004) for Norwegian catchments). This may partly be due to the lack of publicly available information on rating measurements and rating curves. Rating information such as that provided by the ‘Pinneena’ database (NoW, 2011) is a first step towards improving this situation. It is noted that the choice of stage-discharge measurements used in the regression to derive the rating curve may have an undue influence on the nature of the residuals, and may result in different heteroscedasticity depending on whether the regression is performed on log-, semilog- or non-transformed stage-discharge values. We also assessed the reliability of the quality codes used to flag streamflow data. Although NoW gauging sampling at the 65 stations has captured high flow conditions to some extent, there were a number of stations at which extrapolation still accounted for a large proportion of high flows (and also low flows), yet these were labelled as of good quality. At these stations, additional sampling campaigns and a ground survey of the stream bed during and after higher than average flow conditions are required to derive more robust rating curves. In addition, recent methods based on airborne LiDAR can be used to support sampling campaigns in remote and/or inaccessible locations during flood events (e.g. Nathanson et al., 2012).

4.2. Impacts of censoring streamflow data in a regional calibration experiment

The impact of censoring extrapolated flows on the regional calibration and evaluation of AWRA-L was lower than expected. Surprisingly, the optimised parameter sets obtained by calibration against uncensored data generally outperformed those obtained by calibration against censored data in both calibration and validation, whether validation used censored or uncensored data. Censoring left sites dominated by high flows (presumably little affected by censoring) that are harder to fit in the censored calibration. This can be inferred from the results in Table 1, where the mean NSEs are higher for the uncensored calibrations for the 11 top and 11 middle catchments, whichever OBJ was used. The results for the 11 bottom calibrated catchments are less reliable because they are very sensitive to a few catchments with very poor calibration performance (e.g. Fig. 5a and c). Both of these results reflect the focus of Fg on the top and middle catchments: excluding the poorest-performing catchments from the grand OBJ indirectly removes data deemed of poor quality. The uncensored calibration must introduce some extra data that are harder to fit at a few very poorly performing sites, as suggested in Fig. 5a. The least degradation (that is, decrease in NSE for censored calibrations relative to uncensored calibrations) was for the NSE using the square-root transformation, whereas the most degradation was for the log-transformation. The square-root transformation was also found by Oudin et al. (2006) to be a good compromise for an ‘all-purpose’ model, giving appropriate weight to both low and high flows. This result is of particular note for prediction in ungauged basins, where regional calibration against a large amount of catchment streamflow data (as in this paper) might yield better results than parameter regionalisation techniques (Parajka et al., 2007; Vaze and Teng, 2011).
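The ‘grand’ objective function Fg that drives this behaviour (defined in the caption of Table 1 as the sum of the 1st, 25th, 50th and 75th percentiles of F = 1 − NSE across catchments) can be sketched as follows; the per-catchment NSE values are hypothetical, and the interpolating percentile helper is ours:

```python
def percentile(values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a list."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

def grand_objective(nse_values):
    """Fg = F1% + F25% + F50% + F75%, with F = 1 - NSE per catchment.

    Lower Fg is better. Because no percentile above the 75th is used,
    the worst quartile of catchments does not influence the result."""
    f = [1.0 - nse for nse in nse_values]
    return sum(percentile(f, p) for p in (1, 25, 50, 75))

# Hypothetical per-catchment NSEs from a regional calibration
nses = [0.72, 0.65, 0.61, 0.55, 0.48, 0.41, 0.30, 0.12, -0.35, -1.80]
fg = grand_objective(nses)
```

Note that making the single worst catchment far worse (e.g. NSE of −5 instead of −1.8) leaves Fg unchanged, which is exactly the insensitivity to poorly performing sites discussed above.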
It should be noted that by using log-transformed flows we restricted the calibration and prediction to non-zero streamflow values, in practice using different datasets for the comparisons. Although this is not desirable (zero flows have to be predicted as well), the transformation is commonly used to assess the performance of rainfall-runoff models in simulating low flows (e.g. Pushpalatha et al., 2012). The approach of Chiew et al. (1993), which used a power transformation with exponent 0.2 to place more weight on low flows, could be used instead of the log transformation.
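The practical difference between the two low-flow transformations is that the power transform is defined at zero, whereas the log transform forces zero-flow records to be dropped. A minimal illustration with hypothetical flows:

```python
import math

flows = [0.0, 0.05, 0.4, 2.0, 15.0]  # mm/d, hypothetical, with one zero-flow day

# Power transform (exponent 0.2, after Chiew et al.): defined for q = 0
power = [q ** 0.2 for q in flows]

# Log transform: zero flows must be excluded, shrinking the calibration dataset
logs = [math.log(q) for q in flows if q > 0]
```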

4.3. Impacts of streamflow uncertainty on calibration and evaluation and parameter estimation

Fig. 6. Example of time series of observed and 100 simulated ensemble daily streamflow (Q) for the year 2003 for stations (a) 203005, (b) 204041 and (c) 204906.
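The uncertainty envelope shown in Fig. 6 is obtained by reducing the ensemble of streamflow realisations to pointwise percentiles (Q5, Q50, Q95). A minimal sketch of that reduction, assuming the ensemble is simply a list of equally plausible daily series; the synthetic ensemble below is illustrative only:

```python
import random

def ensemble_bounds(ensemble, lower=5, upper=95):
    """Pointwise lower/median/upper percentile series from an ensemble.

    ensemble: list of realisations, each a list of daily flows (equal length).
    Returns (q_lower, q_median, q_upper) time series."""
    n_days = len(ensemble[0])
    q_lo, q_med, q_hi = [], [], []
    for t in range(n_days):
        daily = sorted(r[t] for r in ensemble)
        def pct(p):  # linear-interpolation percentile of the sorted sample
            k = (len(daily) - 1) * p / 100.0
            i, j = int(k), min(int(k) + 1, len(daily) - 1)
            return daily[i] + (daily[j] - daily[i]) * (k - i)
        q_lo.append(pct(lower))
        q_med.append(pct(50))
        q_hi.append(pct(upper))
    return q_lo, q_med, q_hi

# 100 synthetic realisations of a 3-day record, with ~10% relative spread
random.seed(1)
ensemble = [[max(0.0, random.gauss(mu, 0.1 * mu)) for mu in (0.8, 5.0, 1.2)]
            for _ in range(100)]
q5, q50, q95 = ensemble_bounds(ensemble)
dq = [hi - lo for hi, lo in zip(q95, q5)]  # uncertainty width ΔQo, cf. Fig. 7
```

With relative noise, the bound width ΔQo grows with the flow itself, mirroring the roughly linear Q50–ΔQo relationship reported above.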

Implementing or modifying existing methods to develop rating curves that account for uncertainty is not only a research challenge but also a management need, particularly in instances where extrapolation is almost inevitable, such as releases of flows in dam operations and flood response (Joint Flood Taskforce, 2011; Di Baldassarre and Uhlenbrook, 2012).


Fig. 7. Example of scatter of Q50 vs. daily uncertainty (ΔQo) obtained from 100 simulated ensemble daily streamflow (Q) for stations (a) 206001, (b) 213200 and (c) 215002.
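The modified NSE used in the local calibration experiment follows the optimal weighted average idea of Croke (2007): observations with wide uncertainty bounds receive less weight. The exact operational weighting is Croke's; the sketch below uses our own illustrative assumption of weights inversely proportional to the squared bound width ΔQo, with hypothetical flows:

```python
def modified_nse(obs, sim, dq, eps=1e-6):
    """NSE with per-observation weights that penalise uncertain flows.

    dq: uncertainty bound width (Q95 - Q5) for each observation.
    Weights ~ 1 / dq^2, so wide-bound (typically high) flows count less.
    (Illustrative weighting; not the exact Croke (2007) formulation.)"""
    w = [1.0 / (d ** 2 + eps) for d in dq]
    w_sum = sum(w)
    mean_o = sum(wi * oi for wi, oi in zip(w, obs)) / w_sum  # weighted mean
    ss_res = sum(wi * (oi - si) ** 2 for wi, oi, si in zip(w, obs, sim))
    ss_tot = sum(wi * (oi - mean_o) ** 2 for wi, oi in zip(w, obs))
    return 1.0 - ss_res / ss_tot

obs = [0.5, 1.2, 8.0, 3.0]   # hypothetical daily flows
sim = [0.6, 1.0, 5.5, 3.2]
dq  = [0.1, 0.2, 3.0, 0.6]   # wide bounds on the high-flow day

m = modified_nse(obs, sim, dq)
```

Because the large error on the high-flow day is down-weighted, this score exceeds the equal-weight (standard) NSE for the same series, which is precisely why high-flow predictions can degrade when such an objective drives calibration.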

The Pagendam and Welsh (2011) method trialled in this paper to account for rating curve uncertainty was capable of reproducing streamflow very well when compared to streamflow estimated with the operational rating curves. The method can weight stage measurements depending on the time they were taken, thus indirectly accounting for non-stationarity in the stage–discharge relationship. Here, however, we only considered eight stations with stable rating curves, partly because their control was rigid (a V-notch or concrete regulation structure). The performance of the method was not tested at stations with non-stationary rating curves, thought to exist at most of the remaining stations. In addition, because the method relies on training data to derive plausible distribution functions of streamflow conditional on the observed stage, and because measurements at high flows are scarce, heteroscedasticity is addressed by using Monte-Carlo simulations. Further, the 10,000 ensemble streamflow simulations were split into batches of 100 to create an equal number of plausible streamflow realisations and to set upper and lower uncertainty bounds. This resulted in a positive linear relationship between streamflow and uncertainty at the eight stations examined (as commonly assumed; McMillan et al., 2010), strongly penalising flows in the high range, although low flows were

highly penalised in some cases (Fig. 7). A possible explanation for the strong penalisation of low flows is the often larger number of ratings from which the WNW estimates the conditional distribution function of streamflow given an observed stage (e.g. see the scatter in Fig. 2). The conditional distribution function can also be affected by hysteresis and by measurement error, which can be relatively large at low stages (Tomkins and Davidson, 2011). The computed uncertainty bounds were incorporated in the optimisation procedure in a local calibration experiment through an optimal weighted average approach (Croke, 2007). We used the simpler form of Croke's modification, which does not take into account short- and long-term serial correlation in the observed flows (Croke, 2009). We considered this simpler approach sufficient to assess the impacts of including streamflow uncertainty on parameter uncertainty and calibration/evaluation. A reduction of parameter uncertainty was achieved with the modified NSE; however, predictions of high flows using parameter sets calibrated with the modified NSE as the objective function were substantially worse than those using the standard NSE or the modified NSE with a square-root transformation of streamflow. In model calibration, the modified NSE applied very low weights to high flows because of the uncertainty of high flow estimates. The modified NSE using a square-root transformation of streamflow was the best compromise for both parameter uncertainty reduction and prediction of different streamflow regimes. This result reinforces the earlier regional calibration results.

Probabilistic approaches exist to calibrate hydrological models and make better use of the large amount of information provided by the Pagendam and Welsh (2011) method. Maximum likelihood OBJs (rather than the modified NSE) could be trialled using the probability distribution function (PDF) of the streamflow ensembles (e.g. Sorooshian and Dracup, 1980; Harmel and Smith, 2007). If the PDF follows a normal distribution, these functions are similar to the modified NSE proposed by Croke (2007), but unlike the modified NSE, maximum likelihood OBJs can also handle the asymmetric uncertainty distributions that are not uncommon in streamflow data (Warren Jin, CMIS, pers. comm.).

5. Conclusion

Fig. 8. Flow duration curves for the 100 simulations of Q50, Q95 and Q5 for station 206001.

Analysis of stage-discharge data and operational rating curves for 65 unregulated catchments in New South Wales, Australia, revealed the incidence of higher residual values at higher stage values (heteroscedasticity) in the stage–discharge relationship (the ‘rating curve’). In addition, streamflow quality codes were not always


Fig. 9. Frequency distribution plots for the four calibrated (1980–1997) parameters of GR4J in the 100 simulations for station 206001 using stdNSE, MNSE and sqrtMNSE respectively.

reliable, with a substantial number of extrapolated flows (in both the high and low flow range) remaining in the streamflow data although they were deemed of ‘good’ quality (i.e. not extrapolated). The implications of using these streamflow data, and of streamflow transformations to minimise their impact on rainfall-runoff modelling, were assessed through two experiments.

(i) The first experiment assessed the impacts of including uncertain (i.e. extrapolated) streamflow data in a regional

Table 4
Means and standard deviations of the four GR4J parameters obtained from calibrating the model using three different OBJs (stdNSE, MNSE and sqrtMNSE) with 100 streamflow realisations per calibration, giving a total of 300 parameter sets.

Parameter   stdNSE mean (stdev)    MNSE mean (stdev)    sqrtMNSE mean (stdev)
x1           77.25  (±6.53)         56.41  (±17.31)      59.62  (±5.6)
x2            5.46  (±0.32)          0.15  (±0.07)        1.64  (±0.3)
x3          156.85  (±11.51)       195.33  (±4.18)      185.88  (±20.03)
x4            1.089 (±0.037)         0.99  (±0.01)        1.062 (±0.02)

calibration/prediction experiment using a process-based landscape water balance model, in which a single set of parameters was calibrated across 33 catchments (with different climatic and biophysical characteristics) using either censored (i.e. extrapolated flows removed) or uncensored streamflow data, with predictions evaluated (using the Nash-Sutcliffe efficiency, NSE) in the remaining 32 catchments. We repeated the experiment using two streamflow transformations (logarithmic and square-root) in an attempt to reduce heteroscedasticity. The experiment demonstrated that the impact of censoring extrapolated flows was small in terms of model calibration and prediction. The optimised parameter sets obtained by calibrating against uncensored data generally outperformed those obtained by calibrating against censored data, in terms of prediction against both censored and uncensored datasets. Censoring left sites dominated by high flows (presumably little affected by censoring) that are harder to fit in the censored calibration. In addition, our results support previous findings that a square-root transformation of flows is a sensible compromise for achieving good performance for both high and low flows.


(ii) The second experiment investigated the impact of rating curve uncertainty on the NSE and parameter estimation in a calibration/evaluation experiment using a parsimonious conceptual model. A Monte-Carlo approach and the nonparametric Weighted Nadaraya-Watson (WNW) estimator were used to derive rating curve uncertainty bounds for 100 streamflow realisations. These were later used in split-sample calibration/evaluation at a single station using a standard NSE OBJ and a modified NSE OBJ that penalised uncertain flows. Again, we repeated the experiment using two streamflow transformations (logarithmic and square-root). By commensurately penalising uncertain flows, parameter uncertainty decreased. This local calibration experiment also showed that the modified NSE using a square-root transformation of streamflow was a sensible compromise for both parameter uncertainty reduction and prediction of different streamflow regimes.

Methods exist both to calibrate hydrological models and to predict streamflow whilst including various sources of uncertainty. A comprehensive streamflow database that includes estimates of uncertainty, in conjunction with properly constrained regional calibration schemes, has the potential to enhance predictions in ungauged catchments and improve the predictive capabilities of operational systems (e.g. streamflow forecasting services, flood monitoring and dam operations).

Fig. 10. Probability of non-exceedance plots for 100 predictions using different NSEs as goodness-of-fit metrics: (a) stdNSE, (b) sqrtNSE and (c) logNSE.

Acknowledgements

This work is part of the water information research and development alliance between the Bureau of Meteorology and CSIRO's Water for a Healthy Country Flagship. Kerry Tomkins, Q. J. Wang, Ming Li and David Robertson from the CSIRO Land and Water Flagship, Warren Jin from CSIRO Mathematics, Informatics and Statistics (CMIS) and Sri Srikanthan from the Bureau of Meteorology are also thanked for providing valuable comments on the research. We would also like to acknowledge the valuable comments of three anonymous reviewers.

References

Aitken, A.C., 1935. On least squares and linear combination of observations. Proc. R. Soc. Edinb. 55, 42–48.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., Marsili-Libelli, S., Newham, L.T.H., Norton, J.P., Perrin, C., Pierce, S.A., Robson, B., Seppelt, R., Voinov, A.A., Fath, B.D., Andreassian, V., 2013. Characterising performance of environmental models. Environ. Model. Softw. 40, 1–20.
Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. J. R. Stat. Soc. Ser. B Stat. Methodol. 26 (2), 211–252.
Cai, Z.W., 2001. Weighted Nadaraya-Watson regression estimation. Stat. Probab. Lett. 51 (3), 307–318.
Chiew, F.H.S., McMahon, T.A., 1993. Data and rainfall-runoff modeling in Australia. In: Hydrology and Water Resources Symposium – Towards the 21st Century: Preprints of Papers, pp. 305–316.
Croke, B.F.W., 2007. The role of uncertainty in design of objective functions. In: MODSIM 2007: International Congress on Modelling and Simulation: Land, Water and Environmental Management: Integrated Systems for Sustainability, pp. 2541–2547.
Croke, B.F.W., 2009. Representing uncertainty in objective functions: extension to include the influence of serial correlation. In: 18th World IMACS Congress and MODSIM09 International Congress on Modelling and Simulation: Interfacing Modelling and Simulation with Mathematical and Computational Sciences, pp. 3372–3378.
Di Baldassarre, G., Uhlenbrook, S., 2012. Is the current flood of data enough? A treatise on research needs for the improvement of flood modelling. Hydrol. Process. 26 (1), 153–158.
Domeneghetti, A., Castellarin, A., Brath, A., 2012. Assessing rating-curve uncertainty and its effects on hydraulic model calibration. Hydrol. Earth Syst. Sci. Discuss. 8, 10501–10533.
Hall, P., Wolff, R.C.L., Yao, Q.W., 1999. Methods for estimating a conditional distribution function. J. Am. Stat. Assoc. 94 (445), 154–163.
Harmel, R.D., Smith, P.K., 2007. Consideration of measurement uncertainty in the evaluation of goodness-of-fit in hydrologic and water quality modeling. J. Hydrol. 337 (3–4), 326–336.
Hayfield, T., Racine, J., 2008. Nonparametric econometrics: the np package. J. Stat. Softw. 27 (5).


Hrafnkelsson, B., Ingimarsson, K.M., Gardarsson, S.M., Snorrason, A., 2012. Modeling discharge rating curves with Bayesian B-splines. Stoch. Environ. Res. Risk Assess. 26 (1), 1–20.
Jeffrey, S.J., Carter, J.O., Moodie, K.B., Beswick, A.R., 2001. Using spatial interpolation to construct a comprehensive archive of Australian climate data. Environ. Model. Softw. 16 (4), 309–330.
Joint Flood Taskforce, 2011. Joint Flood Taskforce Report. Queensland Flood Commission of Inquiry. Available at: http://www.floodcommission.qld.gov.au/ (last accessed 1.10.14).
Klemes, V., 1986. Operational testing of hydrological simulation models. Hydrol. Sci. J. 31 (1), 13–24.
Le Moine, N., Andreassian, V., Perrin, C., Michel, C., 2007. How can rainfall-runoff models handle intercatchment groundwater flows? Theoretical study based on 1040 French catchments. Water Resour. Res. 43 (6). WR005608.
Lerat, J., et al., 2012. Do internal flow measurements improve the calibration of rainfall-runoff models? Water Resour. Res. 48 (2). WR010179.
McMillan, H., Freer, J., Pappenberger, F., Krueger, T., Clark, M., 2010. Impacts of uncertain river flow data on rainfall-runoff model calibration and discharge predictions. Hydrol. Process. 24 (10), 1270–1284.
Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models, part 1: a discussion of principles. J. Hydrol. 10, 282–290.
Nathanson, M., et al., 2012. Modeling rating curves using remotely sensed LiDAR data. Hydrol. Process. 26 (9), 1427–1434.
NoW, 2011. PINNEENA 9. Available at: http://waterinfo.nsw.gov.au/pinneena/pinneena-cm9.pdf (last accessed 1.10.14).
Oudin, L., Andreassian, V., Loumagne, C., Michel, C., 2006. How informative is land-cover for the regionalization of the GR4J rainfall-runoff model? Lessons of a downward approach. In: Large Sample Basin Experiments for Hydrological Model Parameterization: Results of the Model Parameter Experiment, pp. 246–255.
Pagendam, D.E., Welsh, A.H., 2011. Statistical estimation of total discharge volumes. In: MODSIM2011, 19th International Congress on Modelling and Simulation. Modelling and Simulation Society of Australia and New Zealand, December 2011, pp. 3525–3531.
Parajka, J., Bloeschl, G., Merz, R., 2007. Regional calibration of catchment models: potential for ungauged catchments. Water Resour. Res. 43 (6).
Perrin, C., Michel, C., Andreassian, V., 2003. Improvement of a parsimonious model for streamflow simulation. J. Hydrol. 279 (1–4), 275–289.
Petersen-Øverleir, A., 2004. Accounting for heteroscedasticity in rating curve estimates. J. Hydrol. 292 (1–4), 173–181.
Petersen-Øverleir, A., Soot, A., Reitan, T., 2009. Bayesian rating curve inference as a streamflow data quality assessment tool. Water Resour. Manag. 23 (9), 1835–1842.
Pushpalatha, R., Perrin, C., Le Moine, N., Mathevet, T., Andreassian, V., 2012. A downward structural sensitivity analysis of hydrological models to improve low-flow simulation. J. Hydrol. 411 (1–2), 66–76.

Reitan, T., Petersen-Øverleir, A., 2011. Dynamic rating curve assessment in unstable rivers using Ornstein-Uhlenbeck processes. Water Resour. Res. 47. WR009504.
Sorooshian, S., Dracup, J.A., 1980. Stochastic parameter estimation procedures for hydrologic rainfall-runoff models: correlated and heteroscedastic error cases. Water Resour. Res. 16 (2). WR016i002p00430.
Tomkins, K.M., Davidson, A.J., 2011. Propagation of input errors: implications for model simulations and risk analysis. In: Bloschl, G., et al. (Eds.), Risk in Water Resources Management. IAHS Publication, pp. 101–106.
van Dijk, A.I.J.M., 2010. AWRA Technical Report 3. Landscape Model (Version 0.5) Technical Description. Canberra. Available at: http://www.clw.csiro.au/publications/waterforahealthycountry/2010/wfhc-aus-water-resourcesassessment-system.pdf (last accessed 1.10.14).
van Dijk, A.I.J.M., Renzullo, L.J., 2011. Water resource monitoring systems and the role of satellite observations. Hydrol. Earth Syst. Sci. 15 (1), 39–55.
van Dijk, A.I.J.M., Warren, G.A., 2010. AWRA Technical Report 4. Evaluation Against Observations. Canberra. Available at: http://www.clw.csiro.au/publications/waterforahealthycountry/2010/wfhc-awras-evaluation-against-observations.pdf (last accessed 1.10.14).
Vaze, J., et al., 2011. Rainfall-runoff modelling across southeast Australia: datasets, models and results. Aust. J. Water Resour. 14 (2), 101–116.
Vaze, J., Teng, J., 2011. Future climate and runoff projections across New South Wales, Australia – results and practical applications. Hydrol. Process. 25 (1), 18–35. http://dx.doi.org/10.1002/hyp.7812.
Vaze, J., Viney, N., Stenson, M., Renzullo, L., van Dijk, A., Dutta, D., Crosbie, R., Lerat, J., Penton, D., Vleeshouwer, J., Peeters, L., Teng, J., Kim, S., Hughes, J., Dawes, W., Zhang, Y., Leighton, B., Frost, A., Elmahdi, A., Smith, A., Daamen, C., Argent, R., 2013. The Australian water resource assessment system (AWRA). In: Proceedings of the 20th International Congress on Modelling and Simulation, Adelaide, Australia, December 2013, pp. 3015–3021.
Viney, N.R., van Dijk, A.I.J.M., Vaze, J., 2011. Comparison of models and methods for estimating streamflow across Australia. In: WIRADA Science Symposium, 1–5 August 2011, Melbourne, Australia.
Westerberg, I., Guerrero, J.L., Seibert, J., Beven, K.J., Halldin, S., 2011. Stage-discharge uncertainty derived with a non-stationary rating curve in the Choluteca River, Honduras. Hydrol. Process. 25 (4), 603–613.
Yu, K.M., Jones, M.C., 1998. Local linear quantile regression. J. Am. Stat. Assoc. 93 (441), 228–237.
Zhang, Y.Q., Viney, N.R., Chen, Y., Li, H.Y., 2013. Collation of Australian Modeller's Streamflow Dataset for 780 Unregulated Australian Catchments. CSIRO: Water for a Healthy Country National Research Flagship, p. 117.
Zhang, Y.Q., Viney, N.R., Chiew, F.H.S., van Dijk, A.I.J.M., Liu, Y., 2011. Improving hydrological and vegetation modelling using regional model calibration schemes together with remote sensing data. In: MODSIM2011, 19th International Congress on Modelling and Simulation. Modelling and Simulation Society of Australia and New Zealand, December 2011, pp. 3448–3454.