
Tuesday, October 31, 2017

Meledrio, or a simple reflection on Hydrological modelling - Part VI - A little about calibration

The normal calibration strategy is to split the data we want to reproduce into two sets:

  • one for the calibration phase
  • one for the "validation" phase
Let's assume that we have an automatic calibrator. It usually:
  • generates a set of model parameters,
  • estimates the discharges with the rainfall-runoff hydrological model and the given set of parameters,
  • compares what is computed with what is measured by using a goodness-of-fit indicator,
  • keeps the set of parameters that gives the best performance,
  • repeats the operation a huge number of times (and uses some heuristics to search for the best set overall).

This set of parameters is the one used for "forecasting", and it is then run against the validation set to check its performance.
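As a minimal sketch of such an automatic calibrator, here is a plain random-search loop (the `run_model(rainfall, params)` rainfall-runoff function and the parameter bounds are hypothetical placeholders, and the Nash-Sutcliffe efficiency stands in for whatever goodness-of-fit indicator one prefers; real calibrators use smarter search heuristics):

import numpy as np

def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 means no better than the mean."""
    o, s = np.asarray(observed, float), np.asarray(simulated, float)
    return 1.0 - np.sum((o - s) ** 2) / np.sum((o - o.mean()) ** 2)

def calibrate(run_model, rainfall, observed_discharge, bounds, n_trials=10_000, seed=42):
    """Random-search calibration: sample parameter sets, score each, keep the best."""
    rng = np.random.default_rng(seed)
    best_params, best_score = None, -np.inf
    for _ in range(n_trials):
        # generate a candidate parameter set within the given bounds
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        # run the rainfall-runoff model with this candidate set
        simulated = run_model(rainfall, params)
        # compare what is computed with what is measured
        score = nash_sutcliffe(observed_discharge, simulated)
        if score > best_score:  # keep the best-performing set so far
            best_params, best_score = params, score
    return best_params, best_score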
However, my experience (with my students, who usually perform it) is that the best parameter set in the calibration procedure is not usually the best in the validation procedure. So I suggest, at least as a trial and for further investigation, to:

  • separate the initial data set into 3 parts (one for a first calibration, one for selection, and one for validation);
  • select the 1% (or x%, where x is left to your decision) best-performing parameter sets of the calibration phase (the behavioural set), then further sieve the 1% best performing of the selection phase (one over 10^4 of the original candidates);
  • use this one-per-ten-thousand set in the validation phase.
The hypothesis to test is that this three-step way of calibrating usually returns better performance in validation than the original two-step one.
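A minimal sketch of the proposed sieving, assuming one already has a list of candidate parameter sets and a hypothetical score(params, data) function (e.g. the Nash-Sutcliffe of the previous sketch) that runs the model on a data subset and returns the goodness of fit:

def three_step_selection(param_sets, score, calib_data, select_data, valid_data, x=0.01):
    """Sieve candidate parameter sets in three steps:
    calibration -> behavioural set -> selection -> survivors checked in validation."""
    # 1. keep the best x% on the calibration subset (the "behavioural" set)
    ranked = sorted(param_sets, key=lambda p: score(p, calib_data), reverse=True)
    behavioural = ranked[: max(1, int(x * len(ranked)))]
    # 2. among those, keep the best x% on the selection subset
    ranked = sorted(behavioural, key=lambda p: score(p, select_data), reverse=True)
    sieved = ranked[: max(1, int(x * len(ranked)))]
    # 3. report how the surviving set(s) perform on the validation subset
    return [(p, score(p, valid_data)) for p in sieved]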

Tuesday, July 26, 2011

Quantifying uncertainty

I saw on EOS an announcement regarding a new website, QUEST, or Quantifying Uncertainty in Ecosystem Studies. Hydrology, in fact, needs an effort to quantify uncertainties, but this is usually ignored.

I have been circling around this issue for quite a few years, and probably next year I will try to dig a little deeper into the literature.

The sources of uncertainty in hydrological modeling are at least three:

- the input data (which can derive from chaotic dynamics)
- the approximations contained in the equations
- the parameterizations of constants, which can be heterogeneous (highly variable, if not random, in space)


When thinking of inputs, the paradigm is rainfall. It is usually estimated at just a few points in the domain, with large errors. Then the local estimates need to be interpolated and extrapolated in space, introducing further errors. Rainfall itself is very irregular in time and space at all scales, which means that you can capture just the statistics of its behavior (and this leaves you out in the cold with further errors).
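To make that interpolation step concrete, here is a minimal sketch of inverse-distance weighting, one of the simplest schemes for spreading a few gauge measurements over the domain (the gauge coordinates, values, and power exponent are illustrative assumptions; kriging or other geostatistical methods are more common in practice and carry their own error models):

import numpy as np

def idw_interpolate(gauge_xy, gauge_values, target_xy, power=2.0):
    """Inverse-distance-weighted rainfall estimate at an ungauged location."""
    xy = np.asarray(gauge_xy, float)            # gauge coordinates, shape (n, 2)
    v = np.asarray(gauge_values, float)         # rainfall measured at the gauges
    d = np.linalg.norm(xy - np.asarray(target_xy, float), axis=1)
    if np.any(d == 0):                          # target coincides with a gauge
        return float(v[np.argmin(d)])
    w = 1.0 / d ** power                        # closer gauges weigh more
    return float(np.sum(w * v) / np.sum(w))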

When looking at flows, i.e. at their mathematical description in equations, one has to think that they are eminently a thermodynamic product, where some fluctuations need to be neglected and described with suitably averaged properties, which may not be possible (or meaningful).
Besides, the system described is usually made up of many non-linearly connected subsystems, and in practical implementations the nonlinearities and the feedbacks are simplified or even neglected. Moreover, equations need to be discretized on a grid, which itself introduces approximations.

Finally, in saying that some processes are governed by heterogeneities, we also state that the information they contain is algorithmically incompressible (e.g. Chaitin), and there is no way to represent it in short strings. The latter syndrome is the one well described by Borges in "On Exactitude in Science", but also in Noam Chomsky's book, Rules and Representations, where at page 8 he cites Steven Weinberg and goes so deep as to ask whether we can really know reality, and what that means.

In any case, the hydrological community started to take care of this a long time ago (here is a recent abstract with, hopefully, a good literature review, and here, the work by Beven, Gupta and Wagener), but it worked especially on the assessment of parameter uncertainty (even if GLUE claims a more general validity). A recent assessment is also in this work by Goetzinger and Bardossy, which can provide access to further concepts and bibliography.

However, many hydrological models produce just time series, and therefore the uncertainty assessment reduces to understanding (and sometimes comparing) a pair of time series: the measured series and the modeled series. Good hydrological models are those that reproduce the measured series with good agreement. This is quantified, often but not always, with the use of indices. The mean square error or its root, the Nash-Sutcliffe efficiency, the minimax objective function, the average absolute percentage error, the index of agreement, and the coefficient of determination are a few of them.
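For instance, a minimal sketch computing a few of these indices for a pair of measured and modeled series (the formulas follow the usual textbook definitions; the percentage error assumes no zero discharges in the measured record):

import numpy as np

def goodness_of_fit(observed, simulated):
    """A few of the indices mentioned above, computed between two time series."""
    o, s = np.asarray(observed, float), np.asarray(simulated, float)
    rmse = np.sqrt(np.mean((o - s) ** 2))                           # root mean square error
    nse = 1 - np.sum((o - s) ** 2) / np.sum((o - o.mean()) ** 2)    # Nash-Sutcliffe efficiency
    # Willmott's index of agreement
    d = 1 - np.sum((o - s) ** 2) / np.sum((np.abs(s - o.mean()) + np.abs(o - o.mean())) ** 2)
    mape = np.mean(np.abs((o - s) / o)) * 100                       # average absolute percentage error
    return {"RMSE": rmse, "NSE": nse, "d": d, "MAPE": mape}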

This is certainly a narrow perspective from which to look at the topic. Both the measured and the simulated series are, in fact, affected by errors, and therefore one should not compare the two series directly, but the time series including their errors. I believe that this would amount to adopting a Bayesian perspective on the problem (e.g. D'Agostini 2003 - Bayesian Reasoning in Data Analysis: A Critical Introduction) and would turn into data assimilation (e.g. Kalnay, Atmospheric Modeling, Data Assimilation and Predictability, 2003), with the defect that, at this point, data and models are so entangled that it would be difficult to extricate them (but not impossible, I guess).

We can also observe that a model usually produces more than a single time series, so "a prediction" becomes "predictions", and the uncertainty spreads across all of them.

Besides, we did not mention spatial patterns: before we make claims about their uncertainty, we have to recognize that we should quantify them. How can we do that? And, by extension, are we able to identify spatio-temporal patterns? And therefore, when can we decide that two of these patterns are the same (neglecting noise)? Indicators of statistical equality would probably give miserable scores if applied to two- or three-dimensional fields.

Does anyone have ideas?