Wednesday, January 29, 2014

A geographic information model for hydrology - from the Global Runoff Data Center

I quote directly from an announcement of the Centre:

"The Global Runoff Data Centre (GRDC) of the World Meteorological Organization (WMO) would like to inform you about the publication of GRDC-Report 43r1: 

Dornblut, I. and R. Atkinson: HY_Features : a geographic information model for the hydrology domain. Concepts of the HY_Features common hydrologic feature model. - GRDC - Report; 43r1. - Koblenz: Federal Institute of Hydrology (BfG), 2013.


Abstract: 
"Hydrologic features are abstractions of complex real world hydrologic processes, and relate to everyday concepts such as rivers, lakes, catchments etc. Different models of hydrologic processes, and different scales of detail, lead to a variety of information models to describe hydrologic features, and to different and mostly incompatible sets of feature identifiers. 

This document describes the concepts of the HY_FEATURES common hydrologic feature model, a conceptual model for hydrologic features independent from approximate geometric representations at different scales. This model allows common reference to both the specific semantics and individual identifiers of hydrologic features across scientific sub-disciplines in hydrology. 

The HY_FEATURES model is intended to form the basis for standard practices for referencing hydrologic features. These practices would be policy under the auspices of the WMO Commission for Hydrology (WMO-CHy) and recommended for general use in the wider community.  HY_FEATURES is designed as a set of interrelated Application Schemas using ISO 19103 Conceptual Schema Language and ISO 19109 General Feature Model. It is factored into relatively simple components that can be reviewed, tested and extended independently. "

The GRDC-Report 43r1 is available at http://doi.bafg.de/BfG/2013/GRDC_Report_43,1.pdf and at the GRDC Website http://www.bafg.de/GRDC/EN/02_srvcs/24_rprtsrs/reports_node.html. Questions regarding this document should be directed to the contributing authors.


The GRDC is acting under the auspices of the World Meteorological Organization (WMO) and is supported by WMO Resolutions 21 (Cg XII, 1995) and 25 (Cg XIII, 1999). Its primary task is to maintain, extend and promote a global database on river discharge aimed at supporting international organizations and programs by serving essential data and products to the international hydrologic and climate research and assessment community in their endeavour to better understand the Earth system. The GRDC was established at the German Federal Institute of Hydrology (BfG) in 1988. The National Hydrological and Meteorological Services of the 191 WMO Member states and territories are the principal data providers for the GRDC."

Monday, January 20, 2014

Luca's references on soil moisture spatial variability and remote sensing

In trying to extract the publishable results from Ageel Bushara's Ph.D. thesis (a good piece of work indeed, an effort weakened by my ignorance of remote sensing), I started a conversation with Luca Brocca, one of the most prominent young Italian hydrologists. As befits a good conversation, Luca suggested some initial readings.
Here they are:

REFERENCES

1) Teuling et al. (2005), GRL: they obtained good results comparing the spatial variability of the data, but their model does not include lateral flow of water.

2) Brocca et al. (2013), JoH: as for Teuling, good results in the estimation of the spatial variability (however, the model is calibrated at each single point). Here we had a different scope, which was to obtain a long soil moisture time series.

3) Walker et al. (2002), HYP: soil moisture estimates from SAR compared with ground data. IMHO, not very good results (in Australia).

4) Li and Rodell (2013), HESS: they find that the spatial variability of in situ data (SCAN) is very different from the modelled one (Noah land surface model) and also different from the one obtained by satellite (AMSR-E, a passive microwave sensor, 25 km resolution). The study covers the contiguous USA (CONUS).

Learn Statistics and Probability!

In one of my previous posts, I talked about the necessity of using (and therefore learning) statistics. As said in "What is Statistics?", anyone working with data is using statistics, which simplifies the approach a lot. Actually, I arrived at statistics mostly from the teaching side. As a scientist, indeed, I often overlooked statistics. Even if statistics appears in part of my research, it would be a gross exaggeration to say that I approached it consciously (I kind of took it for granted). From time to time I used (ripped off) methods to fit data, but I never had a systematic approach to it.
From the teaching side, instead, I had to communicate some concepts to students, and thus I tried to be more methodical. My efforts of synthesis produced my slides on probability and on statistics. In fact, I resolved the dualism between the two by saying that statistics has to do with reality, while probability is an axiomatic theory, which leaves out the identification of what probability itself is (De Finetti teaches). What people usually do is search for a "model" among the ones available in the "models market" (and therefore there is a prior phase where these models are "invented" and analysed theoretically). The models, in turn, are distribution functions, regression functions, or whatever function(al) is necessary. In a second phase, you have to see how the model of your choice adapts (fits!) to real life. In this adaptation you rely on statistics, statistical methods, and Bayes' theorem. You can interpret the procedure according to a frequentist approach or through a Bayesian one. The latter is becoming dominant in the field I frequent, but probably has to escape some inductive traps (i.e. the idea that knowledge can be obtained from induction alone, while the scientific method has a hypothetical-deductive structure: look at Bayes here). In fact, there is also a third approach, where "the machines" find the model for you (see Breiman, 2001 and the discussions therein ^1).
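To fix ideas about this second phase, here is a minimal sketch in R (the tool recommended at the end of this post) of the frequentist route: choose a model from the "models market" (here a gamma distribution) and fit it to data. The data are synthetic and all numbers are invented; fitdistr() comes from the MASS package that ships with R.

```r
# Choose a "model" (a gamma distribution) and fit it to data by
# maximum likelihood - the frequentist route of the adaptation phase.
library(MASS)                              # ships with R; provides fitdistr()

set.seed(42)
q <- rgamma(200, shape = 2, rate = 0.5)    # synthetic "discharge" sample

fit <- fitdistr(q, densfun = "gamma")      # maximum-likelihood estimates
print(fit)                                 # shape and rate, with std. errors

# How does the chosen model adapt (fit!) to "real life"? Compare the
# empirical and fitted cumulative distributions.
ks.test(q, "pgamma",
        shape = fit$estimate["shape"], rate = fit$estimate["rate"])
```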

While the concept of distribution always remains under the hood of any approach (maybe less evident in Machine Learning) and can probably be used as a "connecting principle", in my ignorant perception of the matter the whole picture remains a little obscure.

The fact is that the principles of statistics (and of probability) are taught abstractly, thinking of a map A -> [0,1] where A is some undefined set; most of the time, when thinking of applications, we map some subspace of R^n, the real numbers, into [0,1], but the object of investigation is a single value of a single quantity (let us say a unique measure - do not charge the word here with its mathematical significance - of a quantity, for instance the annual maximum discharge at a single station). Using the concepts of hydrology, this would cover a zero-dimensional domain (or modelling).

In fact, hydrology, and reality, are perceived as multidimensional, so important applications and important measures vary, for instance, in time. This fact confuses one's ideas, since in principle we have to analyse many quantities (as many as the instants of time), so our application is no longer, at the very general stage, the study of a single quantity but of many quantities. However, either for practical or for physical reasons, we often conceive these quantities as manifestations of a single one (as realisations of the same hidden probabilistic structure repeated many times, not necessarily with any relation between them).
For good or for bad, this is usually ignored theoretically, while, in practice, it leads to a separate subfield, of which time series analysis is an example. Fitting one variable against another (or others) also falls in the same dimensional domain. It appears in books smoothly, as if it were natural, but it always left me with some discomfort. Only in the statistical book by von Storch and Zwiers does my dimensional distinction appear (especially looking at the book's index). In classical books, this passage is actually mediated by looking at random walks, Markov chains, martingales and similar topics. The key to this passage of dimensionality is the introduction of some correlation - in the common-language sense, but also in the probabilistic sense - that ties one datum to another (the subsequent one), as in the sketch below.
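To make the idea concrete, a minimal sketch in R with invented numbers: an AR(1) process, the simplest structure in which each datum is tied to the previous one by a lag-1 correlation.

```r
# The "passage of dimensionality" in miniature: the same probabilistic
# structure repeated in time, with subsequent data tied by correlation.
set.seed(1)
n   <- 500
rho <- 0.8                      # lag-1 correlation (arbitrary choice)
x   <- numeric(n)
for (t in 2:n) x[t] <- rho * x[t - 1] + rnorm(1)

acf(x, lag.max = 20)            # empirical autocorrelation decays ~ rho^k
```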

A further, and consequent, passage happens when one moves to analyse not a line of events but a space of events, with the further complication that in multiple dimensions one cannot even exploit the natural ordering of 1-D problems.
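A hedged sketch of what correlation without ordering can look like: a Gaussian random field on a small grid, where the ties between data are expressed through distance rather than through a "subsequent" datum. The grid size, the exponential covariance model and its range are arbitrary choices for illustration.

```r
# In 2-D there is no natural ordering: correlation is a function of
# distance. Sample one realization of a Gaussian random field via the
# Cholesky factor of an exponential covariance matrix.
set.seed(2)
grd <- expand.grid(x = 1:20, y = 1:20)     # 400 locations on a grid
d   <- as.matrix(dist(grd))                # pairwise distances
C   <- exp(-d / 5)                         # exponential covariance, range 5
z   <- t(chol(C)) %*% rnorm(nrow(grd))     # one correlated realization
image(matrix(z, 20, 20))                   # spatially coherent patches
```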

Nowadays, in fact, patterns in two- or multi-dimensional spaces are discovered by machines (at least if I properly understand the concept of Machine Learning), with, again, some danger of falling into an excessive inductivism.
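A toy example of a machine finding the model, in the spirit of Breiman's "algorithmic" culture: a regression tree (the rpart package ships with R) recovers a pattern from data without any distribution being assumed a priori. The data and the underlying relation are invented.

```r
# Let "the machine" find the model: a regression tree grown from data.
library(rpart)                  # recommended package, ships with R
set.seed(3)
df   <- data.frame(x1 = runif(300), x2 = runif(300))
df$y <- ifelse(df$x1 > 0.5, 2, 0) + df$x2 + rnorm(300, sd = 0.2)

tree <- rpart(y ~ x1 + x2, data = df)   # no distributional model chosen
print(tree)                             # the split near x1 = 0.5 is found
```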

Coming to the point of how to learn this stuff: I would start from a book on probability where the axiomatic structure of the field is made clear. In my formation, this role was played by the old classic, Feller's (1968) book (let us say the first two chapters, which are now reproduced in almost all textbooks; then the following chapter, but skipping the * sections; possibly section XI concludes this first, 0-D, part, where waiting times appear: they are not actually related to "time" but to an "ensemble" of trials, as in the example below). Looking for on-line resources, I also found the book by Grinstead and Snell that covers more or less the same topics.
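A one-minute illustration of that remark, with invented numbers: the waiting time for the first success in Bernoulli trials is a count over an ensemble of trials, not a physical time.

```r
# Waiting times as in Feller: trials until the first success of a
# Bernoulli process, i.e. a geometric distribution over an ensemble.
set.seed(4)
p     <- 0.2                          # success probability (invented)
waits <- rgeom(10000, prob = p) + 1   # rgeom counts failures, so add 1
mean(waits)                           # close to the theoretical 1/p = 5
```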

The probabilistic part should be complemented, at this point, by some statistics. Most of the good statistical books simply redo all of probability theory from a more practical point of view before coming to their specific subject, how to infer from data their distribution (if any exists); those parts can then be skipped or just browsed. As J. Franklin says: "Mathematicians, pure and applied, think there is something weirdly different about statistics. They are right. It is not part of combinatorics or measure theory, but an alien science with its own modes of thinking. Inference is essential to it, so it is, as Jaynes says, more a form of (non-deductive) logic."

Being prepared for controversies, a couple of good books for learning statistics with a climate and/or hydrologic orientation are the one by Hans von Storch and Francis W. Zwiers and the (expensive) one by Kottegoda and Rosso. These have the advantage of using hydro-meteorological datasets in their examples. In both, after the first chapters, the subsequent ones follow a perspective where the goal is to choose "a model, a distribution or process that is believed from tradition or intuition to be appropriate to the class of problems in question", and subsequently to "statistically validate" it, using data to estimate the parameters of the model. "That picture, standardised by Fisher and Neyman in the 1930s, has proved in many ways remarkably serviceable. It is especially reasonable where it is known that the data are generated by a physical process that conforms to the model. As a gateway to these mysteries, the combinatorics of dice and coins are recommended; the energetic youth who invest heavily in the calculation of relative frequencies will be inclined to protect their investment through faith in the frequentist philosophy that probabilities are all really relative frequencies." (also from J. Franklin, 2005).

My favorite reading on many of these statistical computing techniques is Cosma Shalizi's notes, which certainly present the topics in an original way that cannot be found elsewhere. Shalizi's notes, as well as James et al. (2013), also have the advantage of using R as a computational tool and of presenting some modern topics, like a chapter on Machine Learning. Hastie et al. (2005) is instead a more advanced treatment of the same topics. These books are actually more decidedly oriented to statistical modelling, as is the free (and simple) online book by Hyndman and Athanasopoulos (2013) (also using R).

Kottegoda and Rosso's book also hosts a chapter on Bayesian statistics, which is the other way to see statistical inference. A brief introduction to the Bayesian mystery is Edward Campbell's short technical report, which can be found here. Possibly the longest one, presenting a different approach to probability, is the posthumous masterpiece by Jaynes (2003), which is probably a fundamental reading on the topic.

However, my personal understanding of the Bayesian methodology gained some consistency only after reading G. D'Agostini's (2003) book. Actually, D'Agostini can be described as a Bayesian evangelist, but his arguments, even if some of his examples from high-energy particle physics remain unclear to me, won me over to a mild conversion.
As a matter of fact, my real understanding of the Bayesian approach is still poor: not because I did not understand the theory, but because between the theory and its application there is a gap which I still have not filled. (Practice it!)
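Taking my own advice, here is a bare-bones practice run: a Bayesian update by grid approximation, in which Bayes' theorem is applied numerically. The data (7 successes in 20 trials) and the flat prior are invented for illustration.

```r
# Infer the success probability of a Bernoulli process, the Bayesian way.
p     <- seq(0, 1, length.out = 1000)    # grid over the parameter
prior <- rep(1, length(p))               # flat prior (a debatable choice)
lik   <- dbinom(7, size = 20, prob = p)  # likelihood of the data
post  <- prior * lik
post  <- post / sum(post)                # normalize: Bayes' theorem

p[which.max(post)]                       # posterior mode, about 7/20
plot(p, post, type = "l")                # the whole posterior, not one number
```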

Looking at all of these contributions, they sum up to thousands of pages. Possibly many of these pages are repetitions of the same concepts, sometimes from a slightly different point of view. To the reader the choice of what to do.


A last note regards how to make the calculations. R is certainly a good choice for this, a choice some of the cited books' authors already made, and the support available for it is really large and growing.

Notes

^1 - The paper is also remarkable for the answer by Sir David Cox (who also has his own introductory and conceptual book). Cox, besides being a prominent British statistician, has quite a career in hydrology, especially considering his work on rainfall and eco-hydrology together with Ignacio Rodriguez-Iturbe.

References (with some additions to the text)

Berliner, L. M., & Royle, J. A. (1998). Bayesian Methods in the Atmospheric Sciences, 6, 1–17.

Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. doi:10.1214/ss/1009213726

Campbell, E. P. (2004). An Introduction to Physical-Statistical Modelling Using Bayesian Methods (pp. 1–18).

Cox, D. R., & Donnelly, C. A. (2011). Principles of applied statistics. Cambridge University Press.


Durrett, R. (2010). Probability: theory and examples.


Feller, W. (2007). The fundamental limit theorems in Probability, 1–33.

Fienberg, S. E. (2014). What Is Statistics? Annual Review of Statistics and Its Application, 1(1), 1–9. doi:10.1146/annurev-statistics-022513-115703


Gelman, A. (2003). A Bayesian Formulation of Exploratory Data Analysis and Goodness-of-fit Testing. International Statistical Review. doi:10.1111/j.1751-5823.2003.tb00203.x

Grinstead, C. M., & Snell, J. L. (2007). Introduction to Probability (pp. 1–520).

Guttorp, P. (2014). Statistics and Climate. Annual Review of Statistics and Its Application, 1(1), 87–101. doi:10.1146/annurev-statistics-022513-115648

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 103). Springer. doi:10.1007/978-1-4614-7138-7


Kharin, S. (2008a, May 19). Statistical concepts in climate research. Slides I.
Kharin, S. (2008b, May 19). Classical Hypothesis Testing. Slides II.
Kharin, S. (2008c). Climate Change Detection and Attribution: Bayesian view, 1–35. Slides III.


Madigan, D., Stang, P. E., Berlin, J. A., Schuemie, M., Overhage, J. M., Suchard, M. A., et al. (2013). A Systematic Statistical Approach to Evaluating Evidence from Observational Studies. Annual Review of Statistics and Its Application, 1(1). doi:10.1146/annurev-statistics-022513-115645

Shalizi, C. R. (2014). Advanced Data Analysis from an Elementary Point of View (pp. 1–584).

von Storch, H., & Zwiers, F. W. (2003). Statistical analysis in climate research. Cambridge University Press.


Zwiers, F. W., & von Storch, H. (2004). On the role of statistics in climate research. International Journal of Climatology, 24(6), 665–680. doi:10.1002/joc.1027

Thursday, January 9, 2014

Why write and publish scientific papers in hydrology

Luca Brocca (also here) brought to my attention a series of slides by Demetris Koutsoyiannis (see also here) about the challenges faced in publishing a paper. I dedicated one of my first posts, and others, to this topic, but a further reading can be interesting, and this one is even enjoyable.

From Demetris' slides (here) I cite a few passages from E. R. Schulman's humorous paper:

"Scientific papers … are an important—though poorly understood—method of publication. They are important because without  them scientists cannot get money from the government or from universities. They are poorly  understood because they are not written very well.
… "

and three statements which could be interpreted as truisms, but they are not (with my comments):
  • Reading other good papers is much more useful than reading guidelines about how to write and publish papers (but you have to know the basics of writing a scientific paper, and someone who dissects one for you would help)
  • Writing a good paper presupposes a good understanding of the subject studied (in my view not entirely true ... you just need to have an intuition and pursue it. Having it all clear can take decades, and in the long run we are all dead. As D.K. also says, a good scientific paper is something in the flow of knowledge. Do not wait too long to write it)
  • Publishing the paper presupposes a good understanding of how the peer review process works (definitely true. Some opportunistic behaviour is necessary to survive).
To sum up, D.K.'s presentation is a "must read".

Looking at publishing from another side, papers are not the whole outcome of a piece of research, but one outcome among others, such as models, patents, books, data, and other things, especially if one considers that not everybody is a professor in real life (is there a sort of equation here? academia = publishing / non-academia: do not care?). Certainly, writing a paper and going through a review process can challenge your certainties and refine your knowledge. IMHO, doing research and writing are two different jobs that meet only in good papers (with the further note that unpublished research does not produce shared knowledge, and therefore science^2,^2b).

Finishing with the one who started the thread: Luca B. also points out (the papers in the figure are from his own CV^1) that publication histories and the peer review process can be weird and, in the light of history, even wrong (as a brief report by Keith Beven^3 included in D.K.'s presentation also tells). However, being persistent, he managed to publish. Persistence, besides being smart, is certainly a key to success.

Notes

^1 - I could make the same comment on some of my papers, and I could further include a paper that was never published despite good reviews, just by decision of the AE. Never mind: peer review is the worst form of selecting papers, except all the other forms that have been tried from time to time (paraphrasing W. Churchill).

^2 - Something is still transmitted through oral communication though.

^2b - Science, however, is not just shared knowledge. For instance, it is substantiated by the possibility of checking (well, falsifying) assertions with experiments, yes, in the uncertain light of probability.

^3 - I admire K. Beven's work, and I almost completely agree with what Jeff McDonnell (Google Scholar here) says about him (reported in D.K.'s slides). However, I feel urged to remark, without offending anyone, that there is a certain difference between Saul Bellow and Stephen King. The former certainly has fewer readers than the latter, but ...