Previous studies

The benchmark dataset of the ISTI (International Surface Temperature Initiative; Thorne et al., 2011) will be global and is also intended to be used to estimate uncertainties in the climate signal due to remaining inhomogeneities. These are the two main improvements over previous validation studies.
Williams, Menne, and Thorne (2012) validated the pairwise homogenization algorithm of NOAA on a dataset mimicking the US Historical Climate Network. The paper focusses on how well large-scale biases can be removed.
The COST Action HOME performed a benchmarking of several small networks (5 to 19 stations) realistically mimicking European climate networks (Venema et al., 2012). Its main aim was to intercompare homogenization algorithms; the small networks also allowed HOME to test manual homogenization methods.
These two studies were blind; in other words, the scientists homogenizing the data did not know where the inhomogeneities were. An interesting coincidence is that the people who generated the blind benchmarking data were outsiders at the time: Peter Thorne for NOAA and me for HOME. This probably explains why we both made an error, which we should not repeat in the ISTI.
Inhomogeneities for benchmarking

One of the nice things about benchmarking, about generating artificial inhomogeneities, is that you have to specify exactly what the inhomogeneities look like. By doing so you notice how much we do not know yet. For the ISTI benchmark we noticed how little we know about the statistical properties and causes of inhomogeneities outside of Europe and the USA. The frequency and magnitude of the breaks are expected to be similar, but we know little about their seasonal cycle and biases. This makes it important to include a broad range of possibilities in a number of artificial worlds. Afterward we can test which benchmark was nearest to the real dataset by comparing the detected inhomogeneities for the real data and the various benchmarks.
One of those details is the question of how to implement the break inhomogeneities. Some breaks are known from the metadata (station history), for example relocations, changes of instrumentation, and changes of screens. If you make a distribution of the jump sizes, the difference in mean temperature before and after a break, you find a normal distribution with a standard deviation of about 0.7°C for the USA. For Europe the experts thought that 0.8°C would be a realistic value.
But size is not all. There are two main ways to implement such inhomogeneities. You can perturb the homogeneous data between two inhomogeneities (by a random number drawn from a normal distribution); that is what I would call noise. The noise is a deviation from a baseline (the homogeneous validation data, which should be reconstructed at the end).
You can also start at the beginning, perturb the data for the first homogeneous subperiod (HSP), the period between the first and the second break (by a random number), and then perturb the second HSP relative to the first HSP. This is what I would call a random walk.
This makes a difference: for the random walk the deviation from the homogeneous data grows with every break, at least on average, whereas for the noise the deviation stays the same on average. As a consequence, the random walk produces larger trend errors than the noise, and its breaks are probably also easier to find.
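The two schemes can be sketched in a few lines of Python. This is a minimal illustration, not part of any benchmark code; the number of breaks and the 0.6°C scale are assumed values for the sake of the example:

```python
import random

random.seed(42)

N_BREAKS = 10
SIGMA = 0.6  # perturbation scale in deg C; an assumed value for illustration

# One random draw per homogeneous subperiod (HSP):
# N_BREAKS breaks separate N_BREAKS + 1 subperiods.
draws = [random.gauss(0.0, SIGMA) for _ in range(N_BREAKS + 1)]

# Noise: each HSP deviates independently from the homogeneous baseline,
# so the expected deviation does not grow with the number of breaks.
noise_levels = list(draws)

# Random walk: each HSP is perturbed relative to the previous HSP,
# so the deviations from the baseline accumulate over time.
walk_levels = []
total = 0.0
for d in draws:
    total += d
    walk_levels.append(total)
```

The accumulation in the second loop is what makes the walk's deviation from the baseline grow with the number of breaks.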
Real inhomogeneities

The next question is: what do real inhomogeneities look like? Like noise, or like a walk? We studied this in the paper on the HOME benchmark by comparing the statistical properties of the detected breaks on the benchmark and on some real datasets. To understand the quote from the HOME paper below, you have to know that the benchmark contained two datasets with two different methods of generating the homogeneous data: surrogate and synthetic data.
"If the perturbations applied at a break were independent, the perturbation time series would be a random walk. In the benchmark the perturbations are modeled as random noise, as a deviation from a baseline signal, which means that after a large break up (down) the probability of a break down (up) is increased. Defining a platform as a pair of breaks with opposite sign, this means that modeling the breaks as a random noise produces more than 50 % platform pairs [while for a random walk the percentage is 50%, VV]. The percentage of platforms in the real temperature data section is 59 (n=742), in the surrogate data 64 (n=1360), and in the synthetic data 62 (n=1267). The artificial temperature data thus contains more platforms; the real data is more like a random walk. This percentage of platforms and the difference between real and artificial data become larger if only pairs of breaks with a minimum magnitude are considered."

In other words, for a random walk you expect 50% platform break pairs; the real number is clearly higher and close to the value for noise. However, there are somewhat fewer platform break pairs than you would expect for noise. Thus reality is in between, but quite close to noise.
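The platform statistic is easy to reproduce with a small Monte Carlo experiment (a sketch; the sample size and unit break scale are arbitrary choices of mine). For noise, two consecutive jumps share one level and are therefore negatively correlated, which makes opposite-sign pairs more likely than 50%; for a walk the jumps are independent:

```python
import random

random.seed(0)

def platform_fraction(jumps):
    """Fraction of consecutive break pairs with opposite sign (a platform)."""
    opposite = sum(1 for a, b in zip(jumps, jumps[1:]) if a * b < 0)
    return opposite / (len(jumps) - 1)

n = 100_000

# Random walk: the jumps themselves are independent draws.
walk_jumps = [random.gauss(0.0, 1.0) for _ in range(n)]

# Noise: the *levels* are independent; the jumps are their differences, so
# consecutive jumps share one level and are negatively correlated.
levels = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
noise_jumps = [b - a for a, b in zip(levels, levels[1:])]

print(f"walk:  {platform_fraction(walk_jumps):.3f}")   # close to 0.5
print(f"noise: {platform_fraction(noise_jumps):.3f}")  # close to 2/3
```

For Gaussian noise the theoretical platform fraction is exactly 2/3, so the observed 59% for real data indeed lies between the walk (50%) and noise (67%) cases, closer to noise.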
In HOME we modelled the perturbations as noise, which luckily turned out well. Lucky, because when generating the dataset I had not considered the alternative; maybe my colleagues would have warned me. However, I had stupidly not thought of the simple fact that the size of the breaks is larger than the size of the noise, by a factor of the square root of two, because the size of one jump is determined by two values: the noise before and the noise after the break. This is probably the main reason why we found in the same validation that the breaks we had inserted were too big: we used 0.8°C for the standard deviation of the noise, but 0.6°C would have been closer to the real datasets. In the NOAA benchmarking study the perturbations were modelled as a random walk, if I understand it correctly (and with a wide range of break sizes from 0.2 to 1.5°C).
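The square-root-of-two factor follows from the variance of a difference of two independent draws: Var(x2 − x1) = 2σ², so the jump standard deviation is √2 times the noise standard deviation, and a 0.6°C noise level corresponds to roughly 0.85°C jumps. A quick numerical check (the sample size is arbitrary):

```python
import math
import random

random.seed(1)

SIGMA_NOISE = 0.6  # per-subperiod perturbation std in deg C

# Independent levels, one per homogeneous subperiod; a jump is the
# difference between two consecutive levels.
levels = [random.gauss(0.0, SIGMA_NOISE) for _ in range(200_000)]
jumps = [b - a for a, b in zip(levels, levels[1:])]

mean = sum(jumps) / len(jumps)
std_jumps = math.sqrt(sum((j - mean) ** 2 for j in jumps) / len(jumps))

print(std_jumps)                   # empirically close to 0.85
print(SIGMA_NOISE * math.sqrt(2))  # theory: 0.6 * sqrt(2) = 0.849
```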
The ISTI benchmark

How should we insert break inhomogeneities in the ISTI benchmark? Mainly as noise, but also partially as a random walk, I would argue.
Here it should be remembered that break inhomogeneities are not purely random, but can also have a bias. For example, the transition to Stevenson screens resulted in a bias of less than 0.2°C according to Parker (1994), based mainly on North-West European data. The older data had a warm bias due to radiation errors.
It makes sense to expect that once such errors have been noticed, they are not reintroduced. In other words, such bias inhomogeneities behave like random walks: they continue until the end of the record, and future inhomogeneities build on them. If we model the biases due to inhomogeneities as a random walk and their random components as noise, we may well be close to the mixture of noise and random walk found in the real data.
One last complication is that the bias is not constant: the network-mean bias of a certain transition will have a different effect at every station. For example, in the case of such radiation errors, it would depend on the insolation and thus on cloudiness (for the maximum temperature), on the humidity and cloudiness at night (for the minimum temperature), and on the wind (because of ventilation).
Thus if the network-mean bias is bn, the station bias bs could be drawn from a normal distribution with mean bn and standard deviation rn. To this one would add a random component rs, drawn from a normal distribution with mean zero and a standard deviation of about 0.6°C. The bias component would be implemented as a perturbation from the break to the end of the series, and the random component as a perturbation from one break to the next.
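This scheme could be sketched as follows. It is a hypothetical implementation, not the actual ISTI benchmark code; the function name, the break positions, and the bias parameters (other than the 0.6°C standard deviation from the text) are my own choices:

```python
import random

random.seed(7)

B_NETWORK = -0.2    # network-mean bias bn of a transition (deg C); assumed
R_NETWORK = 0.1     # station-to-station spread rn of that bias (deg C); assumed
SIGMA_RANDOM = 0.6  # std of the random break component (deg C)

def perturb_station(series, break_positions):
    """Insert breaks into a homogeneous series: the bias component acts like
    a random walk (perturbs the series from the break to the end), the random
    component like noise (perturbs only one homogeneous subperiod)."""
    series = list(series)
    n = len(series)
    breaks = sorted(break_positions)
    for i, pos in enumerate(breaks):
        next_pos = breaks[i + 1] if i + 1 < len(breaks) else n
        bias = random.gauss(B_NETWORK, R_NETWORK)  # station bias bs
        noise = random.gauss(0.0, SIGMA_RANDOM)    # random component rs
        for t in range(pos, n):         # bias persists until the end
            series[t] += bias
        for t in range(pos, next_pos):  # noise lasts only this subperiod
            series[t] += noise
    return series

# Example: 120 monthly anomalies, breaks after months 40 and 80.
perturbed = perturb_station([0.0] * 120, [40, 80])
```

Because each bias draw runs to the end of the series, the biases accumulate like a walk, while the random components remain deviations from the baseline, giving the desired mixture of the two behaviours.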
References

Parker, D.E. Effects of changing exposure of thermometers at land stations. Int. J. Climatol., 14, pp. 1–31, doi: 10.1002/joc.3370140102, 1994.
Thorne, P.W., K.M. Willett, R.J. Allan, S. Bojinski, J.R. Christy, N. Fox, et al. Guiding the Creation of A Comprehensive Surface Temperature Resource for Twenty-First-Century Climate Science. Bull. Amer. Meteor. Soc., 92, ES40–ES47, doi: 10.1175/2011BAMS3124.1, 2011.
Venema, V., O. Mestre, E. Aguilar, I. Auer, J.A. Guijarro, P. Domonkos, G. Vertacnik, T. Szentimrey, P. Stepanek, P. Zahradnicek, J. Viarre, G. Müller-Westermeier, M. Lakatos, C.N. Williams, M.J. Menne, R. Lindau, D. Rasol, E. Rustemeier, K. Kolokythas, T. Marinova, L. Andresen, F. Acquaotta, S. Fratianni, S. Cheval, M. Klancar, M. Brunetti, Ch. Gruber, M. Prohom Duran, T. Likso, P. Esteban, Th. Brandsma. Benchmarking homogenization algorithms for monthly data. Climate of the Past, 8, pp. 89-115, doi: 10.5194/cp-8-89-2012, 2012.
Williams, C.N., Jr., M.J. Menne, and P. Thorne. Benchmarking the performance of pairwise homogenization of surface temperatures in the United States. J. Geophys. Res., 117, no. D05116, doi: 10.1029/2011JD016761, 2012.