Wednesday, 30 November 2016

Statistically significant trends - Short-term temperature trends are more uncertain than you probably think


Yellowknife, Canada, where the annual mean temperature is zero degrees Celsius.

In times of publish or perish, it can be tempting to put "hiatus" in your title and publish an average article on climate variability in one of the prestigious Nature journals. But my impression is that this does not explain all of the enthusiasm for short-term trends. Humans are greedy pattern detectors: it is better to see a tiger, a conspiracy or a trend change once too often than once too seldom. Thus maybe humans have a tendency to see significant trends where statistics keeps a cooler head.

Whatever the case, I expect that many scientists, too, will be surprised to see how large the difference in uncertainty is between long-term and short-term trends. However, I will start with the basics, hoping that everyone can understand the argument.

Statistically significant

That something is statistically significant means that it is unlikely to happen due to chance alone. When we call a trend statistically significant, it means that it is unlikely that there was no trend, but that the trend you see is due to chance. Thus to study whether a trend is statistically significant, we need to study how large a trend can be when we draw random numbers.

For each of the four plots below, I drew ten random numbers and then computed the trend. This could be 10 years of the yearly average temperature in Yellowknife*. Random numbers do not have a trend, but as you can see, a realisation of 10 random numbers appears to have one. These trends may be non-zero, but they are not significant.



If you draw 10 numbers and compute their trends many times, you can see the range of trends that are possible in the left panel below. On average these trends are zero, but a single realisation can easily have a trend of 0.2. Even higher values are possible with a very small probability. The statistical uncertainty is typically expressed as a confidence interval that contains 95% of all points. Thus even when there is no trend, there is a 5% chance that the data has a trend that is wrongly seen as significant.**

If you draw 20 numbers, that is, 20 years of data, the right panel shows that the trends are already considerably more accurate: there is much less scatter.
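The experiment described above can be sketched in a few lines. This is a minimal Python illustration using only the standard library, not the R code behind the plots; the function names, the number of repetitions and the seed are my own choices, and the noise is assumed to be independent and normally distributed with zero mean and standard deviation one, as in the Yellowknife example.

```python
import random

def ols_slope(values):
    """Least-squares slope of values against the years 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def trend_interval(n_years, n_runs=20000, seed=1):
    """Central 95% range of the trends found in trend-free random numbers."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    slopes = sorted(
        ols_slope([rng.gauss(0.0, 1.0) for _ in range(n_years)])
        for _ in range(n_runs)
    )
    lower = slopes[int(0.025 * n_runs)]
    upper = slopes[int(0.975 * n_runs) - 1]
    return lower, upper

for n_years in (10, 20):
    lower, upper = trend_interval(n_years)
    print(f"{n_years} years: 95% of trends lie between {lower:+.3f} and {upper:+.3f} per year")
```

With these assumptions the 10-year range should come out at roughly plus or minus 0.2 per year, and the 20-year range should be clearly narrower.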



To have a look at the trend errors for a range of different lengths of the series, the above procedure was repeated for lengths between 5 and 140 random numbers (or years) in steps of 5 years. The confidence interval of the trend for each of these lengths is plotted below. For short periods the uncertainty in the trend is enormous. It shoots up.



In fact, the confidence range for short periods shoots up so fast that it is hard to read the plot. Thus let's show the same data with different (double-logarithmic) axes in the graph below. Then the relationship looks like a straight line. That shows that the size of the confidence interval is a power law function of the number of years.

The exponent is -1.5. As an example, that means that the confidence interval of a ten year trend is 32 (10^1.5) times as large as the one of a hundred year trend.
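The exponent can be cross-checked without drawing any random numbers. The sketch below swaps the simulation for the textbook standard error of a least-squares trend through n independent values with standard deviation sigma, which is sigma * sqrt(12 / (n (n^2 - 1))), and fits a straight line on double-logarithmic axes. The function names are hypothetical, and independent noise is an assumption.

```python
import math

def trend_standard_error(n_years, sigma=1.0):
    """Standard error of the least-squares trend through n_years independent values."""
    return sigma * math.sqrt(12.0 / (n_years * (n_years ** 2 - 1)))

def loglog_exponent(lengths):
    """Slope of log(standard error) against log(length): the power-law exponent."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(trend_standard_error(n)) for n in lengths]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

lengths = range(5, 145, 5)  # 5, 10, ..., 140 years, as in the post
exponent = loglog_exponent(lengths)
ratio = trend_standard_error(10) / trend_standard_error(100)
print(f"fitted exponent: {exponent:.2f}")
print(f"10-year vs 100-year trend uncertainty: {ratio:.1f} times")
```

For large n the formula behaves like n^-1.5, which is where the exponent comes from. The exact ratio between 10 and 100 years is sqrt(1010), just under 32.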



Some people looking at the global mean temperature increase plotted below claim to see a hiatus between the years 1998 and 2013. A few years ago I could imagine people thinking: that looks funny, let's do a statistical test of whether there is a change in the trend. But when the answer then clearly is "No, no way", and the evidence shows it is "mostly just short-term fluctuations from El Niño", I find it hard to understand why people believe in this idea so strongly that they defend it against this evidence.

Especially now that it is so clear, without any need for statistics, that there never was anything like a "hiatus". But still some people claim there was one, but it stopped. I have no words. Really, I am not faking this, dear colleagues. I am at a loss.

Maybe people look at the graph below and think, well that "hiatus" is ten percent of the data and intuit that the uncertainty of the trend is only 10 times as large, not realising that it is 32 times.



Maybe people use their intuition from computing averages; the uncertainty of a ten year average is only 3 times as large as that of a 100 year average. That is a completely different game.

The plots below for the uncertainty in the average are made in the same way as the above plots for the trend uncertainty. Also here more data is better, but the function is much less steep. Plots of power laws always look very similar; you need to compare the axes or the computed exponent, which in this case is only -0.5.
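The same kind of experiment for the average can be sketched as follows. This is again a minimal Python illustration under the assumption of independent, normally distributed noise with standard deviation one; the function name, the number of repetitions and the seeds are my own choices.

```python
import math
import random
import statistics

def mean_spread(n_years, n_runs=5000, seed=2):
    """Standard deviation of the n_years average of trend-free N(0, 1) noise."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    means = [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n_years)])
        for _ in range(n_runs)
    ]
    return statistics.stdev(means)

spread_10 = mean_spread(10, seed=2)
spread_100 = mean_spread(100, seed=3)
print(f"simulated ratio, 10-year vs 100-year average: {spread_10 / spread_100:.2f}")
print(f"expected from exponent -0.5, sqrt(100 / 10):  {math.sqrt(100 / 10):.2f}")
```

The simulated ratio should land close to sqrt(10), about 3.2, which is far from the factor of about 32 for trends over the same two period lengths.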





It is typical to use 30 year periods to study the climate. These so-called climate normals were introduced around 1900, at a time when the climate was more or less stable and needed to be described for agriculture, geography and the like. Sometimes it is argued that to compute climate trends you need at least 30 years of data. That is not a bad rule of thumb and would avoid a lot of nonsense, but the 30 year periods were not intended as a period on which to compute trends. Given how bad the intuition of people apparently is, there seems to be no alternative to formally computing the confidence interval.

That short-term trends have such a large uncertainty also provides some insight into the importance of homogenisation. The typical time between two inhomogeneities is 15 to 20 years for temperature. The trend over the homogeneous subperiods between two inhomogeneities is thus very uncertain and not that important for the long-term trend. What counts is the trend of the averages of the homogeneous subperiods.

That insight makes you want to be sure you do a good job when homogenising your data, rather than mindlessly assuming that everything will be all right and that the raw data are good enough. Neville Nicholls wrote about how he started working on homogenisation:
When this work began 25 years or more ago, not even our scientist colleagues were very interested. At the first seminar I presented about our attempts to identify the biases in Australian weather data, one colleague told me I was wasting my time. He reckoned that the raw weather data were sufficiently accurate for any possible use people might make of them.
Sad.

[UPDATE: In part 2 of this series, I show how these large trend uncertainties, in combination with the deceptive strategy of "cherry-picking" a specific period, very easily produce a so-called "hiatus".]


Related reading

How can the pause be both ‘false’ and caused by something?

Atmospheric warming hiatus: The peculiar debate about the 2% of the 2%

Sad that for Lamar Smith the "hiatus" has far-reaching policy implications

Temperature trend over last 15 years is twice as large as previously thought

Why raw temperatures show too little global warming

Notes

* In Yellowknife the annual mean temperature is about zero degrees Celsius. Locally the standard deviation of annual temperatures is about 1°C. Thus I could conveniently use the normal distribution with zero mean and standard deviation one. The global mean temperature has a much smaller standard deviation of its fluctuations around the long-term trend.
** Rather than calling something statistically significant and thus only communicating whether the probability was below 5% or not, it fortunately becomes more common to simply give the probability (p-value). In the past this was hard to compute and people compared their computation to the 5% levels given in statistical tables in books. With modern numerical software it is easy to compute the p-value itself.
*** Here is the cleaned R code to generate the plots of this post.


The photo of Yellowknife at the top is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

7 comments:

Dikran Marsupial said...

"That something is statistically significant means that it is unlikely to happen due to chance alone. When we call a trend statistically significant, it means that it is unlikely that there was no trend, but that the trend you see is due to chance."

I don't think that is quite true. If an effect is statistically significant it means that if it happened by chance, it would be unusual to see an effect size as large as that observed, which is not quite the same thing. The p-value is p(X>xo|H0), i.e. the probability of an effect size (X) greater than that actually observed (xo) IF the null hypothesis (the effect is due to chance) is true, rather than p(H0|xo), i.e. the probability that the effect was due to chance (the null hypothesis) given the observed effect size. Unfortunately a frequentist hypothesis test fundamentally can't tell you p(H0|xo) (i.e. the thing you really want to know) because a frequentist can't assign a non-trivial probability to the truth of a particular hypothesis as it has no long run frequency (the way they define a probability), it is either true or it isn't - it isn't a random variable. This is why we should just say "we are [un]able to reject the null hypothesis" rather than say that the result probably occurred by chance, or that it didn't (well actually we can, but if we do so we are silently switching to a subjectivist Bayesian framework without stating our priors). The message of the post though, I totally agree with!

Victor Venema said...

Yes, I know I should have talked about accepting or rejecting the null hypothesis. If only because when the trend was larger, that could also have been because the signal was not made of random numbers, but correlated numbers.

I did not want to go there to make it accessible and shorter. I thought within that limit, it was quite a nifty formulation. :) If that means I am a fake Bayesian, that is okay. There are worse people.

Dikran Marsupial said...

No problem, the caveat is given in the comments and that is all that is needed. It is a difficult issue, because quite often it is reasonable to say that statistical significance means that the effect is unlikely to have occurred by chance, but the form of the analysis doesn't (and indeed can't) actually establish that. Unfortunately it is the question we want to ask, so it is natural to interpret the p-value as an answer to that question, rather than the question the NHST actually answers. The same problem arises with confidence intervals: most interpret them as meaning that the true value of the statistic lies in the interval with 95% probability, but unfortunately that is not what it actually means as it would also be assigning a probability to the truth of a particular hypothesis.

While I am a Bayesian by inclination, it is best to be happy with both approaches and use the method that most directly answers the question you actually want answered. Unfortunately most of the time frequentist methods are easy to apply, but don't directly answer the question posed and are conceptually difficult, whereas Bayesian methods directly answer the question posed and are conceptually straight-forward, but the implementation is so tricky that they are not that often used (although improvements in software are helping this).

Interpreting a frequentist test in a Bayesian manner is fine in my book, provided the switch in frameworks is noted.

"There are worse people."

yes, those statistical pedants are the worst in my experience! ;o)

realfacepalm said...

I usually use 30 years, and call out any global "trend" with less than 17 years:
"...Our results show that temperature records of at least 17 years in length are required for identifying human effects on global-mean tropospheric temperature."
"Separating signal and noise in atmospheric temperature changes: The importance of timescale" JGR Atmospheres, Santer et al 2011
http://onlinelibrary.wiley.com/doi/10.1029/2011JD016263/abstract

and a lot of Blog posts from tamino on too short periods and broken trends; e.g. https://tamino.wordpress.com.../2016/10/18/breaking-bad/

Victor Venema said...

With "at least 17 years" Santer does not say that 17 years is enough, only that less is certainly not enough. So also with Santer you can call out people looking at short-term trends longer than 17 years.

And if I recall correctly, this was a study based on climate model data. Models vary enormously when it comes to their internal variability, as Ed Hawkins shows in his blog post with the best title ever. So I would add an additional buffer for that.

Tamino is great. Everyone should read his posts before making a fool of themselves with another paper with "hiatus" in the title or abstract.

Brandon Shollenberger said...

I figured I should let you know I've published a post which is highly critical of this. You can read the details if you want, but put simply, the "analysis" used in this post is garbage. The series used in it are so radically different from what we have for temperature records the results of this "analysis" are meaningless.

Two central differences are: 1) Temperature data is not limited to annual records. The idea we have only 10 data points for 10 years of temperatures is laughable and necessary to come up with the claim there is 32 times as much uncertainty in 10 year trends as in 100 year trends. 2) The trends in this "analysis" are estimated for data which has no uncertainty. In the real world, individual points of data have uncertainty, and that uncertainty increases the further back in time you go. The increased uncertainty of past temperatures over recent temperatures would necessarily increase the uncertainty in any 100 year trend relative to any trend during the 21st century.

The "analysis" used in this post is highly misleading. As a consequence, the results are greatly exaggerated.

Victor Venema said...

Yes, for long periods inhomogeneities and how well we can correct them become more important. That is the argument I make at the end of my post.

Also for longer periods the assumption that the trend is linear no longer works.

Reality is complicated and there is no end to how many complications one can add. I was trained as a physicist and thus have the tendency to go to the simplest system that demonstrates the effect and study that system well. I feel that is the best route to understanding. You seem to be more of a climatologist; they like adding bells and whistles.

You are the second person this week protesting that I did not write the post they would have wanted me to write. Auto-correlations mean that you have effectively fewer samples and that short-term trends thus become even more uncertain. The auto-correlations for annual mean temperatures are modest. Monthly data has much stronger auto-correlations. Thus you do not get much more data when you go from annual to monthly data. You do get additional complications due to the uncertainty in the seasonal cycle and because the fluctuations also have a seasonal cycle. I did not want to go there, but to show the principle.

If you write a post about monthly data, do let me know; I would be interested in how much difference that makes compared to the factor 32 in uncertainty for a period that is 10 times as long. I would be surprised if this difference is not still a lot larger for trends than it is for averages and also a lot larger than just a factor 10.