Stat talk: What’s a standard error and when should I report it rather than the standard deviation of my data?

Perhaps the most popular question SCU consultants get asked is whether it is better to report a standard deviation or a standard error. This question usually rests on a fundamental misunderstanding of two very different concepts. I suspect that students miss the point because their supervisors tell them, for example, to use the standard error because it’s smaller. Some students may find this a satisfactory answer for a little while, but the more inquisitive student may soon begin to feel dissatisfied with it.

Their intuition is absolutely right. Standard deviations and standard errors are very different concepts, and the correct one to use depends upon the context.

A standard deviation is a descriptive statistic describing the spread of a distribution. It is a very good description when the data are normally distributed, and less useful when the data are highly skewed or bimodal, because it then does not describe the shape of the distribution well. One generally uses the standard deviation when reporting the characteristics of the sample, because one is describing how variable the data are around the mean. Other useful statistics for describing the spread of the data are the interquartile range, the 25th and 75th percentiles, and the range of the data.
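As a quick illustration, these spread measures can all be computed directly. The sketch below uses Python with NumPy; the data and variable names are made up purely for the example.

```python
import numpy as np

# Hypothetical sample of 200 observations, e.g. patient ages (illustrative only)
rng = np.random.default_rng(42)
ages = rng.normal(loc=50, scale=20, size=200)

mean = ages.mean()
sd = ages.std(ddof=1)                      # sample standard deviation
q25, q75 = np.percentile(ages, [25, 75])   # 25th and 75th percentiles
iqr = q75 - q25                            # interquartile range
data_range = ages.max() - ages.min()       # range of the data

print(f"mean = {mean:.1f}, SD = {sd:.1f}")
print(f"25th percentile = {q25:.1f}, 75th percentile = {q75:.1f}, IQR = {iqr:.1f}")
print(f"range = {data_range:.1f}")
```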

Figure 1. The standard deviation is a measure of the spread of the data. When the data are a sample from a normal distribution, one expects about two-thirds of the data to lie within 1 standard deviation of the mean.

Variance is also a descriptive statistic, defined as the square of the standard deviation. It is not usually reported when describing results, but it is mathematically more tractable (being based on sums of squared deviations) and plays a central role in the computation of statistics. For example, if I have two independent statistics X and Y with known variances var(X) and var(Y), then the variance of the sum X + Y is equal to the sum of the variances, var(X) + var(Y). So you can see why statisticians like to talk about variances. But standard deviations carry an important meaning for spread, particularly when the data are normally distributed: the interval mean ± 1 SD can be expected to capture about two-thirds of the sample, and the interval mean ± 2 SD can be expected to capture about 95% of the sample.
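Both points are easy to check by simulation. Here is a small sketch (Python with NumPy; all numbers are made up for illustration) that confirms var(X + Y) ≈ var(X) + var(Y) for independent X and Y, and that mean ± 1 SD and mean ± 2 SD capture roughly 68% and 95% of a normal sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent X and Y: the variance of the sum equals the sum of the variances
x = rng.normal(0, 3, size=100_000)   # var(X) = 9
y = rng.normal(0, 4, size=100_000)   # var(Y) = 16
print(np.var(x + y), np.var(x) + np.var(y))   # both close to 25

# Coverage of mean +/- 1 SD and mean +/- 2 SD for a normal sample
data = rng.normal(loc=100, scale=20, size=100_000)
m, s = data.mean(), data.std(ddof=1)
print(np.mean(np.abs(data - m) <= 1 * s))   # roughly 0.68
print(np.mean(np.abs(data - m) <= 2 * s))   # roughly 0.95
```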

A standard error is an inferential statistic that is used when comparing sample means (averages) across populations. It is a measure of the precision of the sample mean. The sample mean is a statistic derived from the data, and so it has its own underlying distribution. We can’t visualise that distribution in the same way as the data, since we have performed a single experiment and have only a single value. Statistical theory tells us that the sample mean (for a large enough sample and under a few regularity conditions) is approximately normally distributed. The standard deviation of this normal distribution is what we call the standard error.

Figure 2. The distribution at the bottom represents the distribution of the data, whereas the distribution at the top is the theoretical distribution of the sample mean. The SD of 20 is a measure of the spread of the data, whereas the SE of 5 is a measure of uncertainty around the sample mean.
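You can watch this sampling distribution emerge by simulation. The sketch below (Python with NumPy; SD = 20 and n = 16 are chosen to match the figure, everything else is made up) repeats the “experiment” many times and looks at the spread of the resulting sample means.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16            # with SD = 20, the standard error is 20 / sqrt(16) = 5
n_experiments = 10_000

# Repeat the "experiment" many times and keep each sample mean
sample_means = np.array([
    rng.normal(loc=100, scale=20, size=n).mean()
    for _ in range(n_experiments)
])

# The spread of the sample means approximates the standard error
print(sample_means.std(ddof=1))   # close to 5
print(20 / np.sqrt(n))            # theoretical SE = SD / sqrt(n)
```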

When we want to compare the means of outcomes from a two-sample experiment of Treatment A vs Treatment B, we need to estimate how precisely we’ve measured the means. Actually, we are interested in how precisely we’ve measured the difference between the two means. We call this measure the standard error of the difference. You may not be surprised to learn that the standard error of the difference in the sample means is a function of the standard errors of the two means. For independent samples,

SE(difference) = √( SE(A)² + SE(B)² ).

Now that you’ve understood that the standard error of the mean (SE) and the standard deviation of the distribution (SD) are two different beasts, you may be wondering how they got confused in the first place. Whilst they differ conceptually, they have a simple relationship mathematically:

SE = SD / √n,

where n is the number of data points.

Notice that the standard error depends upon two components: the standard deviation of the sample, and the sample size n. This makes intuitive sense: the larger the standard deviation of the sample, the less precisely we can estimate the true mean. Also, the larger the sample size, the more information we have about the population and the more precisely we can estimate the true mean.
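A tiny sketch of that dependence (Python again, with illustrative numbers only): holding the SD fixed, the standard error shrinks as n grows, and quadrupling the sample size halves the SE.

```python
import numpy as np

sd = 20.0
for n in [4, 16, 64, 256]:
    se = sd / np.sqrt(n)          # SE = SD / sqrt(n)
    print(f"n = {n:4d}  ->  SE = {se:.2f}")
# Quadrupling the sample size halves the standard error.
```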

Finally, to answer the question: when should I report SD? Whenever you want to describe the characteristics of the distribution of your sample, then use SD or other measures of spread. If you are reporting on a group of patients, you may want to report their mean age, and variation around the mean (reported as SD).

When should I report SE? Whenever you are comparing means and you wish to infer that the two means are different, then report SE. This applies to tables and also to graphs.

One of our students sent me a few good references on this subject that you may find interesting. More questions? Come visit the SCU for some more stat talk.

Additional References

Altman, D. G. and Bland, J. M. (2005) Standard deviations and standard errors. BMJ, 331, 903. http://www.ncbi.nlm.nih.gov/pubmed/16223828

Biau, D. J. (2011) Standard deviation and standard error. Clin Orthop Relat Res, 469, 2661–2664.

Nagele, P. (2003) Misuse of standard error of the mean (SEM) when reporting variability of a sample: a critical evaluation of four anaesthesia journals. Br J Anaesth, 90, 514–516. http://www.ncbi.nlm.nih.gov/pubmed/12644429

This post originally appeared in SCU news items: the original author is unknown. It was selected for re-publication in this blog by guest blogger Alice Richardson.

Associate Professor Alice Richardson is Director of the Statistical Consulting Unit (SCU) at the Australian National University. Her research interests are in linear models and robust statistics; statistical properties of data mining methods; and innovation in statistics education. In her role at the SCU she applies statistical methods to large and small data sets, especially for research questions in population health and the biomedical sciences.

What’s going on with p values?

You could be forgiven for feeling a bit confused about the most appropriate way to summarise the statistical results of your research. It seems like in the past a p-value was all you needed. If p < 0.05, hooray! Significant result, paper published. If p > 0.05, misery! Non-significant result, paper pushed into the bottom of a filing cabinet. But nowadays it’s not so simple any more.

Where did the notion of the magic 0.05 defining success and failure come from? The trail mostly leads back to R.A. Fisher, in his book “Statistical Methods for Research Workers.” What he actually wrote was “The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.” The value he is referring to here, the 1.96, is the point on a standard normal distribution that has 2.5% of the distribution above it, and 2.5% of the distribution below -1.96.
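That 1.96 is simply the 97.5th percentile of the standard normal distribution, which is easy to check for yourself. Here is a two-line sketch, assuming scipy.stats is available:

```python
from scipy.stats import norm

# 97.5th percentile of the standard normal: 2.5% of the distribution lies above it
print(norm.ppf(0.975))            # 1.959963...
# Probability of falling more than 1.96 SDs from the mean (both tails)
print(2 * (1 - norm.cdf(1.96)))   # approximately 0.05
```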

And that’s it! It’s not at all clear if he intended the research community to grab hold of this rule of thumb and run with it quite as far and quite as fast as it did.

Actually, there have been rumblings about the misinterpretation of p-values for almost as long as there have been p-values. The references in the ASA statement show that papers with this message started to appear in the 1960s, and the popular science media has been printing articles on the topic regularly for the last 10 years or so.

Even the cartoonists have got in on the act: you can find one here and there are plenty more to entertain you. Try PhDcomics.com and xkcd for lots of light relief during your research.

Then there was the highly publicised and controversial move in 2015 by the journal Basic and Applied Social Psychology to ban p-values from its published articles. Next came the American Statistical Association’s statement on p-values, published in 2016. This was the first time the ASA had put out a statement on a specific statistical practice, so it’s not to be taken lightly. The original article containing the statement comes with supporting comments from 25 eminent statisticians.

The ASA article about the statement opens with the essential paradox underlying why p-values persist. It goes like this:

Q: Why do so many colleges and grad schools teach p = 0.05?

A: Because that’s still what the scientific community and journal editors use.

Q: Why do so many people still use p = 0.05?

A: Because that’s what they were taught in college or grad school.

And what exactly is a p-value anyway? The statement starts with this (informal) definition: “Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”
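To make that definition concrete, here is a minimal sketch of a permutation test (Python with NumPy; the data and group labels are made up for illustration). The “specified statistical model” is that the group labels are exchangeable (no group difference), the “statistical summary” is the difference in sample means, and the p-value is the proportion of shuffled datasets whose summary is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical outcomes for two groups (illustrative only)
group_a = np.array([23.1, 25.4, 22.8, 26.0, 24.3, 27.1, 23.9, 25.2])
group_b = np.array([21.0, 22.5, 20.8, 23.4, 22.0, 21.7, 22.9, 20.5])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

n_perm = 10_000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean()
    if abs(diff) >= abs(observed):   # "equal to or more extreme" (two-sided)
        count += 1

p_value = count / n_perm
print(f"observed difference = {observed:.2f}, permutation p-value = {p_value:.4f}")
```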

Six principles govern the use of the p-value according to the ASA statement.

P-values can indicate how incompatible the data are with a specified statistical model.

This is the principle underlying what is likely to be your intuition about p-values: the smaller they are, the more incompatible your data are with the hypothesis (i.e. the specified statistical model) that you are testing.

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

This is the big one when it comes to misinterpretation of p-values. If I had $1 for every time a first-year student told me this is what p-values measure, I would be a wealthy person indeed! It’s very tempting, of course, to say that a small p-value means there’s a small chance of my hypothesis being true, but a quick glance back at the informal definition shows that this temptation misses the mark in a couple of ways. Firstly, the definition involves an assumption that the hypothesis is true, not a test of whether it is or not. Secondly, the statement to which a probability is attached is not about the hypothesis, but about the data, summarised in the form of a test statistic. It’s subtle, to be sure, but important as well.

Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

It’s true that many statistical analyses do end in a conclusion being drawn or a decision being made. The point of this principle is that the big picture needs to be taken into account, including all the decisions leading up to the calculation of the p-value. In practical situations the cost of a decision is also important, along with the cost of making a mistake such as declaring a result to be significant when in fact it’s not.

Proper inference requires full reporting and transparency.

This principle follows on nicely from the one before – the notion that the full process of data collection, cleaning and model selection should be part of a researcher’s reporting, not just a single number.

A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

Quite so – we have a whole range of effect size statistics that measure the size of an effect. Have another look at the definition – there’s nothing there about the practical significance, only the statistical significance.

By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

It should be unusual for a data analysis to finish with a p value. In particular, a large p-value (p > 0.05) need not be the end of the story. The particular model you fitted didn’t show a statistically-significant effect for some variable … but maybe some other model, with some other variables, would show a statistically-significant effect.

So what to do? Keep an eye on the culture in your discipline – is it moving away from p-values and towards the simultaneous reporting of a point estimate and confidence interval? Keep the definition of a p-value firmly lodged in your memory, so that you lessen the risk of the tempting misinterpretations. And keep the contact details of your local statisticians close by too. They are happy to advise on the statistical aspects of your project from the beginning design phases right through to the detailed reporting and correct interpretation of results.
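As an example of that style of reporting, here is a sketch (Python with NumPy and scipy.stats, made-up numbers) that summarises a two-sample comparison with the estimated difference and its 95% confidence interval alongside the p-value. The Welch-style interval used here is one reasonable choice, not the only one.

```python
import numpy as np
from scipy import stats

# Hypothetical outcomes for Treatment A and Treatment B (illustrative only)
a = np.array([5.1, 6.3, 5.8, 6.9, 5.5, 6.1, 5.9, 6.4])
b = np.array([4.8, 5.2, 4.9, 5.6, 5.0, 5.3, 4.7, 5.4])

diff = a.mean() - b.mean()
se_diff = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

# Welch degrees of freedom and a 95% confidence interval for the difference
df = se_diff**4 / ((a.var(ddof=1) / len(a))**2 / (len(a) - 1)
                   + (b.var(ddof=1) / len(b))**2 / (len(b) - 1))
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se_diff, diff + t_crit * se_diff

t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)   # Welch t-test
print(f"difference = {diff:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f}), p = {p_value:.4f}")
```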

Associate Professor Alice Richardson is Director of the Statistical Consulting Unit (SCU) at the Australian National University. Her research interests are in linear models and robust statistics; statistical properties of data mining methods; and innovation in statistics education. In her role at the SCU she applies statistical methods to large and small data sets, especially for research questions in population health and the biomedical sciences.

What method do I need to analyse my data?

In the FAQ series we elaborate on common questions that come across our desk. This is not so that you stop asking us those questions, but to give you a deeper understanding of how to formulate your questions more specifically for your research context.

On the surface it seems like the right question to ask a statistician. “What method do I need to analyse my data?” But don’t be surprised if the answer you get is: “It depends”.

Let’s dig a little bit deeper to understand why this question doesn’t generate a specific answer.

When academics think about their research, they commonly have a linear process in mind. A research question gets translated into a data collection exercise, and once the data are collected they need to be analysed. Or, alternatively, certain data become available and generate some questions, and statistics are then used to answer those questions.

In the statistician’s mind, the research question (problem), the data and the analysis are all connected. It might help to think of them as being part of a triangle.

A problem cannot be resolved without data, and an analysis requires data, but the analysis also needs to be capable of providing an answer to the problem. If you only consider the analysis method after the data have been collected, it can become rather difficult to find a method that both handles the characteristics of the data and answers the problem.

Or to put it in Ronald Fisher’s words:

To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.

~ RA Fisher 1938

So the question of which method you need should be asked sooner rather than later. A well thought out research plan considers what data are needed to answer the research questions, as well as which analysis suits the characteristics of those data and will provide insight into the answers.

Try to avoid thinking about your research plan as being linear. It is not a sequential timeline. All pieces of the puzzle need to fit before you can plan them out to be executed over time.

Marijke joined the Statistical Consulting Unit in May 2019. She is passionate about explaining statistics, especially to those who deem themselves not statistically gifted.