## Abstract

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This twelfth installment of *Explorations in Statistics* explores the assumption of normality, an assumption essential to the meaningful interpretation of a *t* test. Although the data themselves can be consistent with a normal distribution, they need not be. Instead, it is the theoretical distribution of the sample mean or the theoretical distribution of the difference between sample means that must be roughly normal. The most versatile approach to assess normality is to bootstrap the sample mean, the difference between sample means, or *t* itself. We can then assess whether the distributions of these bootstrap statistics are consistent with a normal distribution by studying their normal quantile plots. If we suspect that an inference we make from a *t* test may not be justified—if we suspect that the theoretical distribution of the sample mean or the theoretical distribution of the difference between sample means is not normal—then we can use a permutation method to analyze our data.

- bootstrap
- Central Limit Theorem
- normal quantile plot
- permutation methods

This twelfth paper in *Explorations in Statistics* (see Refs. 3–12, 15) explores the tacit assumption that the data we want to analyze, often with a *t* test, are distributed normally. In this exploration we will review familiar tools, learn new tools we can use to investigate whether our data meet this assumption of normality, and review our options if our data fail to meet it. Before we do that, we will delve into the notion that the result of a *t* test is meaningful only if the data are distributed normally, or at least roughly so.

### The Assumption of Data Normality: an Overview

When we explored the bootstrap (6) we learned that the results of a *t* test—its *P* value and corresponding confidence interval—are meaningful only if the theoretical distribution of the sample mean is roughly normal (see also Refs. 3–5, 10, 14).^{1} This can happen in one of two ways. The first way: if our observations are drawn from a population that *is* distributed normally (Fig. 1). This more restrictive way is perpetuated in many textbooks and online resources. The second way: if our observations are drawn from *almost* any population as long as our sample size *n* is big enough (see Fig. 1).^{2} This more general way is a manifestation of the Central Limit Theorem (14, 19, 21).

Most populations have a mean μ and standard deviation σ. As we did in our inaugural exploration (3), suppose we draw from one of these populations an infinite number of samples, each with a-big-enough-*n* observations. The infinite number of sample means, *ȳ*_{1}, *ȳ*_{2}, . . . , *ȳ*_{∞}, will be distributed normally with mean μ and standard deviation σ/√*n*. In other words, the average of the sample means, Ave{*ȳ*}, will be the population mean μ, but the standard deviation of the sample means, SD{*ȳ*}, will be smaller than the population standard deviation σ by a factor of √*n* (Fig. 2).
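We can check these two facts, Ave{*ȳ*} ≈ μ and SD{*ȳ*} ≈ σ/√*n*, with a brief simulation. The companion code for this series is written in R; the sketch below expresses the same idea in Python with NumPy, using an arbitrary normal population, sample size, and number of replicate samples:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 50.0, 10.0, 64   # illustrative population mean, SD, and sample size
reps = 100_000                  # a large stand-in for an "infinite" number of samples

# Draw many samples of n observations and record each sample mean
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(means.mean())        # close to mu: Ave{ybar} recovers the population mean
print(means.std(ddof=0))   # close to sigma / sqrt(n): SD{ybar} shrinks by sqrt(n)
```

The same check works if the population is not normal; only the normality of the distribution of *ȳ* then depends on *n* being big enough.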

With this brief overview, we are almost ready to begin our exploration of the assumption of normality. First, we need to review the software we will use to help us learn about it.

### R: Basic Operations

The first paper in this series (3) summarized R (22) and outlined its installation. For this exploration there are three more steps: download Advances_Statistics_Code_Normal.R^{3} to your Advances folder, confirm you installed boot in our previous explorations (6, 15), and install the extra package nortest (16).^{4}

*To run R commands*. If you use a Mac, highlight the commands you want to submit and then press ⌘↲ (command key+enter). If you use a PC, highlight the commands you want to submit, right-click, and then click `Run line or selection`. Or highlight the commands you want to submit and then press Ctrl+R.

### Procedures to Assess Normality

The most versatile approach to assess normality is to bootstrap the sample mean (for a one-sample *t* test) or to bootstrap the difference between sample means (for a two-sample *t* test) (6, 10). We can then assess whether the distribution of either bootstrap statistic is consistent with a normal distribution by generating a normal quantile plot (1, 6); see Fig. 3. We did this when we explored the bootstrap (Fig. 4) and permutation methods (Fig. 5). Why is this the most versatile approach? Because what we really care about is not the distribution of the observations themselves, but the theoretical distribution of the sample statistic: the sample mean or the difference between sample means. And we know that a bootstrap distribution estimates the theoretical distribution of some sample statistic (6).
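As a sketch of this bootstrap-then-plot approach (in Python rather than the R this series uses; the sample, seed, and number of bootstrap replicates are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100)   # an illustrative skewed sample

# Bootstrap the sample mean: resample the data with replacement, same n, many times
boot_means = np.array([rng.choice(y, size=y.size, replace=True).mean()
                       for _ in range(10_000)])

# A normal quantile plot compares the sorted bootstrap means with normal quantiles;
# stats.probplot returns the plotting coordinates and the fit correlation r
(osm, osr), (slope, intercept, r) = stats.probplot(boot_means, dist="norm")
print(round(r, 3))
```

In place of visual inspection alone, the correlation *r* from the quantile plot gives a rough numeric summary: values near 1 are consistent with normality, although the plot itself reveals where any departure occurs.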

We know also that if our sample is too small, then the bootstrap distribution is likely to differ—perhaps a lot—from the theoretical distribution of our sample statistic (6). If we suspect our sample is too small for the bootstrap to be useful, what recourse do we have? We can create a normal quantile plot of the observations themselves (Fig. 6). In this situation, however, our interpretation of the normal quantile plot should be tentative.

We can also assess normality using formal statistical tests (2, 16, 20). For each of these tests the implicit null and alternative hypotheses, *H*_{0} and *H*_{1}, are

*H*_{0}: The sample observations are consistent with having come from a normal distribution.

*H*_{1}: The sample observations are not consistent with having come from a normal distribution.

We can get a sense of these tests if we use the samples of 100 observations depicted in Fig. 3. Suppose we define our critical significance level α, the probability that we reject a true null hypothesis, to be 0.10 (5, 13). The observations drawn from the normal distribution are consistent with having come from a normal distribution: 0.50 *≤ P ≤* 0.67 (Table 1). In contrast, the observations drawn from the log-normal and uniform distributions are not consistent with having come from a normal distribution: 0.0001 *≤ P ≤* 0.08. These formal results reinforce our interpretations of the corresponding normal quantile plots (see Fig. 3).
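Table 1 was produced with tests from R's nortest package. SciPy offers comparable tests; the sketch below applies the Shapiro-Wilk test to simulated samples of 100 observations (the seed and distribution parameters are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_obs = rng.normal(size=100)                   # drawn from a normal population
lognormal_obs = rng.lognormal(sigma=1.0, size=100)  # drawn from a skewed population

# Shapiro-Wilk: a small P value suggests the observations are
# not consistent with having come from a normal distribution
_, p_norm = stats.shapiro(normal_obs)
_, p_lognorm = stats.shapiro(lognormal_obs)
print(p_norm, p_lognorm)
```

At α = 0.10 we would expect to retain the null hypothesis for the normal sample and reject it decisively for the log-normal one, mirroring the pattern in Table 1.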

We can also apply these statistical tests to two-sample problems (Table 2).

When we assess normality, it is important to use graphical techniques and formal statistical tests in tandem: the mere numeric result of a formal statistical test can mislead us (see Ref. 5). With a large number of observations, a statistical test may detect a small, inconsequential departure from normality. With a small number of observations, a statistical test may have insufficient power (8) to detect a meaningful departure from normality.

### Practical Considerations

At the crux of meaningful results from a *t* test is the assumption that the theoretical distribution of the sample mean^{5} is roughly normal (3–6, 10, 14, 21). This assumption can be satisfied regardless of the distribution of the actual observations as long as our sample size is big enough. If our sample size is not big enough—if the Central Limit Theorem fails to hold—then we can transform the observations in the hopes that the theoretical distribution of the sample mean will be more normally distributed (see Fig. 4 and Ref. 6).

It is now time that we tackle the question, how big is *big enough*?

If our sample observations happen to be drawn from a population that is distributed normally, then any sample size is big enough for the theoretical distribution of the sample mean to be distributed normally (19, 21). See Fig. 7. If we analyze just three observations using a *t* test, then we are banking on this tenuous assumption.^{6} With three observations there is simply no way to know whether the underlying population is distributed normally (see Ref. 10).

If our sample observations are drawn from a population that is not distributed normally, then the answer to the question, how big is *big enough*?, is, it depends. The conventional rule-of-thumb is that a sample size of 30 is big enough for the theoretical distribution of the sample mean to be distributed roughly normally, even when the underlying population is skewed. But this convention is mere urban legend (17, 18, 21). When we explored the bootstrap, we discovered that a sample size of 40 was not big enough (see Ref. 6, Fig. 7).

In 1986 Moses wrote that a sample size might need to be quite large for the theoretical distribution of the sample mean to be distributed roughly normally—for the results of a *t* test to be meaningful—when the distribution of some underlying population was not normal (21). In 2014 Hesterberg, using simulations, confirmed that a sample size might need to exceed 5,000 observations when a population was even moderately skewed (17).

If we draw sample observations from a skewed population, the log-normal distribution in Fig. 3, we can see for ourselves that the empirical distribution of sample means is not normal, even for samples of 512 observations (Fig. 8).
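The idea behind Fig. 8 can be sketched in code: draw many samples of 512 observations from a log-normal population and examine the skewness of the resulting sample means. A normal distribution has skewness 0; the parameters below are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 512, 10_000

# Many samples of n = 512 observations from a skewed (log-normal) population
means = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n)).mean(axis=1)

# Residual positive skewness shows the distribution of the sample mean
# is still not quite normal, even at n = 512
print(round(stats.skew(means), 2))
```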

Because a bootstrap distribution of the sample mean can depart from normality in subtle ways, Hesterberg advocates bootstrapping *t* itself in addition to the sample mean (17, 18). In our second exploration (5), we used *t* to assess whether our sample observations were consistent with having come from a population with mean μ. We calculated *t* as

*t* = (*ȳ* − μ)/(*s*/√*n*),

where *ȳ* is the sample mean, *s* is the sample standard deviation, and *n* is the sample size, the number of observations in our sample.

We can generate a bootstrap distribution of *t* as

*t*^{∗}_{j} = (*ȳ*^{∗}_{j} − *ȳ*)/(*s*^{∗}_{j}/√*n*),

where *ȳ*^{∗}_{j} and *s*^{∗}_{j} represent the mean and standard deviation of bootstrap sample *j* and where *ȳ* represents the original sample mean. Suppose we draw *n* = 1,024 sample observations from the log-normal distribution in Fig. 3 and we then create *j* = 10,000 bootstrap samples. The bootstrap distribution of the sample mean is skewed, and the bootstrap distribution of *t* is even more so (Fig. 9). In the theoretical distribution of *t* with 1,024 *−* 1 degrees of freedom, 2.5% of the possible values of *t* are less than *−*1.962, and 2.5% of the possible values of *t* are greater than +1.962. In the skewed bootstrap distribution of *t* (see Fig. 9), 5.8% of the possible values of *t*^{∗} are less than *−*1.962, and 0.7% of the possible values of *t*^{∗} are greater than +1.962. The commands in *lines 401–405* of Advances_Statistics_Code_Normal.R return these values. Your values will differ slightly.
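This bootstrap-t calculation can be sketched as follows (in Python rather than the article's R; the seed is arbitrary, so the tail fractions will differ somewhat from the 5.8% and 0.7% reported above):

```python
import numpy as np

rng = np.random.default_rng(5)
n, B = 1024, 10_000
y = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # a skewed sample, as in Fig. 3
ybar = y.mean()

# t*_j = (ybar*_j - ybar) / (s*_j / sqrt(n)) for each bootstrap sample j
idx = rng.integers(0, n, size=(B, n))            # B bootstrap samples drawn at once
boot = y[idx]
t_star = (boot.mean(axis=1) - ybar) / (boot.std(axis=1, ddof=1) / np.sqrt(n))

# Fractions of t* beyond the theoretical critical values with n - 1 df;
# for a right-skewed population the lower tail is the heavier one
print(np.mean(t_star < -1.962), np.mean(t_star > 1.962))
```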

Lucky for us, when we bootstrap the sample mean *ȳ* or the statistic *t*, we do not have to guess about whether our sample size is big enough for the Central Limit Theorem to hold, about whether an inference we make from the traditional *t* test is justified. Instead, we can actually check.

If we suspect that an inference we make from a traditional test may not be justified—if we suspect that the theoretical distribution of the sample mean is not normal—then we can use a permutation method in conjunction with the traditional test (10).^{7} If our conclusion from permutation matches our conclusion from the traditional test, then the normality assumption for the traditional procedure is at least reasonably well met. If our conclusion from permutation conflicts with our conclusion from the traditional test, then we should suspect that the normality assumption has not been met. We want to opt not for the statistical procedure that produces our dream result, but for the statistical procedure that has its assumptions best satisfied (10).
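A minimal two-sample permutation sketch, with invented groups, looks like this (Python; the article's own analyses use R):

```python
import numpy as np

rng = np.random.default_rng(6)
a = rng.normal(0.0, 1.0, size=15)      # hypothetical group A
b = rng.normal(2.0, 1.0, size=15)      # hypothetical group B, shifted upward

observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])

# Under H0 the group labels are exchangeable: shuffle the labels many times
# and see how often the permuted difference is at least as extreme as observed
perm_diffs = np.empty(10_000)
for j in range(perm_diffs.size):
    rng.shuffle(pooled)
    perm_diffs[j] = pooled[:a.size].mean() - pooled[a.size:].mean()

p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed))
print(p_value)
```

Because permutation makes no appeal to normality, agreement between this *P* value and the traditional *t* test *P* value is reassurance that the normality assumption is reasonably well met.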

### Summary

We knew already that the *P* value and confidence interval from a *t* test are meaningful only if the theoretical distribution of the sample mean—or the theoretical distribution of the difference between sample means—is roughly normal. As this exploration has demonstrated, the most versatile approach to assess normality is to bootstrap the sample mean, the difference between sample means, or *t* itself. We can then assess whether the distributions of these bootstrap statistics are consistent with a normal distribution by perusing their normal quantile plots. Although we can also assess normality of the actual observations using formal statistical tests, it is important to use graphical techniques in conjunction with those formal tests.

## DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author.

## AUTHOR CONTRIBUTIONS

D.C.-E. drafted manuscript; edited and revised manuscript; approved final version of manuscript.

## ACKNOWLEDGMENTS

I thank Arindam Ganguly (Cornell University, Ithaca, NY) for suggesting this paper, and I thank Gerald DiBona (Göteborgs Universitet, Göteborg, Sweden and University of Iowa College of Medicine, Iowa City, IA), Arindam Ganguly, Andrew Greene (Medical College of Wisconsin, Milwaukee, WI), and Calvin Williams (Clemson University, Clemson, South Carolina) for their helpful comments and suggestions.

## Footnotes

1. The results of a two-sample *t* test are meaningful only if the theoretical distribution of the difference between sample means is roughly normal (10).
2. There are no hard-and-fast rules for how big *big enough* is. See *Practical Considerations*.
3. This file is posted as a Supplemental File under Figures & Data at the *Advances in Physiology Education* website.
4. Some of our previous explorations (6, 7, 10–12, 15) detail how to install an extra R package.
5. Or the theoretical distribution of the difference between sample means.
6. This is also true if we compare two groups of three observations.
7. Permutation is one kind of nonparametric procedure. Other nonparametric procedures include the sign test, the Wilcoxon signed-rank test, and the Wilcoxon rank-sum test (also known as the Mann-Whitney *U* or the Mann-Whitney-Wilcoxon test).

- Copyright © 2017 the American Physiological Society