Normal random samples
The Normal distribution
Recap of the essentials
You will have met the Normal distribution with mean \(\mu\) and variance \(\sigma^2\), denoted \(N(\mu,\sigma^2)\), in MA/MT11310. This is the most common distribution for continuous random variables, \(X\), e.g. height, weight, cost etc. It has probability density function (pdf) given by $$f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},\qquad -\infty<x<\infty.$$
This function does not have a closed-form anti-derivative, hence the need for Normal statistical tables.
The Standard Normal, often denoted \(Z\), is the Normal distribution with mean 0 and variance 1, i.e. \(Z\sim N(0,1)\). Percentage points for this distribution are readily found in statistical tables.
Any non-standard Normal distribution can be standardised. If \(X\sim N(\mu,\sigma^2)\), then \(\frac{X-\mu}{\sigma}\sim N(0,1)\).
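For instance, if \(X\sim N(10,4)\) then \(P(X\le 13)=P(Z\le 1.5)\approx 0.933\). A minimal Python sketch of the same calculation using scipy, with those made-up values (\(\mu=10\), \(\sigma=2\)):

```python
from scipy.stats import norm

# Illustrative values, not from the notes: X ~ N(10, 4), so sigma = 2.
mu, sigma = 10, 2
x = 13

z = (x - mu) / sigma                          # standardise: Z = (X - mu)/sigma ~ N(0, 1)
p_from_z = norm.cdf(z)                        # look up the standard Normal, as with tables
p_direct = norm.cdf(x, loc=mu, scale=sigma)   # same answer without standardising

print(z, p_from_z, p_direct)                  # 1.5, 0.9332..., 0.9332...
```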
Is my data Normally distributed?
Histograms, though often straightforward and easy to interpret, can suffer from the arbitrary choice of intervals. Moreover, an argument based solely on “it looks vaguely symmetrical” isn’t really good enough: lots of other, non-Normal, statistical distributions also look symmetrical.
The most useful plot for assessing Normality is the Normal Probability Plot. This is an example of a Quantile-Quantile or Q-Q plot, in which the observed quantiles of the data are plotted against the expected quantiles of the best fitting distribution of a given family, here the Normal. If the Normal assumption is justified we get a straight line.
In each Q-Q plot, the data is plotted on the \(Y\)-axis (hence actual, observed order statistics) while the \(X\)-axis gives the expected order statistic assuming a Normal distribution. In other words the minimum data value is plotted against the expected minimum, the 2nd smallest against its expected value and so on until the data maximum. A Normally distributed sample will appear as approximately a straight line.
Long-tailed data sets tend to appear as an elongated curve or double bend, while data less spread out than the Normal will appear more S-shaped, squashed somewhat in the vertical direction. Skewed data can also be recognised: many points will be concentrated at one end of the range, where the curve rises slowly, while in the tail the data are spread out and the curve rises steeply. Finally, discrete data will appear as horizontal rows of points concentrated on a few, often integer, values.
Note
Q-Q plots are the best graphical way to judge Normality - far better than a histogram. Just because a histogram looks vaguely symmetrical does not mean that underlying Normality is necessarily a justifiable assumption… “Normal” means more than “a bit symmetrical” – many non-Normal distributions are symmetrical.
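If you want to draw such a plot in software, the sketch below (Python with scipy and matplotlib, run here on simulated data purely for illustration) follows the construction described above: observed order statistics on the \(Y\)-axis against expected Normal order statistics on the \(X\)-axis.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sample = rng.normal(loc=10, scale=2, size=50)   # simulated data; replace with your own

# probplot sorts the data (observed order statistics, y-axis) and plots them
# against the expected standard Normal order statistics (x-axis)
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()
```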
Sums of Independent Normals
The sum of two independent Normals
If \(X\) and \(Y\) are independent variables with \(X\sim N(\mu_1,\sigma_1^2)\) and \(Y\sim N(\mu_2,\sigma_2^2)\), then \(X+Y\) is also Normal, with $$X+Y\sim N\left(\mu_1+\mu_2,\;\sigma_1^2+\sigma_2^2\right).$$
Note
As always with independent random variables, we can add the variances but not the standard deviations.
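A quick simulation (Python/numpy, with illustrative parameter values) confirms the rule: the means add and the variances add, but the standard deviations do not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(3, 2, n)    # X ~ N(3, 4): sd 2, variance 4 (illustrative values)
y = rng.normal(5, 1, n)    # Y ~ N(5, 1)
s = x + y

print(s.mean())   # close to 3 + 5 = 8
print(s.var())    # close to 4 + 1 = 5 (variances add; sds do not: 2 + 1 != sqrt(5))
```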
The Central Limit Theorem
One reason why the Normal distribution is so common is that when \(Y\) is the sum of a large number, \(n\), of independent random variables, \(X_i\), i.e. $$Y=X_1+X_2+\cdots+X_n,$$ then \(Y\) is approximately Normally distributed, whatever the individual distributions of the \(X_i\) (provided their variances are finite). This is the Central Limit Theorem; a small simulation illustrating it follows the examples below. Examples include:
Total times with many independent stages.
Total revenue or expenditure on many independent items.
The average weight of a large sample of similar items.
The average wage of a large number of similarly qualified workers.
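As a rough illustration of the Central Limit Theorem, the following sketch (Python, using an arbitrary choice of Exponential summands, which are clearly non-Normal) generates many such sums and checks that they look approximately Normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Each Y is the sum of n = 50 independent Exponential(1) variables (right-skewed, non-Normal).
n, reps = 50, 10_000
y = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)

print(y.mean(), y.var())   # both close to n * 1 = 50
print(stats.skew(y))       # close to 0: the sums look roughly symmetric and Normal
# A Normal Q-Q plot of y (as above) would be close to a straight line.
```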
Distribution of the sample mean
Suppose \(X_1,X_2,X_3\ldots,X_n\) are a random sample, i.e. independent with the same distribution, and suppose also they are each \(N(\mu,\sigma^2)\). We will often be interested in the distribution of the sample mean $$\bar X=\frac{1}{n}\sum_{i=1}^{n}X_i.$$
It follows that:
\(\bar X\) also has a Normal distribution.
Its expected value is \(\mu\), the mean of each \(X_i\), i.e. the population mean. In other words, \(\bar X\) is an unbiased estimator of \(\mu\).
Its variance is \(\text{Var}(\bar X)=\sigma^2/n\) and so its standard deviation is \(\mathrm{SD}(\bar X)=\sigma/\sqrt{n}\).
To summarise: $$ \bar{X} \sim N\left(\mu,\frac{\sigma^2}{n}\right). $$
Warning
In practical applications, we often wish to estimate the population mean, \(\mu\), from a sample. If the true value of \(\mu\) is unknown, then the true value of the population variance \(\sigma^2\) will almost always be unknown too. If \(\sigma^2\) is unknown, we must estimate it from the data using the sample variance, \(S^2\). This estimate brings extra error, which changes the distribution of \(\bar{X}\) – see the section on \(t\)-tests for details.
Note that the variance of \(\bar X\) will decrease as \(n\) increases, reflecting the greater information about the true mean, \(\mu\), available in a larger sample.
As an illustration, the table below gives the distribution of the sample mean for random samples of size \(n=2, 5, 25, 100\) when \(X \sim N(10, 100)\):
| Sample size, \(n\) | Sample mean, \(\bar{X}\) | Distribution of \(\bar X\) | Standard deviation of \(\bar{X}\) |
|---|---|---|---|
| 2 | \(\{X_1+X_2\}/2\) | \(N(10,50)\) | \(\sqrt{50}=7.07\) |
| 5 | \(\{X_1+\ldots+X_5\}/5\) | \(N(10,20)\) | \(\sqrt{20}=4.47\) |
| 25 | \(\{X_1+\ldots+X_{25}\}/25\) | \(N(10,4)\) | \(\sqrt{4}=2\) |
| 100 | \(\{X_1+\ldots+X_{100}\}/100\) | \(N(10,1)\) | \(\sqrt{1}=1\) |
Note
This explains why in any experiment, taking a bigger sample gives a better estimate of the underlying population mean \(\mu\). It’s because as \(n\) increases, \(\mathrm{Var}(\bar{X})\) decreases.
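The table above can be reproduced approximately by simulation. The sketch below (Python/numpy) draws many samples of each size from \(N(10,100)\) and checks that the variance of \(\bar X\) is close to \(\sigma^2/n\).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 10, 100                    # the X ~ N(10, 100) example from the table
for n in (2, 5, 25, 100):
    # 20,000 samples of size n; take the mean of each sample
    xbars = rng.normal(mu, np.sqrt(sigma2), size=(20_000, n)).mean(axis=1)
    print(n, xbars.mean().round(2), xbars.var().round(2), sigma2 / n)
    # the variance of the sample means should be close to sigma^2/n: 50, 20, 4, 1
```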
Unknown variance and the T statistic
As noted above, the population variance \(\sigma^2\) is usually unknown. Thus, while we wrote above that the standard deviation of \(\bar X\) is \(\sigma/\sqrt{n}\) (the square root of its variance, by definition of standard deviation), we cannot say what its value actually is. However, we can estimate the value of \(\sigma^2\) from our sample (via the sample variance, usually denoted \(S^2\)) and thus estimate the standard deviation of \(\bar X\).
\(ESE(\bar X)\) and the \(T\) statistic
Recall that the unbiased estimator of \(\sigma^2\) is the sample variance \(S^2\) defined by $$S^2=\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i-\bar X\right)^2.$$
Warning
While it’s true that \(S^2\) is an unbiased estimator of \(\sigma^2\), this does not imply that \(S\) is an unbiased estimator of \(\sigma\). In this module, we won’t dwell on this perhaps counter-intuitive point, which arises from the fact that \(\mathbb{E}[S]\neq\sqrt{\mathbb{E}[S^2]}\).
The standard deviation of \(\bar X\) is now estimated by the Estimated Standard Error $$\text{ESE}(\bar X)=\frac{S}{\sqrt{n}}.$$
Note
It’s important to recognise the difference between \(SD(\bar{X})={\sigma}/{\sqrt{n}}\) and \(ESE(\bar{X})={S}/{\sqrt{n}}.\) The former really is the standard deviation of \(\bar{X}\), but depends on the generally unknowable population variance \(\sigma^2\). The latter is an estimate of the former from the sample data.
Standardising the sample mean, but with \(S\) in place of the unknown \(\sigma\), now gives us $$T=\frac{\bar X-\mu}{S/\sqrt{n}}\sim t_{[n-1]},$$ the \(t\)-distribution on \(n-1\) degrees of freedom.
The \(t\)-distribution is symmetric about zero like the Normal but more spread out in the tails. Its spread depends on the sample size \(n\) via a parameter called the degrees of freedom (df), \(\nu=n-1\). Note that the df is the same as the divisor in the unbiased estimator of variance.
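To see the “more spread out in the tails” point numerically, compare the 97.5% points of \(t_{[\nu]}\) with that of \(N(0,1)\); a short scipy sketch:

```python
from scipy import stats

# 97.5% points: the t distribution has fatter tails than the standard Normal,
# but approaches it as the degrees of freedom (nu = n - 1) grow.
print(stats.norm.ppf(0.975))              # 1.96
for nu in (2, 5, 10, 30, 100):
    print(nu, stats.t.ppf(0.975, df=nu))  # 4.30, 2.57, 2.23, 2.04, 1.98
```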
Assumptions for \(t\)
There are two assumptions needed for \(t\)-tests or confidence intervals to be valid:
\(X_1,X_2,\ldots,X_n\), are a random sample i.e. independent observations from a single distribution.
Each \(X_i\) is \(N(\mu,\sigma^2)\), i.e. Normally distributed with mean \(\mu\) and variance \(\sigma^2\).
The first assumption is extremely important. It is usually a consequence of a well designed experiment or investigation where conditions are kept constant and observations are taken independently. If the observations form a time series, they may follow a trend or be autocorrelated. We may be able to spot this by plotting them sequentially in time.
Lack of independence can lead to very misleading inference. For example, if the \(X_i\) are positively correlated, then the true variance of \(\bar X\) will be greater than \(\sigma^2/n\); the usual one-sample \(t\) confidence interval will then be too narrow and the one-sample \(t\)-test will exaggerate the significance.
Secondly, what is really important is that the data do indeed come from a single sample with constant mean and variance, rather than a mixture of different groups. The assumption that the original data are Normal is not quite so important, because the \(t\)-distribution is robust to moderate departures from Normality, particularly in large samples. This is due to the Central Limit Theorem, which states that, provided \(\mathrm{Var}(X)\) is finite, the distribution of \(\bar X\) converges to a Normal as \(n\to\infty\). In practice, tests and confidence intervals are not much affected unless the data is highly skewed. However, we can and should check whether the Normal distribution is reasonable, for instance by examining a Normal Q-Q plot.
Hypothesis tests
Introduction to the \(t\)-test
Assume that \(X_1,X_2,\ldots X_n\) are a random sample, independent \(N(\mu,\sigma^2)\), but suppose we need to answer a specific question about the mean response, \(\mu\). For example, is the mean 10 (say)? Or greater than 10? How strong is the evidence that the mean is greater than zero?
We use a test of hypotheses, the one-sample \(t\)-test. The key result is that, if the true mean is \(\mu_0\), $$T=\frac{\bar X-\mu_0}{S/\sqrt{n}}\sim t_{[n-1]}.$$
We assess the evidence against an assumed value, the Null Hypothesis denoted by \(H_0\), by calculating a probability, \(p_0\), called the p-value, or observed level of significance. This is the probability of exceeding the observed test statistic and is calculated under the assumption that the Null Hypothesis is true. Here ‘exceeding’ could mean either too high or too low depending on which Alternative Hypothesis, \(H_1\), we consider.
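A sketch of the calculation in Python (scipy), on a small hypothetical data set, both “by hand” and using scipy’s built-in one-sample t-test:

```python
import numpy as np
from scipy import stats

data = np.array([10.9, 12.3, 11.4, 13.0, 10.2, 12.7, 11.8, 12.1])  # hypothetical sample
mu0 = 10                                   # null hypothesis H0: mu = 10

n, xbar = len(data), data.mean()
ese = data.std(ddof=1) / np.sqrt(n)        # ESE(xbar) = S / sqrt(n)
t_stat = (xbar - mu0) / ese                # T ~ t on n - 1 df under H0

# p-value for H1: mu > 10 (one-sided); use the upper tail of t on n - 1 df
p_one_sided = stats.t.sf(t_stat, df=n - 1)

# scipy's built-in version (two-sided by default; 'alternative' selects H1)
t_check, p_check = stats.ttest_1samp(data, popmean=mu0, alternative="greater")
print(t_stat, p_one_sided, t_check, p_check)
```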
Hypothesis testing - the background
Let’s cover some general points about statistical tests of hypotheses, illustrating them with the case of the \(t\)-test for a Normal mean, \(\mu\).
The hypotheses
Tests involve two conflicting hypotheses.
Firstly there is the null hypothesis, \(H_0\), e.g. in the \(t\)-test \(H_0: \mu=\mu_0\). This is assumed to be true unless the evidence is strong enough to reject it.
Secondly there is an alternative (or alternate) hypothesis, \(H_1\), e.g. in the \(t\)-test \(H_1: \mu>\mu_0\) (or \(\mu<\mu_0\) or \(\mu\neq \mu_0\) depending on the problem). We would require sufficient evidence before deciding that the alternative is true.
If the evidence is strong enough we reject \(H_0\) in favour of \(H_1\). Otherwise we say that there is insufficient evidence to reject \(H_0\). It is best to avoid saying “accept \(H_0\)”. For instance, it may be that with more data we might reject \(H_0\). It is also possible that a different test, perhaps for a more appropriate or relevant alternative, would lead to rejection.
Interpreting p-values
The following gives a useful guide to translating \(p_0\) into a verbal conclusion. Note that if \(p_0 \leq \alpha\) then we can say that \(H_0\) is rejected at the \(100\alpha\%\) level.
\(p_0\leq0.05\): significant at the 5% level, moderate evidence against \(H_0\), in favour of \(H_1\).
\(p_0\leq0.01\): significant at the 1% level, strong evidence against \(H_0\), in favour of \(H_1\).
\(p_0\leq0.001\): significant at the 0.1% level, very strong evidence against \(H_0\), in favour of \(H_1\).
\(p_0\leq0.0001\): significant at the 0.01% level, extremely strong evidence against \(H_0\), in favour of \(H_1\).
\(p_0>0.1\): not significant at the 10% level, little evidence against \(H_0\), not enough to reject.
\(0.05 < p_0 \leq 0.1\): not significant at the 5% level, but significant at 10%. A borderline case: sometimes more data is needed before a firm conclusion can be reached.
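If it helps, the verbal scale above can be written as a small helper function (purely illustrative, not part of any library):

```python
def evidence(p0: float) -> str:
    """Translate a p-value into the verbal scale used in these notes."""
    if p0 <= 0.0001:
        return "extremely strong evidence against H0"
    if p0 <= 0.001:
        return "very strong evidence against H0"
    if p0 <= 0.01:
        return "strong evidence against H0"
    if p0 <= 0.05:
        return "moderate evidence against H0"
    if p0 <= 0.1:
        return "borderline: significant at 10% but not at 5%"
    return "little evidence against H0"

print(evidence(0.003))   # strong evidence against H0
```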
Confidence Intervals
The main idea
A confidence interval is a type of interval estimate of a population parameter (often the mean, although we will consider other parameters later).
As an example to understand exactly what a confidence interval is, consider picking a random sample of five wild rabbits from the UK population of wild rabbits \(X\) and suppose we record the mass of each rabbit as \(x_1,x_2,\ldots,x_5\).
A point estimate of the mean mass of a UK wild rabbit could be obtained by simply taking the sample mean mass of the five rabbits, \(\bar X\). Used by itself, \(\bar X\) is of limited usefulness because it provides no information about its own reliability.
We may, however, by pure chance have picked a sample of rabbits that are unusually heavy, or unusually light. Picking another five rabbits instead would of course give us some other value of \(\bar X\).
We shall use a procedure based on our sample \(x_1,x_2,\ldots x_5\) to compute a 95% (say) confidence interval. If this procedure were to be repeated on multiple samples, the calculated confidence interval (which would differ for each sample) would include the true population parameter (in this case population’s mean rabbit mass) 95% of the time.
Confidence interval for \(\mu\) (with variance estimated from the sample)
We now derive a formula for a \(100(1-2\alpha)\%\) confidence interval for \(\mu\), the mean of the population from which the sample was taken. For instance, for a 98% interval \(\alpha=0.01\), for 95% \(\alpha=0.025\), and so on. The model we assume is that \(X_1, X_2, \ldots,X_n\) are a random sample from a Normal distribution, i.e. the \(X_i\) are independent \(N(\mu,\sigma^2)\).
Now, we have already established that \(T=\frac{\bar X-\mu}{S/\sqrt{n}}\sim t _{[n-1]}\), i.e. the t-statistic which can be calculated for any sample is distributed as a t distribution on \(n-1\) degrees of freedom.
If we can find two constants, let’s call them \(\pm{t_\alpha}_{[n-1]}\), so that $$P\left(-{t_\alpha}_{[n-1]}\le T\le {t_\alpha}_{[n-1]}\right)=1-2\alpha.$$
After some rearrangement this yields $$P\left(\bar X-{t_\alpha}_{[n-1]}\frac{S}{\sqrt{n}}\;\le\;\mu\;\le\;\bar X+{t_\alpha}_{[n-1]}\frac{S}{\sqrt{n}}\right)=1-2\alpha,$$ i.e. the \(100(1-2\alpha)\%\) confidence interval for \(\mu\) is $$\bar X\pm{t_\alpha}_{[n-1]}\frac{S}{\sqrt{n}}.$$
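A sketch of the calculation (Python/scipy, on a hypothetical sample), first by hand and then with scipy’s equivalent one-liner:

```python
import numpy as np
from scipy import stats

data = np.array([10.9, 12.3, 11.4, 13.0, 10.2, 12.7, 11.8, 12.1])  # hypothetical sample
n, xbar = len(data), data.mean()
ese = data.std(ddof=1) / np.sqrt(n)           # ESE(xbar) = S / sqrt(n)

conf = 0.95                                   # 95% interval, so 2*alpha = 0.05
t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
lower, upper = xbar - t_crit * ese, xbar + t_crit * ese
print(lower, upper)

# Equivalent one-liner using scipy:
print(stats.t.interval(conf, df=n - 1, loc=xbar, scale=ese))
```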
Interpretation
The interval is a random one since it will vary, depending on the outcome of the experiment. It will however cover the true \(\mu\) with probability \(1-2\alpha\). If we were to continually repeat the experiment and calculate a confidence interval for each new set of data, then a proportion \(1-2\alpha\), of these random intervals would include the true but unknown \(\mu\). For example with a 95% interval, we would “get it right” 95% of the time.
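This repeated-sampling interpretation can be checked by simulation; a minimal sketch (Python, with arbitrarily chosen true values \(\mu=10\), \(\sigma=2\), \(n=5\)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10, 2, 5, 10_000   # true values, chosen purely for illustration
t_crit = stats.t.ppf(0.975, df=n - 1)   # critical value for a 95% interval

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    xbar = sample.mean()
    ese = sample.std(ddof=1) / np.sqrt(n)
    if xbar - t_crit * ese <= mu <= xbar + t_crit * ese:
        covered += 1

print(covered / reps)   # close to 0.95: about 95% of the random intervals cover mu
```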
The \(Z\) confidence interval for \(\mu\)
Rather rarely the variance of the data is known and does not, therefore, have to be estimated from the data. Now