Linear regression

Scatterplots

Types of scatterplot

These are useful for displaying the relationship between two variables \(X\) and \(Y\). There are three possibilities:

  1. Both \(X\) and \(Y\) are measurement data and both are random variables.

  2. Both \(X\) and \(Y\) are measurement data but \(X\) is selected by the experimenter.

  3. \(Y\) is a measurement variable but \(X\) is merely a code for an ordinal or nominal categorical variable.

There are two common ways in which this type of data is gathered:

  • We take a random sample of \(n\) individuals or items from a population and observe \(n\) independent pairs \((x_i,y_i)\).

  • The experimenter fixes \(n\) values \(x_1,\dots,x_n\) and measures the corresponding responses \(y_i\), subject to independent errors.

Note

One reason why we should always plot \(y_i\) against \(x_i\) before fitting any regression model is to check whether fitting a straight line makes sense, or whether some other model may be more appropriate. In some cases, to obtain a straight-line relationship it may be necessary to transform one or both of \(x\) and \(y\), e.g. by taking logs, reciprocals, square roots or other powers.
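For instance, here is a minimal Python sketch (the data values and the choice of a log transform are illustrative assumptions, not taken from these notes) of plotting the raw data next to a transformed version:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data, chosen to follow a rough power law for illustration.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = np.array([2.1, 3.9, 8.3, 15.8, 33.0, 64.5])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# Raw scatterplot: check whether a straight line is plausible.
axes[0].scatter(x, y)
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")
axes[0].set_title("Raw data")

# Log-log scatterplot: a power-law relationship appears linear here.
axes[1].scatter(np.log(x), np.log(y))
axes[1].set_xlabel("log x")
axes[1].set_ylabel("log y")
axes[1].set_title("After log transform")

plt.tight_layout()
plt.show()
```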

Least squares estimation

Obtaining the least squares estimator

The Least Squares method uses the \(n\) data points \((x_i,y_i)\) to estimate the intercept \(\beta_0\) and slope \(\beta_1\) of the straight-line model \(y=\beta_0+\beta_1x\) by minimising the sum of squares of the residuals. The residuals are the vertical distances from the data points \(y_i\) to the fitted line, namely \((y_i-\{\beta_0+\beta_1x_i\})\).

We therefore minimise

\[ S(\beta_0,\beta_1) = \sum\limits_{i=1}^n (y_i - \{\beta_0 + \beta_1 x_i \})^2.\]
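In code, \(S\) is just a function of two variables. A minimal Python sketch, with placeholder data:

```python
import numpy as np

# Placeholder data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])

def S(beta0, beta1):
    """Residual sum of squares for a candidate line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals ** 2)
```

One could hand this function to a generic numerical optimiser, but as shown next, the minimum is available in closed form.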

This is just a problem of minimising a function of two variables, i.e. a simple calculus problem. We find (see your lecture notes for the full derivation) that the least squares estimators are given by \[\hat{\beta}_1=\frac{S_{xy}}{S_{xx}},\quad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar x,\] where \(S_{xx}\) and \(S_{xy}\) are defined below.
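In outline (the full details are in your lecture notes), the key step is to set the two partial derivatives of \(S\) to zero, giving the normal equations \[\frac{\partial S}{\partial \beta_0}=-2\sum\limits_{i=1}^n(y_i-\beta_0-\beta_1x_i)=0,\qquad \frac{\partial S}{\partial \beta_1}=-2\sum\limits_{i=1}^n x_i(y_i-\beta_0-\beta_1x_i)=0.\] The first equation rearranges to \(\beta_0=\bar y-\beta_1\bar x\); substituting this into the second and rearranging gives \(\hat\beta_1=S_{xy}/S_{xx}\).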

The fitted line \(y=\hat\beta_0+\hat\beta_1x\) is known as the least squares regression line of \(Y\) on \(X\).
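To make this concrete, the sketch below (again with made-up data) computes \(\hat\beta_0\) and \(\hat\beta_1\) directly from the formulas and cross-checks them against numpy's built-in degree-1 polynomial fit:

```python
import numpy as np

# Illustrative data (placeholders, not from the notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])

xbar, ybar = x.mean(), y.mean()

# Corrected sums S_xx and S_xy, as defined in the next subsection.
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))

beta1_hat = Sxy / Sxx                 # slope: S_xy / S_xx
beta0_hat = ybar - beta1_hat * xbar   # intercept: ybar - beta1_hat * xbar

print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}")

# Cross-check: np.polyfit with deg=1 returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(slope, beta1_hat) and np.isclose(intercept, beta0_hat)
```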

Calculating \(S_{xx}\) and \(S_{xy}\)

The quantity \(S_{xx}\) is called the corrected sum of squares of the \(x_i\), and may be calculated either using the identity \[ \begin{aligned} S_{xx}&=\sum\limits_{i=1}^n (x_i-\bar x)^2 \\ &=\sum\limits_{i=1}^n x_i^2 -\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2,\end{aligned}\] or by first calculating \(S_x\), the sample standard deviation of the \(x_i\), and then using \[S_{xx}=(n-1)S_x^2.\]

The corrected sum of products, \(S_{xy}\), may be calculated using \[ \begin{aligned}S_{xy}&=\sum\limits_{i=1}^n(x_i-\bar x)(y_i-\bar y) \\ &=\sum\limits_{i=1}^n x_iy_i-\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)\left(\sum\limits_{i=1}^n y_i\right).\end{aligned} \]
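As a quick numerical check, the sketch below (with the same placeholder data as before) verifies that the two expressions for \(S_{xx}\) agree, that \(S_{xx}=(n-1)S_x^2\), and that the two expressions for \(S_{xy}\) agree:

```python
import numpy as np

# Illustrative data (placeholders).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])
n = len(x)

# S_xx via the definition and via the computational identity.
Sxx_def = np.sum((x - x.mean()) ** 2)
Sxx_alt = np.sum(x ** 2) - np.sum(x) ** 2 / n

# S_xx via the sample standard deviation: (n - 1) * S_x^2.
# (ddof=1 gives the sample standard deviation with divisor n - 1.)
Sxx_sd = (n - 1) * np.std(x, ddof=1) ** 2

# S_xy via the definition and via the computational identity.
Sxy_def = np.sum((x - x.mean()) * (y - y.mean()))
Sxy_alt = np.sum(x * y) - np.sum(x) * np.sum(y) / n

assert np.isclose(Sxx_def, Sxx_alt) and np.isclose(Sxx_def, Sxx_sd)
assert np.isclose(Sxy_def, Sxy_alt)
print(f"S_xx = {Sxx_def:.4f}, S_xy = {Sxy_def:.4f}")
```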