Linear regression

Scatterplots

Types of scatterplot

These are useful for displaying the relationship between two variables \(X\) and \(Y\). There are three possibilities:

  1. Both \(X\) and \(Y\) are measurement data and both are random variables.

  2. Both \(X\) and \(Y\) are measurement data but \(X\) is selected by the experimenter.

  3. \(Y\) is a measurement variable but \(X\) is merely a code for an ordinal or nominal categorical variable.

There are two common ways in which this type of data is gathered:

  • We take a random sample of \(n\) individuals or items from a population and observe \(n\) independent pairs \((x_i,y_i)\).

  • The experimenter fixes \(n\) values \(x_1,\dots,x_n\) and measures the corresponding responses \(y_i\), subject to independent errors.

Note

One reason why we should always plot \(y_i\) against \(x_i\) before fitting any regression model is to check whether fitting a straight line makes sense, or whether some other model may be more appropriate. In some cases, to obtain a straight-line relationship it may be necessary to transform one or both of \(x\) and \(y\), e.g. by taking logs, reciprocals, square roots or other powers.
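For instance, here is a minimal Python sketch (the data values and the choice of a log transform are illustrative assumptions, not taken from these notes) of plotting the raw data next to a transformed version:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data, chosen to follow a rough power law for illustration.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = np.array([2.1, 3.9, 8.3, 15.8, 33.0, 64.5])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# Raw scatterplot: check whether a straight line is plausible.
axes[0].scatter(x, y)
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")
axes[0].set_title("Raw data")

# Log-log scatterplot: a power-law relationship appears linear here.
axes[1].scatter(np.log(x), np.log(y))
axes[1].set_xlabel("log x")
axes[1].set_ylabel("log y")
axes[1].set_title("After log transform")

plt.tight_layout()
plt.show()
```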

Least squares estimation

Obtaining the least squares estimator

The Least Squares method uses the \(n\) data points \((x_i,y_i)\) to estimate the intercept \(\beta_0\) and slope \(\beta_1\) of the straight-line model \(y=\beta_0+\beta_1x\) by minimising the sum of squares of the residuals. The residuals are the vertical distances from the data points \(y_i\) to the fitted line, namely \((y_i-\{\beta_0+\beta_1x_i\})\).

We therefore minimise

\[ S(\beta_0,\beta_1) = \sum\limits_{i=1}^n (y_i - \{\beta_0 + \beta_1 x_i \})^2.\]
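In code, \(S\) is just a function of two variables. A minimal Python sketch, with placeholder data:

```python
import numpy as np

# Placeholder data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])

def S(beta0, beta1):
    """Residual sum of squares for a candidate line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals ** 2)
```

One could hand this function to a generic numerical optimiser, but as shown next, the minimum is available in closed form.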

This is just a problem of minimising a function of two variables, i.e. a simple calculus problem. We find (see your lecture notes for the full derivation) that the least squares estimators are given by \[\hat{\beta}_1=\frac{S_{xy}}{S_{xx}},\quad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar x,\] where \(S_{xx}\) and \(S_{xy}\) are defined below.
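In outline (the full details are in your lecture notes), the key step is to set the two partial derivatives of \(S\) to zero, giving the normal equations \[\frac{\partial S}{\partial \beta_0}=-2\sum\limits_{i=1}^n(y_i-\beta_0-\beta_1x_i)=0,\qquad \frac{\partial S}{\partial \beta_1}=-2\sum\limits_{i=1}^n x_i(y_i-\beta_0-\beta_1x_i)=0.\] The first equation rearranges to \(\beta_0=\bar y-\beta_1\bar x\); substituting this into the second and rearranging gives \(\hat\beta_1=S_{xy}/S_{xx}\).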

The fitted line \(y=\hat\beta_0+\hat\beta_1x\) is known as the least squares regression line of \(Y\) on \(X\).
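To make this concrete, the sketch below (again with made-up data) computes \(\hat\beta_0\) and \(\hat\beta_1\) directly from the formulas and cross-checks them against numpy's built-in degree-1 polynomial fit:

```python
import numpy as np

# Illustrative data (placeholders, not from the notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])

xbar, ybar = x.mean(), y.mean()

# Corrected sums S_xx and S_xy, as defined in the next subsection.
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))

beta1_hat = Sxy / Sxx                 # slope: S_xy / S_xx
beta0_hat = ybar - beta1_hat * xbar   # intercept: ybar - beta1_hat * xbar

print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}")

# Cross-check: np.polyfit with deg=1 returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(slope, beta1_hat) and np.isclose(intercept, beta0_hat)
```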

Calculating \(S_{xx}\) and \(S_{xy}\)

The quantity \(S_{xx}\) is called the corrected sum of squares of the \(x_i\), and may be calculated either using the identity \[ \begin{aligned} S_{xx}&=\sum\limits_{i=1}^n (x_i-\bar x)^2 \\ &=\sum\limits_{i=1}^n x_i^2 -\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2,\end{aligned}\] or by first calculating \(S_x\), the sample standard deviation of the \(x_i\), and then using \[S_{xx}=(n-1)S_x^2.\]

The corrected sum of products, \(S_{xy}\), may be calculated using \[ \begin{aligned}S_{xy}&=\sum\limits_{i=1}^n(x_i-\bar x)(y_i-\bar y) \\ &=\sum\limits_{i=1}^n x_iy_i-\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)\left(\sum\limits_{i=1}^n y_i\right).\end{aligned} \]
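As a quick numerical check, the sketch below (with the same placeholder data as before) verifies that the two expressions for \(S_{xx}\) agree, that \(S_{xx}=(n-1)S_x^2\), and that the two expressions for \(S_{xy}\) agree:

```python
import numpy as np

# Illustrative data (placeholders).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])
n = len(x)

# S_xx via the definition and via the computational identity.
Sxx_def = np.sum((x - x.mean()) ** 2)
Sxx_alt = np.sum(x ** 2) - np.sum(x) ** 2 / n

# S_xx via the sample standard deviation: (n - 1) * S_x^2.
# (ddof=1 gives the sample standard deviation with divisor n - 1.)
Sxx_sd = (n - 1) * np.std(x, ddof=1) ** 2

# S_xy via the definition and via the computational identity.
Sxy_def = np.sum((x - x.mean()) * (y - y.mean()))
Sxy_alt = np.sum(x * y) - np.sum(x) * np.sum(y) / n

assert np.isclose(Sxx_def, Sxx_alt) and np.isclose(Sxx_def, Sxx_sd)
assert np.isclose(Sxy_def, Sxy_alt)
print(f"S_xx = {Sxx_def:.4f}, S_xy = {Sxy_def:.4f}")
```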