Descriptive statistics

Classification of data

This topic is important because it helps us decide which diagrams and summary statistics will be useful and which confusing. We will see how to create many of these diagrams using R/RStudio in the computing practicals.

Quantitative vs qualitative

Quantitative: The observed variable takes different numerical values eg weight, price.

Qualitative: The variable describes a characteristic eg sex or hair colour. Observations can be grouped into categories, eg Male or Female. For convenience, they may be given numerical codes for computing purposes but these are not to be taken literally.

Discrete vs continuous

Discrete: The possible values are distinct and have, or can be given, a unique numerical code (e.g. 0/1 for die/live). Discrete data can be nominal (order doesn’t matter, e.g. eye colour) or ordinal (order does matter, e.g. preferences).

Continuous: The number of possible values is very large (strictly speaking, uncountably infinite). Continuous variables are often measurements, e.g. temperature, length.

Another distinction is interval vs ratio. Interval variables are those for which the arithmetic difference of two observations is meaningful (e.g. it’s 2 degrees C hotter today than yesterday). If ratios of two observations are also meaningful then the variable is of ratio type. Ratio variables have a true zero: a value of zero means a complete absence of the quantity. Money, for instance, is a ratio-type variable (£0 means lack of money, and statements like ‘x costs twice as much as y’ make sense). Temperature in degrees C is interval but not ratio (30C isn’t twice as hot as 15C in any meaningful sense; converting both to Fahrenheit would convince you of this).
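The Fahrenheit remark can be checked numerically. The following is a minimal Python sketch (Python is used here only for illustration; the function name and the example prices are made up). It shows that the ratio of two temperatures changes with the scale, whereas the ratio of two prices does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

hot_c, mild_c = 30.0, 15.0
hot_f = celsius_to_fahrenheit(hot_c)    # 86.0
mild_f = celsius_to_fahrenheit(mild_c)  # 59.0

# The ratio depends on the scale, so "twice as hot" has no meaning.
ratio_c = hot_c / mild_c  # 2.0
ratio_f = hot_f / mild_f  # 86 / 59, roughly 1.46

# Differences are meaningful: they only change by the fixed factor 9/5.
diff_c = hot_c - mild_c  # 15.0
diff_f = hot_f - mild_f  # 27.0

# Money is ratio type: changing units (pounds to pence) keeps the ratio.
price_x_pounds, price_y_pounds = 10.0, 5.0  # illustrative prices
ratio_pounds = price_x_pounds / price_y_pounds
ratio_pence = (100 * price_x_pounds) / (100 * price_y_pounds)

print(ratio_c, round(ratio_f, 2), diff_f / diff_c, ratio_pounds, ratio_pence)
```

The key difference is that the Celsius zero is an arbitrary reference point, while £0 is a genuine absence of money.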

Note

Large counts, though strictly speaking discrete, can often be treated as being effectively continuous. For instance, the number of children vaccinated against some disease in a country is strictly speaking discrete (you can’t vaccinate half a child!), but since the numbers will in general be very large, the kinds of analyses and data presentations that we usually apply to continuous data will be more appropriate.


Using classification to inform plot choice

Often, our job as statisticians is to summarise the data and present it in an easily interpretable fashion. Often this will be in the form of a plot. It’s therefore important, bearing data type classification in mind, to make a sensible choice of plot that best summarises the data.

Discrete data

Discrete data can be conveniently summarised in tables and using proportions or percentages. They may be displayed using barcharts or piecharts. Barcharts are a very flexible tool since the various options, eg stacked/clustered, horizontal/vertical, count/percent, can be used to shed light on different questions. Piecharts are often less informative since it is more difficult to distinguish between similarly sized regions.
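As an illustration of summarising discrete data in a table of counts and percentages, here is a small Python sketch. The eye-colour observations are invented for the example, and `Counter` from the standard library does the tallying.

```python
from collections import Counter

# Hypothetical nominal data: eye colour of 8 people.
eye_colours = ["brown", "blue", "brown", "green",
               "blue", "brown", "brown", "hazel"]

counts = Counter(eye_colours)
n = len(eye_colours)

# Map each category to (count, percentage), most frequent first.
table = {colour: (k, 100 * k / n) for colour, k in counts.most_common()}

for colour, (k, pct) in table.items():
    print(f"{colour:<6} {k:>3} {pct:6.1f}%")
```

The same counts are what a barchart would display, one bar per category.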

Continuous data

Continuous data can be summarised using grouped frequency tables and summary statistics, eg averages. They can be displayed using dotplots, stem-and-leaf diagrams or histograms. Box-and-whisker plots provide a useful summary of the distribution when there is a lot of data.
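A grouped frequency table can be sketched in a few lines of Python. The plant heights and the class width of 10 cm below are assumptions made for the example; in practice the class width is a judgment call.

```python
from collections import Counter

# Hypothetical plant heights in cm.
heights_cm = [12.1, 15.4, 18.9, 21.0, 22.7, 23.5,
              24.8, 26.2, 27.9, 31.3, 33.0, 38.6]
width = 10  # class width in cm (an assumption for this example)

def class_lower_bound(x, width):
    """Return the lower end of the class [lower, lower + width) containing x."""
    return width * int(x // width)

grouped = Counter(class_lower_bound(x, width) for x in heights_cm)

for lower in sorted(grouped):
    print(f"[{lower}, {lower + width}) cm: {grouped[lower]}")
```

These class counts are exactly the bar heights of a histogram with the same classes.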


Summary statistics

Suppose we’re given a list of a couple of hundred measurements of the heights of plants. One of the first things we might want to do to understand the data is to compute a few summary statistics: numbers that describe important aspects of the data. So what are we looking for when we seek summary statistics? Roughly in order of importance, consider:

  1. Location: What is a “representative” value? The “centre” of the data?

  2. Spread: How variable is the data? What range or dispersion?

  3. Shape: Are there unusual or outlying values? Mistakes? Or crucial evidence? Is there more than one group? Bimodal? Multimodal? Symmetric or skewed? If skewed, is the longer tail on the right (positively skewed) or on the left (negatively skewed)?

  4. Important ranges: Are there target specifications? What proportion lie inside them? Are there key thresholds? What proportion lie above or below them?
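Questions about important ranges come down to counting. As a rough Python sketch, with invented component lengths and an assumed target specification of 49.5 mm to 50.5 mm, the proportions inside the range and above a threshold could be computed like this (both helper functions are hypothetical names introduced for the example):

```python
def proportion_within(data, lower, upper):
    """Proportion of observations with lower <= x <= upper."""
    return sum(1 for x in data if lower <= x <= upper) / len(data)

def proportion_above(data, threshold):
    """Proportion of observations strictly greater than threshold."""
    return sum(1 for x in data if x > threshold) / len(data)

# Hypothetical component lengths in mm.
lengths_mm = [49.2, 50.1, 50.4, 49.8, 51.3, 50.0, 48.7, 50.6, 49.9, 50.2]

within_spec = proportion_within(lengths_mm, 49.5, 50.5)
too_long = proportion_above(lengths_mm, 50.5)

print(within_spec, too_long)
```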

Measures of central tendency

Suppose we have a sample of measurements, say \(n\) of them, and suppose they have been taken at random so as to be representative. Let’s call them \(x_1,x_2,\ldots,x_n\).

The median: \(m\)

50% of data lie below \(m\), 50% above. We put data in increasing order. Then:

  • if \(n\) is odd then \(m\) is the middle value.

  • if \(n\) is even then \(m\) is the average (mean) of the middle two.

The median can also be found easily from a stem and leaf diagram.
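The odd/even rule above translates directly into code. The following Python sketch implements it by hand and checks the result against `statistics.median` from the standard library; the sample values are arbitrary.

```python
import statistics

def median(values):
    """Median by the rule above: middle value, or mean of the middle two."""
    ordered = sorted(values)  # put the data in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # n odd: middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: mean of middle two

odd_sample = [7, 3, 9, 1, 5]   # sorted: 1 3 5 7 9
even_sample = [7, 3, 9, 1]     # sorted: 1 3 7 9

print(median(odd_sample), median(even_sample))
print(statistics.median(odd_sample), statistics.median(even_sample))
```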

The mean: \(\bar x\)

Also known as the sample mean, arithmetic mean, or average. Calculated as

\[\begin{split}\begin{aligned}\bar x &= (x_1+x_2+x_3+\ldots+x_n)/n \\ &= \frac{1}{n}\sum\limits_{j=1}^n x_j.\end{aligned}\end{split}\]

The mode

Defined as the most frequently occurring value. For measurement data this is rarely of interest. However, in discrete data, especially if there are only a few possible values, the mode is often of intrinsic interest simply because it is the most frequently observed value. For example, the most popular of 5 brands of soap would be of interest.
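For the soap example, the mode can be found by tallying the responses. This is a minimal Python sketch with made-up survey data for five brands labelled A to E; `statistics.multimode` is used so that ties, if any, would all be reported.

```python
import statistics
from collections import Counter

# Hypothetical survey: preferred brand of soap for 10 respondents.
preferred_brand = ["A", "C", "B", "C", "E", "C", "A", "D", "C", "B"]

counts = Counter(preferred_brand)
mode = statistics.mode(preferred_brand)        # single most frequent value
modes = statistics.multimode(preferred_brand)  # list of all tied modes

print(counts.most_common())
print(mode, modes)
```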

Mean vs median

The mean makes more efficient use of all the data but is strongly affected by outliers (if present).

The median is largely unaffected by outliers: it is outlier resistant.
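A small numerical experiment makes the contrast concrete. In this Python sketch the data are invented, and the outlier is imagined as a recording error (130 typed instead of 13).

```python
import statistics

clean = [10, 12, 11, 13, 12, 11, 12]
with_outlier = [10, 12, 11, 130, 12, 11, 12]  # 13 mis-recorded as 130

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

print(f"mean:   {mean_clean:.2f} -> {mean_outlier:.2f}")
print(f"median: {median_clean} -> {median_outlier}")
```

A single wild value drags the mean well away from the bulk of the data, while the median stays where it was.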

Measures of spread

The Range

This is just the difference between the maximum and minimum values, i.e. max-min. It depends only on the two most extreme observations, so if there are outliers they determine it entirely.

The Interquartile Range (IQR)

This is the range of the central 50% of the data.

The lower quartile, often denoted \(Q_1\), has 25% of the data below it, and 75% of the data above it.

The upper quartile, \(Q_3\), has 75% of the data below it, and 25% of the data above it.

The IQR is then calculated as \(IQR=Q_3-Q_1\).

The IQR is totally unaffected by outliers but only measures how dispersed the central 50% of the data is. The range and the IQR therefore complement each other.
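The complementary behaviour of the range and the IQR can be seen in a short Python sketch. The data are invented, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`. Several conventions for sample quartiles exist, and other software or hand methods may give slightly different values.

```python
import statistics

def data_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range Q3 - Q1, using the 'inclusive' quartile rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [2, 4, 4, 5, 6, 7, 8, 9, 10]
with_outlier = [2, 4, 4, 5, 6, 7, 8, 9, 100]  # largest value replaced

print(data_range(clean), data_range(with_outlier))
print(iqr(clean), iqr(with_outlier))
```

Here the outlier changes the range dramatically but leaves the IQR untouched.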

The sample standard deviation (\(s\))

Often written as \(SD(x)\), this is the square root of the sample variance, i.e. \(\sqrt{s^2}\), where

\[\begin{split}\begin{aligned} s^2&=\frac{1}{n-1}\sum\limits_{i=1}^n(x_i-\bar x)^2\\&=\frac{1}{n-1}\left\{\sum\limits_{i=1}^n x_i^2-\frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2\right\}.\end{aligned}\end{split}\]
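For completeness, the second form follows from the first by expanding the square and using \(\sum_{i=1}^n x_i = n\bar x\):

\[\begin{split}\begin{aligned}\sum\limits_{i=1}^n(x_i-\bar x)^2 &= \sum\limits_{i=1}^n x_i^2 - 2\bar x\sum\limits_{i=1}^n x_i + n\bar x^2 \\ &= \sum\limits_{i=1}^n x_i^2 - 2n\bar x^2 + n\bar x^2 \\ &= \sum\limits_{i=1}^n x_i^2 - n\bar x^2 \\ &= \sum\limits_{i=1}^n x_i^2 - \frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2.\end{aligned}\end{split}\]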

\(s^2\) is a type of average of the squared deviations of the data from its mean value. Why divide by \((n-1)\) rather than \(n\)? Dividing by \((n-1)\) gives an unbiased estimator of the variance of the whole population. Dividing by \(n\) tends to give too low a value.

Calculating \(s^2\) is best done using a computer package like R or by using the STAT mode on your calculator. Computing via the first equality above is slow and error-prone. The second equality is mainly of theoretical interest but is sometimes useful if no computer is available and the data is given only as a grouped frequency table.
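As a check on the two forms of the formula, the following Python sketch computes \(s^2\) both ways on a small invented data set and compares the results with `statistics.variance`, which also uses the \((n-1)\) divisor. The function names are introduced only for this example.

```python
import math
import statistics

def variance_definition(xs):
    """First form: sum of squared deviations from the mean, over n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Second form: (sum of squares - (sum)^2 / n), over n - 1."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

s2 = variance_definition(data)
s2_shortcut = variance_shortcut(data)
s = math.sqrt(s2)  # sample standard deviation

# Dividing by n instead of n - 1 gives a smaller value.
divide_by_n = sum((x - 5) ** 2 for x in data) / len(data)

print(s2, s2_shortcut, statistics.variance(data), s, divide_by_n)
```

One caution on the shortcut form: in floating-point arithmetic it subtracts two large, nearly equal numbers when the mean is large relative to the spread, so it can lose accuracy.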

Note

All three measures (range, IQR and standard deviation) are in the same units as the original data. The sample variance, \(s^2\), is in (units)\(^2\).

The Box and Whisker Plot

This is a useful graphical technique for summarising and comparing medium to large sets of data. It is based on the so-called five-number summary (minimum, lower quartile, median, upper quartile, maximum) and is drawn as follows:

  • The median is represented by the line inside the box.

  • The upper and lower quartiles are represented by the upper and lower sides of the box.

  • The maximum and minimum are represented by the extreme ends of the ‘whiskers’, provided these lie no more than \(1.5 \times IQR\) beyond the upper and lower quartiles respectively. Otherwise each whisker stops at the most extreme data point within that distance.

  • Outliers, any data points more than \(1.5 \times IQR\) beyond the quartiles, are represented individually by an asterisk or dot.
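The quantities behind a box and whisker plot can be computed directly. The Python sketch below does this for an invented data set; `boxplot_stats` is a hypothetical helper, and the quartiles use the "inclusive" rule of `statistics.quantiles`, which is one of several conventions, so plotting software may place the box edges slightly differently.

```python
import statistics

def boxplot_stats(values):
    """Return the numbers needed to draw a box and whisker plot."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # points below this are outliers
    upper_fence = q3 + 1.5 * iqr  # points above this are outliers
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": inside[0],    # most extreme point within the fences
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(data)
print(stats)
```

In this example the value 30 lies above the upper fence of 14, so it is drawn as a separate point and the upper whisker stops at 9.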

For multiple boxplots, the width of each box is sometimes plotted to be proportional to the square root of the sample size in order to indicate the size of each group.

Warning

It’s important to note that boxplots show the median, not the mean. A common pitfall when interpreting a boxplot is to start waxing lyrical about how it shows something about the mean. It does not (although some boxplots do add the mean as an extra marker); the line across the box is the median.