Distributions, or more specifically probability distributions, describe the values a random variable can take and how often each occurs. The most commonly recognisable distribution is the normal distribution, which is often visualised with a histogram and follows a 'bell-shaped curve', showing that the distribution is symmetrical and concentrated around the middle.
The histogram below shows an example of a distribution of exam scores in a class, with the bell-shaped curve plotted on top in black.
We would say that this distribution is normal because it is approximately symmetric about the middle and overall approximately follows the shape of the black line.
A histogram is not the only way to display distributions though: a Q-Q plot (which stands for 'quantile-quantile plot'), a type of scatter plot, is also good at showing distributions. A Q-Q plot makes it easy to check the normality of data, although unlike a histogram it does not show the bell-shaped symmetry directly. Instead, a diagonal reference line runs across the chart at a 45° angle, and normal data is data which more or less falls along this line. Here is the same distribution of exam scores as above, this time shown on a Q-Q plot.
We can see the black diagonal line with the blue data points which more or less run along it without any significant deviation, showing that this data is normally distributed.
Normality can also be investigated using tests such as the Shapiro-Wilk test or the Jarque-Bera test, which instead of producing graphs to display the distribution, produce a test statistic and p-value to be compared with the significance level of a hypothesis test.
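As a sketch of how such a test works, the Jarque-Bera statistic can be computed with nothing but the Python standard library, since it is built from the sample skewness and kurtosis (the exam scores below are made up for illustration). Under the null hypothesis of normality the statistic follows a chi-squared distribution with two degrees of freedom, whose survival function is simply exp(-x/2):

```python
import math

def jarque_bera(data):
    """Jarque-Bera test: returns the statistic and its p-value."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2                 # equals 3 for a normal distribution
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)         # chi-squared(2) survival function
    return jb, p_value

# Roughly symmetric, made-up exam scores: a large p-value, so no
# evidence against normality at the usual 5% significance level.
scores = [45, 50, 52, 55, 58, 60, 60, 62, 65, 68, 70, 75]
jb, p = jarque_bera(scores)
```

In practice you would reach for a library implementation (for example `scipy.stats.jarque_bera`), but the by-hand version shows exactly what the test is measuring.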
When using normally distributed data, the mean is good for measuring central tendency or average. This is because the mean is a good measure of the average when the data is continuous and symmetrical. For non-normal data, the median tends to be a better indication as it would be less affected by the presence of outliers than the mean.
Skew or skewness is a way we can measure the shape of a distribution: in particular, skew refers to the asymmetry a distribution may experience. A distribution may be right-skewed or left-skewed, depending on which side the tail lies. Real data is most likely going to be skewed in some way. It is easy to detect skew in your data by seeing if the mean and median of your dataset are not approximately equal.
With skewed data, the mean is often no longer an accurate measure of central tendency: instead, the median is a better measurement, as it represents the middle value in a dataset and so more accurately describes the centre of the data distribution.
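A quick way to see this is to compare the mean and median on a small made-up set of scores, with and without an outlier; this sketch uses only the standard library:

```python
import statistics

# Symmetric scores: the mean and the median agree.
symmetric = [50, 55, 60, 65, 70]

# One very low outlier drags the mean well below the median,
# but the median barely moves.
with_outlier = [5, 55, 60, 65, 70]

print(statistics.mean(symmetric), statistics.median(symmetric))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

The median is unchanged by the outlier, which is exactly why it is preferred for skewed data.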
There are quite a few formulas you can use to calculate skewness, for example Pearson's median skewness, which shows how many standard deviations lie between the mean and median of a data set and can be calculated as

skewness = 3 × (mean − median) / standard deviation,
although you may prefer to use software to calculate this. A distribution with zero skew is one which is purely symmetrical, like the normal distribution. Another distribution with no skew is the uniform distribution. Positive skew indicates a right-tailed distribution, and negative skew indicates a left-tailed one.
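As a minimal sketch (the function name and data are our own), Pearson's median skewness takes only a few lines of standard-library Python:

```python
import statistics

def pearson_median_skewness(data):
    """Pearson's median skewness: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    return 3 * (mean - median) / statistics.stdev(data)

# Symmetric data gives a skewness of zero...
print(pearson_median_skewness([1, 2, 3, 4, 5]))
# ...while a long right tail gives a positive value.
print(pearson_median_skewness([1, 2, 2, 3, 3, 3, 4, 10]))
```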
If you particularly need to work with normal data, for example you wish to use a parametric test for inferential statistics, it may be worth trying to transform your data.
Right skewed data, also known as positively skewed data, is data which is not symmetric and tails off to the right hand side. Right skew typically goes together with the mean value of the dataset being higher than the median value.
Let's suppose a different class than the example above took the same exam and now has the following distribution of exam scores:
Here, the 'hump' of the histogram sits towards the left hand side and tails off towards the right - this is right-skewed data and is no longer normal. Here, the students tended to do more poorly in their exam as the majority of the data lies towards the lower end of the scale. We can also visually display this on a Q-Q plot.
Most of the data points lie more or less along the black line, however near the top and bottom there is some deviation. This shows that the data is not quite normal and is therefore skewed.
Examples of right-skewed distributions are the Poisson and the Rayleigh distributions.
By contrast, left skewed data, also known as negatively skewed data, is data which is not symmetric and tails off to the left hand side. If the mean of the dataset is lower than the median, the dataset typically has a left skew.
Let's suppose this time that a third class took the same exam as the other two, and their scores are shown by this histogram:
We can tell that more students performed well in this class compared to the previous two as the concentration is towards the higher end of the scale. We can also have a look at the Q-Q plot for this data which will show the same skewness:
We can see that the data points do not really follow the black line, indicating that the distribution is not a normal one.
These distributions can also be summarised using box plots, like the ones below. It is easy to see the approximate symmetry around the middle for class 1 and its normally distributed exam scores, the long tail on top showing the right skew in class 2, and the longer tail at the bottom of the box plot showing left skew in class 3.
It is perfectly acceptable to report the values of skew (or kurtosis) of a distribution without displaying its shape in a histogram or Q-Q plot, as long as you demonstrate understanding of what the value means and how it is used to interpret the shape. What this means is that you take the values that the calculations for skewness and kurtosis have given you, and then provide the name of the type of skew or kurtosis as appropriate. For example:
"The variable has a skewness of 0.94, which indicates that it has a moderate positive skew."
Kurtosis is a measurement of the tailedness of a distribution, that is, what the ends of a distribution curve look like. Tailedness indicates how prone a distribution is to producing outliers. Tails can be fat, thin or somewhere in between: a normal distribution is said to have a medium tailedness.
There exist many formulas to calculate the kurtosis of a distribution, although again it may be more desirable to calculate this using software as opposed to a formula by hand. Kurtosis is often described in terms of 'excess kurtosis', that is, how far the kurtosis is from a value of 3: this is because the normal distribution has a kurtosis of 3. Excess kurtosis is therefore calculated as

kurtosis − 3.
There are three kinds of kurtosis - mesokurtosis, platykurtosis and leptokurtosis.
Mesokurtic distributions are ones which more or less follow a typical normal distribution: they have a medium tailedness and therefore aren't said to have any excess kurtosis. A distribution with a kurtosis of approximately 3 would be said to be mesokurtic.
Platykurtic distributions are ones which have thin tails. A distribution with a kurtosis of less than 3 (and therefore a negative excess kurtosis) is platykurtic.
One thing that tends to trip students up with platykurtic distributions is the idea that they should be rather flat on top: that idea comes from the fact that 'platykurtic' sounds similar to 'plateau', which of course is an elevated level surface! Unfortunately, this is not how it works in statistics. Kurtosis is only to do with the tailedness of a distribution, not what it looks like on top.
The uniform and the Wigner semicircle distributions are examples of platykurtic distributions.
Leptokurtic distributions are distributions with fat tails. A distribution with a kurtosis greater than 3 (and therefore a positive excess kurtosis) is leptokurtic. These kinds of distributions are more prone to the presence of outliers, as their fat tails mean that extreme values occur more often than they would under a normal distribution.
Examples of leptokurtic distributions are the Student's t, the Laplace and the Poisson distributions.
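These three categories can be illustrated with a short standard-library sketch (the helper names and data are made up), comparing the sample kurtosis against the normal distribution's value of 3:

```python
def kurtosis(data):
    """Fourth standardised moment; a normal distribution scores 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2

def classify(data):
    """Label a sample by its excess kurtosis (kurtosis minus 3)."""
    excess = kurtosis(data) - 3
    if excess > 0:
        return "leptokurtic"    # fat tails
    if excess < 0:
        return "platykurtic"    # thin tails
    return "mesokurtic"         # normal-like tails

# Evenly spread values have thin tails...
print(classify([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
# ...while a single extreme value makes the tails fat.
print(classify([0, 0, 0, 0, 0, 0, 0, 0, 0, 10]))
```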
Is skewness or kurtosis ever a problem?
Skewness can be a problem when it comes to certain statistical techniques, such as inferential statistics, because many of them assume that the data is normally distributed. In particular, non-normal data may contain outliers, so it is worth transforming skewed data to make it more normal before relying on a parametric test, the central limit theorem, linear modelling, and so on. Violating the normality assumption may produce unreliable results.
Real life data can be skewed, so it is important to know what to do in order to make data you have gathered usable. If there is only a slight skew, that is fine and you can use the data as it is, however moderately or heavily skewed data should be transformed before you use a parametric test. There are a few ways to transform data, and a few are mentioned here: if any of these do not work, feel free to 'shop around' for alternative ones! Note that a linear transformation will not help, as it leaves the skewness unchanged; it takes a non-linear transformation, such as a square root or logarithm, to manipulate right-skewed data into being more normal.
Moderately right skewed data can be transformed by taking the square root of your data points, and more heavily right skewed data can be corrected by taking the natural log or the log to the base 10 of the data. Left skewed data can be transformed by first reflecting it: replace each data point x with

(maximum value in the data set) + 1 − x,

which produces a right-skewed dataset, and then apply the same transformations as listed for right skew. Have a play around with different methods to see what works best.
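These transformations can be sketched in plain Python (the datasets are invented for illustration); a moment-based skewness measure confirms that the log transform makes the right-skewed data noticeably more symmetric:

```python
import math

def moment_skewness(data):
    """Third standardised moment: positive for right skew, negative for left."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Right-skewed marks: a long tail of high values.
right_skewed = [10, 12, 13, 15, 18, 22, 30, 45, 70]

# Moderate right skew: square root. Heavier right skew: natural log.
sqrt_transformed = [math.sqrt(x) for x in right_skewed]
log_transformed = [math.log(x) for x in right_skewed]

# The log transform pulls the long right tail in, reducing the skewness.
print(moment_skewness(right_skewed))
print(moment_skewness(log_transformed))

# Left-skewed data: reflect each point about (max + 1) so the tail points
# right, then reuse the right-skew transformations.
left_skewed = [30, 55, 60, 62, 64, 65, 66]
reflected = [max(left_skewed) + 1 - x for x in left_skewed]
log_reflected = [math.log(x) for x in reflected]
```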