**Pearson's r Correlation**

Pearson's r correlation is a parametric correlation test used to measure the association between continuous variables which are normally distributed, homoscedastic and independent.

The tabs of this guide will support you in understanding and performing Pearson's r. The sections are organised as follows:

- **Understanding Correlations** - understanding what correlations are in statistics.
- **Pearson's r** - the correlation coefficient created by Pearson.
- **Performing Pearson's r** - performing the test using various software packages.

Inferential statistics is the branch of statistics based on inference: drawing conclusions about a population from the sample data you have collected. Inferential statistics include tests such as t-tests, ANOVA, Z-tests and correlations.

Correlations are tests which measure the level of association between variables. In other words, as one variable goes up, what does the other one do? Does it also go up at the same rate? Go up at a different rate? Does it go down at the same rate or a different rate? Does it do anything at all? Does it do its own thing? If the second variable consistently goes up or down in response to the first, there might be a correlation between them.

A correlation is a connection or relationship between two things, variables, items, people, etc.

Some examples of correlations in real life include:

- New parents receive a weight chart to make sure their new baby will gain weight at the right rate. The weight chart shows the relationship between a baby's weight and age in months.
- A clothing shop puts out warmer stock in winter as customers favour warmer clothing in colder weather.
- A local council attempts to reduce the local crime rate by increasing the funding for local schools, because they understand that higher education levels are associated with lower crime.

Correlations can be positive or negative, and strong or weak. An easy way to see if a correlation exists is by plotting the two variables against each other in a scatterplot. You should be able to see if the relationship is linear, what the slope of the relationship is and if there are any outliers. If you are using software to build this scatterplot then you can also easily add a line of best fit to show the overall trend between the variables.

Correlations may exist because the variables have influence over each other, because there is another variable which causes both, or by complete coincidence! In the last case, this is called a 'spurious correlation': a correlation which appears significant but reflects no actual relationship. How can you tell if a correlation is spurious? Use common sense! Remember, correlation ≠ causation.

In order to measure the strength and direction of a correlational relationship between variables, we need to use a correlation coefficient, which is a number used to represent the correlation. It indicates the strength and direction of the relationship. This is a number between -1 and 1. A negative correlation coefficient indicates a negative relationship between variables, whereas a positive number is indicative of a positive relationship.

As a general rule, you can interpret correlation coefficients as follows:

- 0.7 < *r* ≤ 1: strong positive correlation
- 0.4 < *r* ≤ 0.7: moderate positive correlation
- 0 < *r* ≤ 0.4: weak positive correlation
- *r* = 0: no correlation
- -0.4 ≤ *r* < 0: weak negative correlation
- -0.7 ≤ *r* < -0.4: moderate negative correlation
- -1 ≤ *r* < -0.7: strong negative correlation

Therefore, the closer the coefficient is to 1 or -1, the stronger the correlation and the more closely the variables are related; the closer it is to 0, the weaker the correlation.
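The banding above can be written as a small helper function. This is an illustrative sketch, not part of any statistics library:

```python
def interpret_r(r):
    """Map a correlation coefficient to the verbal labels above (illustrative helper)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    magnitude = abs(r)
    if magnitude > 0.7:
        strength = "strong"
    elif magnitude > 0.4:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} correlation"

print(interpret_r(0.85))   # → strong positive correlation
print(interpret_r(-0.25))  # → weak negative correlation
```

Note that the boundary values follow the table: for example, *r* = 0.7 still counts as a moderate correlation, since the strong band only begins above 0.7.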

There are various types of correlation coefficients, including Pearson's correlation coefficient and Kendall's tau coefficient. Pearson's correlation coefficient, known as *r*, is used when the variables are continuous (or at the very least interval), and the relationship between variables is linear. Kendall's *tau*, on the other hand, can be used for any monotonic relationship, whether positive or negative.
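One way to see the difference between the two coefficients is to compare them on data that is perfectly monotonic but not linear. The sketch below assumes SciPy is available, and the data are invented for illustration:

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau  # assumes SciPy is installed

# A perfectly monotonic but non-linear relationship
x = np.arange(1, 11, dtype=float)
y = np.exp(x)  # y always increases with x, but not along a straight line

r, _ = pearsonr(x, y)
tau, _ = kendalltau(x, y)

print(f"Pearson's r:   {r:.3f}")   # below 1: penalises the non-linearity
print(f"Kendall's tau: {tau:.3f}") # exactly 1: only monotonicity matters
```

Because every increase in *x* is matched by an increase in *y*, Kendall's tau is exactly 1, while Pearson's *r* falls short of 1 because the points do not lie on a straight line.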

Correlations are relationships between variables. What correlations do __not__ do is hypothesize any cause-and-effect relationship. Of course, two variables __may__ be correlated because one causes the other, but that can only be determined using further research, and not with correlations alone.

Always remember that correlations are not the same as causations.

Pearson's *r* was created by Karl Pearson, a British statistician, in the early 20th century. It is used to measure the strength and direction of linear relationships between two continuous variables, and it also works well for interval data.

When continuous data is collected from a sample and plotted in a scatterplot to reveal linearity, a correlation can be performed to measure the type of relationship that exists between the variables. Recall that correlation coefficients can be interpreted as follows:

- 0.7 < *r* ≤ 1: strong positive correlation
- 0.4 < *r* ≤ 0.7: moderate positive correlation
- 0 < *r* ≤ 0.4: weak positive correlation
- *r* = 0: no correlation
- -0.4 ≤ *r* < 0: weak negative correlation
- -0.7 ≤ *r* < -0.4: moderate negative correlation
- -1 ≤ *r* < -0.7: strong negative correlation

and Pearson's *r* is interpreted in exactly the same way.

This test is a parametric test and therefore follows strict assumptions which must be met in order to obtain accurate and reliable results:

- Data should be continuous or interval
- A linear relationship between the variables should exist - this can be checked with a scatterplot
- No significant outliers should exist in the data
- The data should be approximately normally distributed
- The data should be homoscedastic
- The variables need to be independent of each other
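Some of these checks can be sketched in code. The example below uses hypothetical simulated data and assumes SciPy is installed; it illustrates a normality check and a simple outlier screen, while linearity would still be checked visually with a scatterplot:

```python
import numpy as np
from scipy.stats import shapiro  # assumes SciPy is installed

rng = np.random.default_rng(42)
x = rng.normal(50, 10, 100)          # hypothetical variable 1
y = 0.8 * x + rng.normal(0, 5, 100)  # hypothetical variable 2

# Approximate normality: Shapiro-Wilk test
# (p > 0.05 -> no evidence against normality)
for name, data in (("x", x), ("y", y)):
    stat, p = shapiro(data)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# No significant outliers: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
print("outliers in x:", np.sum(np.abs(z) > 3))
```

These are rough screens, not substitutes for judgement: a scatterplot remains the simplest way to check linearity and homoscedasticity together.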

Pearson's product-moment correlation, also known as Pearson's *r*, is a parametric correlation used with interval data when the relationship between the variables is linear. It is a type of correlation test, which means it looks for the strength and direction of the relationship between the variables involved. *r* is the correlation coefficient in this test and is calculated by:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)² )

where:

- *r* is Pearson's correlation coefficient
- *n* is the number of data points, and each sum runs over *i* = 1, …, *n*
- *xᵢ* represents each data point in one variable, and x̄ is the mean of that variable
- *yᵢ* represents each data point in the other variable, and ȳ is the mean of that other variable.

However, *r* is not often calculated by hand - this should be done using software instead.

Pearson's *r*, like most (if not all) inferential statistical tests at university, should be performed with software to reduce the risk of human error. Below are some ways of calculating Pearson's *r* using various software packages.

Below, we have used *x* and *y* to represent variables and *dataset* as the name of the dataset being used, but of course when it comes to your own data analysis you should use the names of your own variables and dataset!

Make sure you have checked the assumptions before calculating *r*.

In Excel you can simply use the built-in correlation function to calculate the correlation coefficient:

`=PEARSON(A:A, B:B)`

or if your data lies in a specific range (say, from rows 2 to 40) `=PEARSON(A2:A40, B2:B40)`

if your variables in question are in columns A and B respectively.

Alternatively, if you have the Analysis ToolPak installed in Excel, you can go to the **Data** tab and click **Data Analysis** over on the right to bring up the menu. Then, select **Correlation** and a new pop-up will appear: enter the range containing your data into the **Input Range** box - for this example, let's say our variables are in columns A and B and the data points are in rows 2 to 40.

In the **Grouped By** option make sure **Columns** is selected, and below that that the **Labels in first row** box is ticked. In the **Output options** box, you can choose whether your correlation output should be in a specific cell in your current worksheet, in a new sheet or in a new workbook entirely.

Clicking **OK** afterwards will provide a correlation matrix of the variables.

In MATLAB, for variables *x* and *y* you can use the code:

`R = corrcoef(x, y);`

`R`

to calculate the coefficient, or:

`[R, P] = corrcoef(x, y)`

to obtain both the matrix of correlation coefficients and the corresponding matrix of p-values.

After importing your data into R, and coding your variables *x* and *y* as appropriate in your dataset named *dataset*, you can use the following code to calculate Pearson's *r*:

`cor(dataset$x, dataset$y, method = 'pearson')`

Pearson's *r* is performed in SAS with the following code, where *x* and *y* represent your variables and your dataset is named *dataset*:

`proc corr data = dataset;`

`var x y;`

`run;`

In SPSS, go to **Analyze** at the top of the screen and choose **Correlate** and then **Bivariate**. In the pop-up window, click-and-drag your relevant variables to be tested into the '**Variables**' box, or alternatively click them so they are highlighted and then use the arrow to send them across. Make sure the **Pearson** checkbox is ticked and then click **Options**: in the new pop-up window, make sure that in the **Missing Values** box the **Exclude cases pairwise** option is selected, and then click **Continue**. When you return to the previous pop-up window, click **OK** to generate a matrix of results.