Simple Linear Regression (SLR) is a statistical test, more specifically a type of F test, which involves one continuous independent variable and one continuous dependent variable. It can be used both to investigate the strength of the relationship between these variables and to predict the value of the dependent variable at a particular value of the independent variable.
It is a type of linear regression (which is itself a type of regression) where the 'simple' refers to the fact that only a single independent variable is utilised: if you have more than one, you should use multiple linear regression (where the 'multiple' is in reference to multiple independent variables!).
The tabs of this guide will support you in understanding and performing simple linear regressions using various software. The sections are organised as follows:
It is perfectly possible to do a simple linear regression test by hand; however, it is advisable to use a statistical software package instead to reduce the risk of human error, especially if the results are something people will be acting upon.
Regression tests are statistical tests used to investigate the strength and significance of the relationship between one or more independent variables and one dependent variable. In short, they are used to make predictions. The most common type of regression is linear regression, which is performed when there exists a linear relationship between the dependent variable and each independent variable.
Regression tests work best with sufficiently large sample sizes, but will still work for sample sizes on the smaller side, such as n = 30 or below. It is inadvisable, however, to use a sample size of less than 10.
Simple Linear Regression is a type of linear regression involving one continuous independent variable and one continuous dependent variable. The relationship between the two variables can be explained using a linear equation such as
y = β₀ + β₁x, or y = β₀ + β₁x + ε,
for constants β₀ (the intercept) and β₁ (the gradient) and ε (the error term). Notice that this is similar to the straight-line equation
y = mx + c.
Indeed, a linear regression test provides you with the equation of a straight line in terms of the independent and dependent variables at play, which is the line of best fit through all the data points. This is useful for showing trends, or for predicting the most likely output for future data or data you don't have, provided it lies within the range you have gathered.
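If you happen to be working in a scripting language rather than the software covered in this guide, the same idea can be sketched in a few lines of Python. This is purely illustrative: the data below is made up, and the numpy and scipy packages are assumed to be installed.

    # Illustrative sketch of a simple linear regression in Python.
    # The x and y values here are hypothetical, not from this guide's examples.
    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # dependent variable

    fit = stats.linregress(x, y)
    print("gradient:", fit.slope)        # the beta_1 coefficient
    print("intercept:", fit.intercept)   # the beta_0 coefficient

    # Predict the dependent variable at a point within the observed range
    x_new = 3.5
    print("prediction at x = 3.5:", fit.intercept + fit.slope * x_new)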
As opposed to the R value, which is the measure of correlation between the independent and dependent variable, the R² value of a simple linear regression output is the measure of the variance in the dependent variable which is explained by the independent variable, and serves as a measure of how well your model has fitted the data and scenario you have come up with. R² is a number which lies between 0 and 1, given by:

R² = (nΣxy − ΣxΣy)² / [(nΣx² − (Σx)²)(nΣy² − (Σy)²)]

for x the values of the independent variable, y the values of the dependent variable and n the number of observations in the dataset. An R² value close to 1 indicates that the independent and dependent variable have a close relationship to each other, whereas a value close to 0 indicates that the variables do not have such a close relationship. Note that the R² value has nothing to do with the significance or the p value.
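As a quick aside (again a Python sketch with numpy assumed and made-up data), the formula above can be checked against the square of the correlation coefficient between x and y, which it equals for simple linear regression:

    # Compute R-squared from the formula above and compare with the squared
    # correlation coefficient; the data is hypothetical.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
    n = len(x)

    top = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) ** 2
    bottom = (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2)
    print(top / bottom)                    # R-squared, close to 1 for this data
    print(np.corrcoef(x, y)[0, 1] ** 2)    # the same value, via the correlation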
For example, for a simple linear regression test investigating the cause-and-effect relationship between a person's height and their weight, there may be a significant relationship found, which comes from the fact that the p value was less than the predetermined significance value. However, the R² value may be rather small, meaning that the regression model was not entirely adequate in explaining what causes a change in a person's weight. What this means is that maybe another test could be done with different variables in play, such as the individuals' diet.
Since simple linear regression is a type of F test, the F statistic should be reported, along with the p value and the R² value. The beta coefficients in the straight-line equation could also be reported, or simply the equation of the line itself.
Of course, you can mention anything provided by the regression output, such as the standard error of the residuals — it is completely up to you and what you deem valuable to the conclusions you can draw.
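If it helps to see where these quantities come from programmatically, here is a rough Python sketch (the statsmodels package and some made-up data are assumed; this is not part of the Excel or SPSS workflows described later):

    # Fit a simple linear regression and pull out the commonly reported values.
    import numpy as np
    import statsmodels.api as sm

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
    y = np.array([2.0, 4.1, 6.2, 7.8, 10.1, 11.9])

    model = sm.OLS(y, sm.add_constant(x)).fit()

    print("F statistic:", model.fvalue)
    print("p-value:", model.f_pvalue)
    print("R squared:", model.rsquared)
    print("beta coefficients:", model.params)               # intercept and gradient
    print("residual standard error:", np.sqrt(model.scale))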
Like any statistical test, a regression test can only be performed when certain assumptions are met. Most likely, with real data, one or more of these assumptions will end up being violated, but there are sometimes ways this can be managed in order to still use simple linear regression.
Each of these assumptions needs to be investigated before the test itself can be performed.
The variables involved in a simple linear regression test need to be continuous (a type of quantitative data which is a measurement). Temperature in degrees Celsius, distance in kilometres, energy in joules or kilowatt hours, and intelligence in IQ points are all examples of continuous data, as they are measured, not counted. Continuous data can be graphically displayed with a histogram.
Data such as the number of flower species in a field, the number shown on a rolled die, a person's annual income and the total number of patients participating in a drug trial are all examples of discrete data, also known as 'count' data, as they are counted, not measured. Discrete data can be visually represented with a bar chart. If your independent variable is discrete as opposed to continuous, linear regression will still work, but it is not ideal.
Data such as Likert scale data, POLAR4 quintiles and the key stages of England and Wales' national curriculum, despite being commonly denoted by numbers, are categorical and should not be used in simple linear regression.
There should be a linear relationship between the independent and dependent variables. This can be checked by plotting a scatterplot and visually inspecting the linearity: it does not matter if this relationship is positive or negative, as both are valid for linear regression. Linear regression can be performed on linear data, and non-linear regression can be performed on non-linear data, so if your data is not linear you will need to find a non-linear regression test to do instead!
Your data must be independent in order to be used in linear regression. Independence in this case means that the variables do not have an association which may influence the cause-and-effect relationship between them. There are many ways to test for independence, including the Durbin-Watson test, the Chi-Squared test of independence or using a contingency table.
The residuals you have should be normally distributed, which means that their distribution is symmetric about the mean. If they were displayed as a histogram, it should show a bell-shaped curve. There are many ways to check for normality, such as using a test like Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera or Anderson-Darling, or via visual inspection of a Q-Q plot or histogram.
Your data does not need to be perfectly symmetrically distributed in order to be used in a parametric test and still produce reliable results: indeed, data with a slight skew can be treated as normal, that is, data with a skew between -0.5 and +0.5. Anything below -0.5 or above +0.5 is too skewed to be considered normal, and so a transformation should be applied to the data to make it more approximately normal.
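Both the skewness rule of thumb and a formal normality test can be run in a couple of lines of Python (scipy assumed; the residuals below are randomly generated stand-ins rather than real data):

    # Quick normality checks on a set of residuals.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    residuals = rng.normal(0.0, 1.0, size=70)   # stand-in residuals

    print("skewness:", stats.skew(residuals))   # between -0.5 and +0.5 is acceptable

    stat, p_value = stats.shapiro(residuals)    # Shapiro-Wilk test
    print("Shapiro-Wilk p-value:", p_value)     # p > 0.05: no evidence against normality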
Homoscedasticity refers to the homogeneity of variance within the dataset. Homoscedastic data is data which has constant (or at least similar) variance about the line of best fit for every value of x. In other words, if the data were plotted in a scatterplot of fitted values vs residuals, each data point would be more or less the same distance away from the line as each other. Homoscedasticity can often be identified on a scatterplot; otherwise, you can check that the ratio of the largest to the smallest variance is less than 1.5, or that the standard deviation of the residuals is more or less equal across all values of x. Alternatively, you can use a test such as Bartlett's test or the Breusch-Pagan test.
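As a sketch of how such a test might look outside of Excel or SPSS (Python with statsmodels assumed, and the data below invented for illustration):

    # Breusch-Pagan test for homoscedasticity on the residuals of a fitted
    # simple linear regression.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=70)
    y = 3.0 + 2.0 * x + rng.normal(0, 1, size=70)   # made-up homoscedastic data

    X = sm.add_constant(x)
    model = sm.OLS(y, X).fit()

    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
    print("Breusch-Pagan p-value:", lm_pvalue)   # p > 0.05 suggests constant variance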
You can use Microsoft Excel to perform a simple linear regression using the Data Analysis ToolPak add-in; enable this from File > Options > Add-ins if you have not already done so.
Let's assume we have some (fictional) data from a (fictional) company selling mobile phones over 70 months on the amount spent on advertising each month (adjusted for inflation) and the number of phones sold, and we wish to investigate the relationship between advertising spend and sales.
Both of the variables (amount spent on advertising, and sales) are continuous. We need to see evidence of linearity, which we shall do with a scatterplot. To do this in Excel, do the following:
That's it! Excel then creates a scatterplot of the variables.
We can see that there exists a linear relationship between the variables. This can be emphasised by including a line of best fit:
Let's move on to the next assumptions test.
Since we have already created the scatterplot, we may as well check for homoscedasticity. We can see on the scatterplot that the residuals have more or less equal variance, and therefore we can say that the data is homoscedastic, as desired.
The assumption of independence will be checked with the Durbin-Watson test; however, this is easiest to do when we perform the linear regression test itself, so we will leave it for later.
Therefore, we shall move on and check for normality. There are many ways to test for normality; however, for ease we shall use a histogram. This is done one variable at a time, as follows:
Here are the histograms generated for the variables "Number of Mobile Phones Sold (thousands)" and "Amount Spent on Advertising (£k)". Sturges' rule was used to set the number of bins, giving eight.
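For reference, Sturges' rule sets the number of bins to 1 + log₂(n), rounded up to the nearest whole number. With the n = 70 observations in this example, that is 1 + log₂(70) ≈ 7.13, which rounds up to 8. A one-line Python check (using only the standard math module):

    import math
    print(math.ceil(1 + math.log2(70)))   # Sturges' rule for n = 70 gives 8 bins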
We can see from the shapes of the distributions that these are approximately bell-shaped and therefore can be considered normal.
So, we have checked the assumptions and can determine that we are good to go to do a simple linear regression. We will use the Data Analysis ToolPak.
Our outputs for the Summary and Residuals then appear on a new sheet.
We have not yet done this test, so let's look at it now. We will need to calculate the Durbin-Watson test statistic d, given by

d = Σ_{t=2}^{T} (e_t − e_{t−1})² / Σ_{t=1}^{T} e_t²

for e_t the individual residuals and T the total number of observations. Since our Residual Output observations lie in cells C25 to C94, this can be performed with the Excel formula:
=SUMXMY2(C26:C94, C25:C93)/SUMSQ(C25:C94)
which will provide us with the statistic
d = 1.630955127
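The same calculation can be sketched in Python (numpy assumed; the residuals below are random stand-ins, since the actual 70 residuals from the Residual Output are not reproduced here):

    # Durbin-Watson statistic: the sum of squared successive differences of the
    # residuals divided by the sum of squared residuals, mirroring the Excel
    # formula above.
    import numpy as np

    rng = np.random.default_rng(2)
    residuals = rng.normal(0.0, 1.0, size=70)   # replace with the real residuals

    d = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
    print(d)   # on the real residuals this matches the Excel value of 1.630955127

The statsmodels package also provides a durbin_watson function which performs the same calculation on an array of residuals.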
For a significance level of .05, n = 70 observations and k = 1 independent variable, the Durbin-Watson table provides a lower bound of 1.583 and an upper bound of 1.641. Strictly speaking, a statistic falling between the two bounds is inconclusive, but our value sits close to the upper bound and comfortably within the commonly used rule-of-thumb range of 1.5 to 2.5, so we conclude that there is not enough evidence to reject the Durbin-Watson null hypothesis: the residuals show no correlation, and the independence assumption is satisfied.
How can we report this output? What do these numbers mean? You can report anything you would like in a regression test, such as the 95% Confidence Interval or the linear equation the test produces, but what you must report is the F statistic and the p-value. The p-value of course is what you need to use to compare to your significance level, in order to determine if you need to reject or fail to reject your null hypothesis. The R² value is a good idea to report as well.
The most important thing you do, however, is tie your results back to the context of your project. Why did you do a linear regression? What do the results mean? What real-life consequence do the results have on your study? Does the R² value mean that more variables need to be considered next time? These are all things to consider when writing your results up and drawing your conclusions.
It is easy to perform a simple linear regression using SPSS. Let's suppose we have a dataset consisting of 40 secondary-school students' GCSE mathematics exam scores and the average number of hours they spend browsing social media sites each day, and we wish to investigate the impact social media use has on exam scores.
Data entries can be typed directly into SPSS and be ready for use; however, if the data is saved elsewhere, such as in an Excel file, we will then need to import the dataset from where it is located. In this example, our dataset is saved in Excel, so it will need to be 'brought over' into SPSS by doing the following:
That's it! You have now inputted the data into SPSS.
Our variables are continuous, and the linear relationship between them can be shown using a scatterplot. Scatterplots can be created by:
The scatterplot will then be created in a new window named 'Output1' for you to inspect for linearity. It does not matter if the linear relationship is positive or negative, as long as it is linear! You can include a line of best fit on the scatterplot by double-clicking on the graph and selecting 'Add fit line at total'.
We can also use the scatterplot to check for homoscedasticity. As long as the distribution of points is fairly even along the line of best fit, and does not fan outward, that is good enough to conclude homoscedasticity.
The next thing to check on our list is independence. Since our variables are continuous, we shall do this via the Durbin-Watson test, which is structured using the following hypotheses:
H0 - There exists no correlation between the residuals
H1 (or HA) - Correlation exists between the residuals
This test will provide a test statistic for us to analyse – we want this number to be between 1.5 and 2.5 in order to determine no correlation. We shall do this at the same time as our linear regression, so this won't be shown here.
Let's now check for normality. This could be done with a histogram in a similar way to the scatterplot above; however, we shall use a Shapiro-Wilk test instead. This test is structured using the following hypotheses:
H0 - The data is normally distributed
H1 (or HA) - The data is not normally distributed.
We will use the p-value of this test to determine if the data is normal or not. This test is performed by:
A table labelled 'Tests of Normality' will then appear in the Output1 window underneath your scatterplot – this contains the results of the Shapiro-Wilk test – along with a histogram for you to visually inspect the distribution. In the table, ignore the Kolmogorov-Smirnov output in favour of the Shapiro-Wilk output and observe the p-value. This is what you compare to your significance value, your α, in order to decide whether to reject or fail to reject your null hypothesis. Let's take our significance level to be 0.05. The p-values shown in the table are greater than this value, which results in us failing to reject our null hypothesis, and we can determine that our variables are normally distributed.
Now that we have checked our assumptions, we can create our null and alternative hypotheses for the regression.
H0 - The number of hours a day spent on social media has no effect on a student's mathematics exam scores.
H1 (or HA) - The number of hours a day spent on social media does have an effect on a student's mathematics exam scores.
With these in mind, we can now perform the test itself.
That's it! There will be five tables generated for us to consider.
The first table is labelled 'Variables Entered/Removed' which only provides a short summary of the variables involved in the regression and is not too interesting.
The second, labelled 'Model Summary', is much more interesting and provides our R² value as well as the results of the Durbin-Watson test. Remember, this value should lie between 1.5 and 2.5 in order for our data to be independent! Our value of 2.471 lies in this range and therefore we conclude that the variables meet the independence assumption.
The third table is the 'ANOVA' results which contain the F statistic. Remember that a regression is a type of F test so the F statistic is needed to report the results of the regression test.
The fourth is the 'Coefficients' table, which contains the B coefficients as well as the p-value, which here is labelled 'Sig.'. The B coefficients are used to create the regression line equation, and the p-value is again what is compared to the significance value in order to determine whether to reject or fail to reject our null hypothesis. Our p-value is .332, which is greater than our significance level of .05, and therefore we fail to reject our null hypothesis and conclude that there is not enough evidence in this case to suggest that the number of hours spent on social media has an impact on a secondary-school student's GCSE Mathematics exam scores.
The last table is 'Residuals Statistics', which provides a summary of the residuals of the variables involved in the regression test.
So, the simple linear regression test has been performed with all the relevant steps, and the assumptions checked. BUT! This is not the last step! We need to write up the results and tie this back to our research question and the real-life implications these results hold.
So, how do we report what we have?
You can report anything you would like in a regression test, such as the 95% Confidence Interval or the linear equation the test produces, but what you must report is the F statistic and the p-value. The R² value is a good idea to report as well.
The most important thing you do, however, is tie your results back to the context of your project. Why did you do a linear regression? What do the results mean? What real-life consequence do the results have on your study? Does the R² value mean that more variables need to be considered next time? These are all things to consider when writing your results up and drawing your conclusions.