# Simple Linear Regression: Maths and Stats

### Overview of Simple Linear Regression

Simple Linear Regression (SLR) is a statistical test, more specifically a type of F test, which involves one continuous independent variable and one continuous dependent variable. It can be used to investigate either the strength of the relationship between these variables or the value of the dependent variable at a particular value of the independent variable.

It is a type of linear regression (which is itself a type of regression) where the 'simple' refers to the fact that only a single independent variable is utilised: if you have more than one, you should use multiple linear regression (where the 'multiple' is in reference to multiple independent variables!).

### Guide contents

The tabs of this guide will support you in understanding and performing simple linear regressions using various software. The sections are organised as follows:

• How Simple Linear Regression Works - how SLR works and what you need to understand about it
• Assumptions - the assumptions you need to test to determine if you can do SLR
• Example in Excel - a worked example using Microsoft Excel
• Example in SPSS - a worked example using SPSS software
• Example in R - a worked example using RStudio

It is perfectly possible to do a simple linear regression test by hand; however, it is advisable to use a statistical software package to reduce the risk of human error, especially if people will be acting upon the results.

### What is Regression?

Regression tests are statistical tests used to investigate the strength and significance of the relationship between one or more independent variables and one dependent variable. In short, they are used to make predictions. The most common type of regression is linear regression, which is performed when there exists a linear relationship between the dependent variable and each independent variable.

Regression tests work well with sufficiently large sample sizes, but will still work for sample sizes on the smaller side, such as n = 30 or below. It is inadvisable, however, to use a sample size of less than 10.

### What is Simple Linear Regression?

Simple Linear Regression is a type of linear regression involving one continuous independent variable and one continuous dependent variable. The relationship between the two variables can be explained using a linear equation such as

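$$y = \beta_0 + \beta_1 x, \quad \text{or} \quad y = \beta_0 + \beta_1 x + \varepsilon$$

for constants $\beta_0$ (the intercept) and $\beta_1$ (the gradient), and $\varepsilon$ (the error term). Notice that this is similar to the straight-line equation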

y = mx + c.

Indeed, a linear regression test provides you with the equation of a straight line in terms of the independent and dependent variables at play, which is the line of best fit through all the data points. This is useful to show trends, or to predict the most likely output for future data, or for data you don't have which lies within the range you have gathered.
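To make the idea of a line of best fit concrete, here is a minimal Python sketch of the standard least-squares calculation of the intercept and gradient. The function name `fit_line` and the data are made up for illustration and do not come from the examples in this guide.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, gradient) of the least-squares line y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Gradient: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # The fitted line always passes through the point of means (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up illustrative data (an assumption, not data from this guide).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)      # intercept is about 0.05, gradient about 1.99
predicted = b0 + b1 * 3.5    # prediction for an x inside the observed range
```

Note that the prediction is made at x = 3.5, which lies within the range of the data gathered (1 to 5), in line with the advice above.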

### R², the Coefficient of Determination

As opposed to the R value, which is the measure of correlation between the independent and dependent variable, the R² value in a simple linear regression output is the proportion of the variance in the dependent variable which is explained by the independent variable, and serves as a measure of how well your model fits the data. R² is a number which lies between 0 and 1, given by:

$$R^2 = \left( \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}} \right)^2$$

for x the values of the independent variable, y the values of the dependent variable and n the number of observations in the dataset. An R² value close to 1 indicates that the independent and dependent variable have a close relationship to each other, whereas a value close to 0 indicates that the variables do not have such a close relationship. Note that R² is not itself a measure of significance: a relationship can be statistically significant (a small p value) and still have a small R².

For example, for a simple linear regression test investigating the cause-and-effect relationship between a person's height and their weight, there may be a significant relationship found, which comes from the fact that the p value was less than the predetermined significance value. However, the R² value may be rather small, meaning that the regression model was not entirely adequate in explaining what causes a change in a person's weight. What this means is that maybe another test could be done with different variables in play, such as the individuals' diet.
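As a rough illustration of how R² can be calculated, the Python sketch below computes it in two ways: as the square of the correlation coefficient, and as one minus the ratio of unexplained to total variation. For simple linear regression the two agree. The function names and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean

def r_squared_from_correlation(x, y):
    """R² as the square of Pearson's correlation coefficient."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    r = (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    return r ** 2

def r_squared_from_residuals(x, y):
    """R² as 1 - (residual sum of squares / total sum of squares)."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum(
        (a - x_bar) ** 2 for a in x
    )
    b0 = y_bar - b1 * x_bar
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - y_bar) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Made-up illustrative data (an assumption, not data from this guide).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2_a = r_squared_from_correlation(x, y)
r2_b = r_squared_from_residuals(x, y)   # agrees with r2_a, about 0.997
```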

#### What to Report With a Simple Linear Regression

Since simple linear regression is a type of F test, the F statistic should be reported, along with the p value and the R² value. The beta coefficients in the straight line equations could also be reported, or simply the equation of the line itself.
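For reference, with one independent variable and n observations, the F statistic can be written in terms of R². This is the standard relationship, shown here as a sketch rather than as the output of any particular software package:

$$F = \frac{R^2 / 1}{(1 - R^2)/(n - 2)}$$

with 1 and n − 2 degrees of freedom. For a fixed R², a larger sample gives a larger F statistic, which is why a relationship can be significant even when R² is small.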

Of course, you can mention anything provided by the regression output, such as the standard error of the residuals — it is completely up to you and what you deem valuable to the conclusions you can draw.

### Regression Assumptions

Like any statistical test, regression tests can only be performed when certain assumptions are met. Most likely with real data one or more of these assumptions will end up being violated, but there are sometimes ways this can be managed in order to still use simple linear regression.

Each of these assumptions needs to be investigated before the test itself can be performed.

#### Continuous Variables

The variables involved in a simple linear regression test need to be continuous (a type of quantitative data which is a measurement). The temperature in degrees Celsius, distance in kilometres, energy in joules or kilowatt-hours and intelligence in IQ points are all examples of continuous data as they are measured, not counted. Continuous data can be graphically displayed with a histogram.

Data such as the number of flower species in a field, the number shown on a rolled die, a person's annual income and the total number of patients participating in a drug trial are all examples of discrete data, also known as 'count' data as they are counted, not measured. Discrete data can be visually represented with a bar chart. If your independent variable is discrete as opposed to continuous, linear regression will still work, but it is not ideal.

Data such as Likert scale data, POLAR4 quintiles and the key stages of England and Wales' national curriculum, despite being commonly denoted by numbers, are categorical and should not be used in simple linear regression.

#### Linearity

There should be a linear relationship between the independent and dependent variables. This can be checked by plotting a scatterplot and visually inspecting the linearity: it does not matter if this linearity is positive or negative, as both are valid with linear regression. Linear regression can be performed on linear data, and non-linear regression can be performed on non-linear data, so if your data is not linear you will need to find a non-linear regression test to do instead!

#### Independence

Your data must be independent in order to be used in linear regression. Independence in this case means that the observations do not influence one another, so that the residuals are not correlated with each other. There are several ways to check this, including the Durbin-Watson test (which is applied to the residuals in the examples in this guide); for categorical data, the Chi-Squared test of Independence or a contingency table is used instead.

#### Normality

The residuals you have should be normally distributed, which means that their distribution is symmetric about the mean. If these were to be displayed as a histogram, it should display a bell-shaped curve. There are many other ways to check for normality, such as using a test like Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera or Anderson-Darling, or by visual inspection of a Q-Q plot or histogram.

Your data does not need to be perfectly symmetrically distributed in order to be used in a parametric test and still produce reliable results: indeed, data with a slight skew to it can be treated as normal, that is, data with a skew between -0.5 and +0.5. Anything either below -0.5 or above +0.5 is too skewed to be considered as normal, and so a transformation should be applied to the data to make it more approximately normal.
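The skew check described above could be sketched in Python as follows. This uses the simple moment-based skewness formula, which is an assumption: software such as Excel or SPSS applies a small-sample adjustment, so their values will differ slightly. The function names and data are made up for illustration.

```python
from statistics import mean

def skewness(values):
    """Moment-based sample skewness: third central moment / (second ** 1.5)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def roughly_normal(values, limit=0.5):
    """Apply the rule of thumb: a skew within -0.5 to +0.5 is treated as normal."""
    return abs(skewness(values)) <= limit

# Made-up illustrative data (an assumption, not data from this guide).
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 1, 1, 1, 10]

ok_symmetric = roughly_normal(symmetric)   # skew is 0, so within the limits
ok_skewed = roughly_normal(skewed)         # skew is 1.5, so too skewed
```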

#### Homoscedasticity

Homoscedasticity refers to the homogeneity of variance within the dataset. Homoscedastic data is data which has constant (or at least similar) variance along the line of best fit for every value of x. In other words, if the residuals were plotted against the fitted values in a scatterplot, the points would be spread more or less evenly about the line, without fanning out. Homoscedasticity can be easily identified on a scatterplot, but otherwise you can see it in the ratio of the largest to the smallest variance being less than 1.5, or by seeing if the standard deviation is more or less equal for all points. Otherwise, you can use a test such as Bartlett's test or the Breusch-Pagan test.
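One possible reading of the variance-ratio check is sketched below in Python: the residuals, ordered by fitted value, are split into consecutive groups, and the largest group variance is divided by the smallest. The grouping into three parts, the function name and the data are all assumptions made for illustration; this is not a formal test.

```python
from statistics import pvariance

def variance_ratio(residuals, groups=3):
    """Largest / smallest residual variance across consecutive groups.

    `residuals` must already be ordered by fitted value (or by x).
    """
    size = len(residuals) // groups
    chunks = [residuals[i * size:(i + 1) * size] for i in range(groups - 1)]
    chunks.append(residuals[(groups - 1) * size:])  # last group takes the rest
    variances = [pvariance(chunk) for chunk in chunks]
    return max(variances) / min(variances)

# Made-up residuals (assumptions): an even spread, and one that fans outward.
even_spread = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
fanning_out = [1.0, -1.0, 2.0, -2.0, 4.0, -4.0]

ratio_even = variance_ratio(even_spread)   # 1.0, comfortably below 1.5
ratio_fan = variance_ratio(fanning_out)    # 16.0, far above 1.5
```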

### Example in Excel

You can use Microsoft Excel to perform a simple linear regression using the Analysis ToolPak add-in. This comes with Excel but is not switched on by default: if you have not already enabled it, on Windows you can do so via File, then Options, then Add-ins.

Let's assume we have some (fictional) data from a (fictional) company selling mobile phones over 71 months on the amount they spent on advertising in a month (adjusting for inflation) and the number of phones sold, and we wish to investigate the relationship between the advertising amount and sales.

#### Checking the Assumptions

Both of the variables (amount spent on advertising, and sales) are continuous. We need to see evidence of linearity, which we shall do on a scatterplot. To do this in Excel, do the following:

• Highlight both columns in the spreadsheet containing the variables we need.
• Go to the Insert tab at the top of the page and click the Scatterplot button in the 'Charts' group. This is the button with an x- and y-axis and dots in the space.
• Select the first scatterplot in the drop-down menu, named 'Scatter'.

That's it! Excel then creates a scatterplot of the variables.

We can see that there exists a linear relationship between the variables. This can be emphasised by including a line of best fit:

• Right-click any of the data points plotted on the scatterplot and select 'Add Trendline'.
• On the panel to the right which appears, select 'Linear'.
• Click the X button at the top of the panel to close it again.

Let's move on to the next assumptions test.

While we have created the scatterplot, we may as well check for homoscedasticity. We can see on the scatterplot that the residuals have more or less equal variance and therefore we can say that the data is homoscedastic as desired.

The assumption of independence will be checked with the Durbin-Watson test; however, this is easiest to do when we perform the linear regression test itself, so we will leave this for later.

Therefore, we shall move on and now check for normality. There are many tests to choose from to test for normality; however, for ease we shall use a histogram. This is done one variable at a time, as follows:

• Highlight the relevant data.
• Go to the Insert tab at the top and select 'Insert Statistic Chart', which is the button that conveniently looks like a histogram.
• Select the first option.
• Visually compare the distribution to the classic bell-shaped curve - if it is more or less bell-shaped, you can determine that the distribution is normal.

Here are the histograms generated for the variables "Number of Mobile Phones Sold (thousands)" and "Amount Spent on Advertising (£k)". Sturges' Rule was implemented to give the number of bins to be eight.

We can see from the shapes of the distributions that these are approximately bell-shaped and therefore can be considered normal.
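Sturges' Rule, mentioned above, sets the number of bins to 1 + log2(n). The short Python sketch below shows the calculation; rounding up to the next whole number is an assumption here (some sources round to the nearest integer instead), chosen because it reproduces the eight bins used above for a dataset of about 70 observations.

```python
from math import ceil, log2

def sturges_bins(n):
    """Number of histogram bins by Sturges' Rule, rounded up."""
    return ceil(1 + log2(n))

bins = sturges_bins(70)   # 1 + log2(70) is about 7.13, which rounds up to 8
```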

#### Performing the Test

So, we have checked the assumptions and can determine that we are good to go to do a simple linear regression. We will use the Data Analysis ToolPak.

• Go to the Data tab at the top and click Data Analysis. A new window will open.
• Scroll down and select Regression, then click OK. A new window will open.
• The 'Input Y Range' box needs to be filled with the range of data for the dependent variable, and the 'Input X Range' box needs to be filled with that for the independent variable.
• Make sure that the Labels and Residuals boxes are checked.
• If you wish, you can make sure that the Confidence Level box is checked. You can set this percentage to be whatever you want, but the most common is 95%.
• Click OK.

Our outputs for the Summary and Residuals then appear on a new sheet.

#### Durbin-Watson Test

We have not yet done this test, so let's look at it now. We will need to calculate the Durbin-Watson test statistic d, given by

$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}$$

for $e_t$ the individual residuals and T the total number of observations. Since our Residual Output observations lie in cells C25 to C94, this can be performed with the Excel formula:

`=SUMXMY2(C26:C94, C25:C93)/SUMSQ(C25:C94)`

which will provide us with the statistic

d = 1.630955127

For a significance level of .05, n = 70 observations and k = 1 independent variable, the Durbin-Watson table provides a lower bound of 1.583 and an upper bound of 1.641. Since our statistic lies between these bounds, the test is strictly inconclusive: there is not enough evidence to reject the Durbin-Watson null hypothesis of no correlation between the residuals, so we proceed on the basis that the independence assumption is met.
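The same calculation as the Excel formula above can be sketched in Python. The numerator sums the squared differences between each residual and the one before it (what `SUMXMY2` does with the two offset ranges), and the denominator sums the squared residuals (`SUMSQ`). The function name and the residuals below are made up for illustration and are not the residuals from this example.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order."""
    # Sum of squared differences between consecutive residuals (t = 2..T).
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    # Sum of squared residuals (t = 1..T).
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residuals (assumptions) showing the two extremes of the statistic.
alternating = [1.0, -1.0, 1.0, -1.0]   # strong negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]        # strong positive autocorrelation

d_alternating = durbin_watson(alternating)   # 3.0
d_constant = durbin_watson(constant)         # 0.0
```

The statistic always lies between 0 and 4, with values near 2 indicating little or no correlation between consecutive residuals.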

#### Reporting the Results

How can we report this output? What do these numbers mean? You can report anything you would like in a regression test, such as the 95% Confidence Interval or the linear equation the test produces, but what you must report is the F statistic and the p-value. The p-value of course is what you need to use to compare to your significance level, in order to determine if you need to reject or fail to reject your null hypothesis. The R² value is a good idea to report as well.

The most important thing to do, however, is tie your results back to the context of your project. Why did you do a linear regression? What do the results mean? What real-life consequence do the results have on your study? Does the R² value mean that more variables need to be considered next time? These are all things to consider when writing your results up and drawing your conclusions.

### Example in SPSS

It is easy to perform a simple linear regression using SPSS. Let's suppose we have a dataset consisting of 40 secondary-school students' GCSE mathematics exam scores and the average number of hours spent browsing social media sites a day, and we wish to investigate the impact social media use has on exam scores.

#### Optional: Inputting Excel Data into SPSS

Data entries can be written directly into SPSS and be ready for use; however, if the data is saved elsewhere, such as in an Excel file, we will need to import the dataset from where it is located. In this example, our dataset is saved in Excel, so this will need to be 'brought over' into SPSS by doing the following:

• Click File, then Import Data, and then Excel.
• Select Excel in the 'Files of Type' drop-down menu.
• Choose the relevant Excel file to import.
• Make sure that the tick-box 'Read Variable Names' is selected.
• Click OK.

That's it! You have now inputted the data into SPSS.

#### Checking the Assumptions

Our variables are continuous, and the linear relationship between them can be shown using a scatterplot. Scatterplots can be created by:

• Click the Graphs tab at the top of the page and select Chart Builder.
• In the Gallery pane in the bottom half of the pop-up window, select Scatter/Dot and click and drag the first picture into the blank space at the top half of the pop-up.
• Click and drag your variables to display in the scatterplot into the x- and y-axis boxes respectively.
• Click OK.

The scatterplot will then be created in a new window named 'Output1' for you to inspect for linearity. It does not matter if the linear relationship is positive or negative, as long as it is linear! You can include a line of best fit on the scatterplot by double-clicking on the graph and selecting 'Add fit line at total'.

We can also use the scatterplot to check for homoscedasticity. As long as the distribution is fairly even along the line of best fit, and does not fan outward, that is good enough to conclude homoscedasticity.

The next thing to check on our list is independence. Since our variables are continuous, we shall do this via the Durbin-Watson test, which is structured using the following hypotheses:

H0 - There exists no correlation between the residuals

H1 (or HA) - Correlation exists between the residuals

This test will provide a test statistic for us to analyse – we want this number to be between 1.5 and 2.5 in order to determine no correlation. We shall do this at the same time as our linear regression, so this won't be shown here.

Let's check now for normality. This can be done with a histogram, created in a similar way to the scatterplot; however, we shall use a Shapiro-Wilk test for this. This test is structured using the following hypotheses:

H0 - The data is normally distributed

H1 (or HA) - The data is not normally distributed.

We will use the p-value of this test to determine if the data is normal or not. This test is performed by:

• Click the Analyze tab at the top of the page and select Descriptive Statistics and then Explore.
• In the Explore pop-up window, click and drag your variables into the Dependent List box.
• Click the Plots button on the right hand side of the pop-up.
• Make sure that the tick boxes 'Histogram' and 'Normality plots with tests' are selected and click Continue. That window will then disappear.
• Click OK.

A table labelled 'Tests of Normality' will then appear in the Output1 window underneath your scatterplot – this contains the results of the Shapiro-Wilk test – along with a histogram for you to visually inspect the distribution. In the table, ignore the Kolmogorov-Smirnov output in favour of the Shapiro-Wilk output and observe the p-value. This is what you compare to your significance value, your α, in order to decide if you reject or fail to reject your null hypothesis. Let's take our significance level to be 0.05. Our p-values shown in the table are greater than this value, so we fail to reject our null hypothesis and can determine that our variables are normally distributed.

#### Performing the Test

Now that we have checked our assumptions, we can create our null and alternative hypotheses for the regression.

H0 - The number of hours a day spent on social media has no effect on a student's mathematics exam scores.

H1 (or HA) - The number of hours a day spent on social media does have an effect on a student's mathematics exam scores.

With these in mind, we can now perform the test itself.

• Click the Analyze tab at the top of the page and select Regression and then Linear.
• In the Linear Regression pop-up window, click and drag your dependent variable into the Dependent box and your independent variable into the Independent(s) box as appropriate.
• Click the Statistics button on the right hand side of the pop-up.
• Make sure that the tick boxes 'Confidence Interval' and 'Durbin-Watson' are ticked in the new pop-up window. The Durbin-Watson is what we will use for our independence check!
• Click Continue to make the pop-up disappear.
• Click OK.

That's it! There will be five tables generated for us to consider.

The first table is labelled 'Variables Entered/Removed' which only provides a short summary of the variables involved in the regression and is not too interesting.

The second, labelled 'Model Summary', is much more interesting and provides our R² value as well as the results of the Durbin-Watson test. Remember, this value should lie between 1.5 and 2.5 in order for our data to be independent! Our value of 2.471 lies in this range and therefore we conclude that the variables meet the independence assumption.

The third table is the 'ANOVA' results which contain the F statistic. Remember that a regression is a type of F test so the F statistic is needed to report the results of the regression test.

The fourth is the 'Coefficients' table which contains the B coefficients as well as the p-value, which here is labelled as 'Sig'. The B coefficients are what is used to create the regression line equation, and the p-value is again what is compared to the significance value in order to determine whether to reject or fail to reject our null hypothesis. Our p-value is .332, which is greater than our significance level of .05, and therefore we fail to reject our null hypothesis, and conclude that there is not enough evidence in this case to suggest that the number of hours spent on social media has an impact on a secondary school student's GCSE Mathematics exam scores.

The last table is 'Residuals Statistics', which provides a summary of the residuals of the variables involved in the regression test.
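To show how the 'ANOVA' and 'Coefficients' tables relate to each other, the Python sketch below calculates the gradient, its standard error, its t statistic and the F statistic by hand. With a single independent variable, the F statistic equals the square of the gradient's t statistic, which is why both tables lead to the same p-value. The function name and the data are made up for illustration and are not the social media dataset used above.

```python
from math import sqrt
from statistics import mean

def slope_test(x, y):
    """Return the gradient, its standard error, its t statistic and the F statistic."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Residual and total sums of squares.
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - y_bar) ** 2 for b in y)
    mse = ss_res / (n - 2)          # residual mean square, n - 2 degrees of freedom
    se_b1 = sqrt(mse / s_xx)        # standard error of the gradient
    t = b1 / se_b1
    f = (ss_tot - ss_res) / mse     # regression mean square (1 df) / residual mean square
    return {"b1": b1, "se_b1": se_b1, "t": t, "F": f}

# Made-up illustrative data (an assumption, not data from this guide).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

result = slope_test(x, y)   # result["F"] matches result["t"] ** 2
```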

#### Reporting the Results

So, the simple linear regression test has been performed with all the relevant steps, and the assumptions checked. BUT! This is not the last step! We need to write up the results and tie this back to our research question and the real-life implications these results hold.

So, how do we report what we have?

You can report anything you would like in a regression test, such as the 95% Confidence Interval or the linear equation the test produces, but what you must report is the F statistic and the p-value. The R² value is a good idea to report as well.

The most important thing to do, however, is tie your results back to the context of your project. Why did you do a linear regression? What do the results mean? What real-life consequence do the results have on your study? Does the R² value mean that more variables need to be considered next time? These are all things to consider when writing your results up and drawing your conclusions.