Model Evaluation — Regression — Mean Squared Error
Building a machine learning model is rewarding, but it takes a lot of dedication and hard work. Building the model is not everything, though. The models we build need proper evaluation so that we understand how they will perform on real-world data in production. That’s why there are established evaluation techniques for specific kinds of machine learning problems. We will discuss one such technique today.
Regression Problems
Regression problems are those where the target variable, or output, is continuous, for example when predicting temperature or salary. There are established model evaluation techniques for such problems. Let’s look at one of them today.
Mean Squared Error (MSE)
Let’s first understand what “error” means in a regression problem. For every row in our test data we have an actual value of the target variable, and our regression model also predicts a value for that row. The predicted value and the actual value will usually differ, and these differences are called errors. Some errors are positive and others are negative, so if we averaged them directly they could cancel each other out. To avoid that, we square each error, which makes every term non-negative. Finally, we take the mean of all the squared errors. That is the Mean Squared Error, or MSE. Its formula is given below.
MSE = (1/n) * Σ (y_i - ŷ_i)^2, where n is the number of samples, y_i is the actual value, and ŷ_i is the predicted value for sample i.
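To make the definition concrete, here is a minimal sketch in plain Python that follows the three steps (subtract, square, average). The numbers and the variable names (`actual`, `predicted`, `mse_value`) are made up for illustration and are not taken from the housing data used later.

```python
# Made-up actual and predicted values, purely for illustration
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 7.0, 2.0]

# Step 1: errors (predicted minus actual); these can be positive or negative
errors = [p - a for p, a in zip(predicted, actual)]   # [-1.0, 2.0, 0.0]

# Step 2: square each error so that every term is non-negative
squared_errors = [e ** 2 for e in errors]             # [1.0, 4.0, 0.0]

# Step 3: average the squared errors
mse_value = sum(squared_errors) / len(squared_errors)
print(mse_value)  # 5/3, roughly 1.667
```

Notice that the errors -1 and 2 would partly cancel if we averaged them directly, while their squares 1 and 4 both count toward the final value.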
Let’s understand the mean squared error with a code example.
#importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression as lr
from sklearn.metrics import mean_squared_error as mse
# let's import our dataset; we will work with the Boston housing data
df = pd.read_csv('BostonHousing.csv')
df.head()
# let's check a few details of the data, starting with missing values per column
print(f'the number of missing values in each column is as below:\n{df.isnull().sum()}')
# let's divide our data into x (features) and y (target variable)
x = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
x
y
# now we scale our features with StandardScaler so that they are all on the same scale
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x = scaler.fit_transform(x)
# now let's go ahead and do our train test split
xtrain,xtest,ytrain,ytest = tts(x,y,test_size=0.2,random_state=42)
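One caveat about the order of these two steps: when the scaler is fitted on all rows before the split, the test rows influence the scaling statistics. For ordinary least squares the effect on the predictions is usually negligible, but a common habit is to split first and fit the scaler on the training rows only. The sketch below shows one way to do that with a scikit-learn pipeline. It runs on synthetic data from `make_regression` as a stand-in for the housing data, and all the names (`X_demo`, `pipe`, `r2_demo`) are illustrative assumptions rather than part of the article's code.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only, then fits the model
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# At prediction time the test rows are scaled with the training statistics
r2_demo = pipe.score(X_te, y_te)
print(r2_demo)
```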
# now let's train our model
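# use a separate name for the model so the imported class alias lr is not overwritten
model = lr()
model.fit(xtrain,ytrain)
# score returns the R² (coefficient of determination) on the test data
model.score(xtest,ytest)
0.668759493535632
Now that our linear regression model is built, let’s store its predictions for the test data in a variable called ypred.
ypred = model.predict(xtest)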
ypred
Let us visualize our model’s predictions against the actual values.
sns.regplot(x = ytest , y = ypred )
Let’s create a list of errors by subtracting the actual values from the predicted values.
changes = list(ypred-ytest)
changes
The visualization shows that the predictions do not match the actual values exactly, and the list called ‘changes’ shows how scattered the errors are. Now let’s compute the mean squared error with the help of NumPy.
# squaring all the errors
sq_error_list = []
for i in changes:
    sq_error_list.append(np.square(i))
sq_error_list
#now let's find the mean of this squared list
mse_numpy = np.mean(sq_error_list)
print(f'the mean squared error through numpy method is {mse_numpy}')
the mean squared error through numpy method is 24.29111947497352
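The loop above spells out each step, but NumPy can also do the subtraction, squaring, and averaging in a single vectorized expression. The sketch below compares the two approaches on a few made-up values; the arrays `y_true` and `y_hat` are illustrative and are not the article's test data.

```python
import numpy as np

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_hat = np.array([2.0, 7.0, 2.0, 6.0])

# Loop version, mirroring the step-by-step code above
squared = []
for err in y_hat - y_true:
    squared.append(np.square(err))
mse_loop = np.mean(squared)

# Vectorized version: subtract, square, and average in one expression
mse_vectorized = np.mean((y_hat - y_true) ** 2)

print(mse_loop, mse_vectorized)  # both are 2.25
```

Both versions compute the same quantity; the vectorized form is shorter and tends to be faster on large arrays.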
Now let’s find out the MSE through sklearn metrics as well.
from sklearn.metrics import mean_squared_error
mse_sklearn = mean_squared_error(ytest,ypred)
print(f'the mean squared error through sklearn method is {mse_sklearn}')
the mean squared error through sklearn method is 24.29111947497352
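Both approaches give the same mean squared error, which confirms that our manual calculation matches what sklearn computes. The closer the MSE is to 0, the more accurate the model. Keep in mind that MSE is expressed in the squared units of the target variable.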
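To see how the MSE reacts to prediction quality, here is a small sketch with made-up numbers: one set of perfect predictions, one that is off by 1 everywhere, and one that is off by 5 everywhere. The lists and variable names are hypothetical and chosen only for illustration.

```python
from sklearn.metrics import mean_squared_error

y_actual = [10.0, 20.0, 30.0]   # made-up targets
perfect = [10.0, 20.0, 30.0]    # identical to the targets
close = [11.0, 19.0, 31.0]      # every prediction is off by 1
far = [15.0, 15.0, 35.0]        # every prediction is off by 5

mse_perfect = mean_squared_error(y_actual, perfect)
mse_close = mean_squared_error(y_actual, close)
mse_far = mean_squared_error(y_actual, far)

print(mse_perfect, mse_close, mse_far)  # 0.0 1.0 25.0
```

Because the errors are squared, predictions that are 5 times further off produce an MSE that is 25 times larger, so a few large errors can dominate the score.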
The code used in the article is present in my GitHub repository.