Predicting missing values with Linear Regression

Mariane Neiva — @maribneiva
Jun 2, 2021 · 3 min read

Previously (on <insert your TV show here>… kidding), I posted the article "Four ways to handle missing data with pandas."

In the end, I gave a spoiler (am I watching too much Netflix?) about the next episode of my Medium, and here it is!

Today we will learn how to predict missing values using machine learning (more precisely, linear regression).

The code is available on GitHub.

In this example, the Melbourne Housing Snapshot dataset is used. You can download it from Kaggle. Assuming you have the files, let’s start:

Checking the data

It is important to know and understand the data. Therefore, we can use the head function to check the types and columns of the dataset:

data.head()
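Before calling head, the dataset needs to be loaded into a dataframe. Here is a minimal sketch of that step; the Kaggle file name (melb_data.csv) and the inline sample columns are assumptions for illustration, not the full dataset:

```python
import io
import pandas as pd

# In practice you would load the downloaded Kaggle file, e.g.:
#   data = pd.read_csv("melb_data.csv")  # file name is an assumption
# A tiny inline sample keeps this sketch self-contained:
csv_sample = io.StringIO(
    "Suburb,Type,Price,BuildingArea\n"
    "Abbotsford,h,1480000,\n"
    "Abbotsford,h,1035000,79\n"
)
data = pd.read_csv(csv_sample)
print(data.head())    # first rows of the dataframe
print(data.dtypes)    # column types (empty fields become NaN floats)
```

Note that the empty BuildingArea field in the first row is parsed as NaN, which is exactly the kind of missing value we will predict later.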

Transforming the data

Although some machine learning methods can deal with categorical data (strings, dates, among others), this is not the case for linear regression.

Therefore, one crucial step is to transform all categorical attributes into numbers. The code below uses pandas to perform the required action:

#encode every categorical column as integer codes
for col in ['Address', 'Suburb', 'Type', 'Method', 'SellerG',
            'Date', 'CouncilArea', 'Regionname']:
    data[col] = pd.Categorical(data[col]).codes
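To see what this encoding actually does, here is a small sketch on a toy series (the values are made up for illustration): pd.Categorical sorts the distinct values and assigns each one an integer code, with missing values mapped to -1.

```python
import numpy as np
import pandas as pd

# toy series standing in for a categorical column like 'Type'
s = pd.Series(["h", "u", "h", "t", np.nan])
codes = pd.Categorical(s).codes
# categories are sorted ('h', 't', 'u'), so h->0, t->1, u->2, NaN->-1
print(list(codes))  # [0, 2, 0, 1, -1]
```

The -1 code for missing entries is worth remembering: it means categorical columns with NaNs silently get a valid-looking number after this transformation.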

Separate columns

In this example, I will use only columns that do not contain null values.

To do this automatically, we can use the isnull() function combined with mean() (sum() would work as well).

dataNull = data[data.columns[data.isnull().mean() > 0]]
dataNotNull = data[data.columns[data.isnull().mean() == 0]]
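On a toy dataframe (column names made up for illustration), this selection works as follows: isnull().mean() gives the fraction of missing values per column, so comparing it with 0 splits the columns into the two groups.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Rooms": [2, 3, 4],
                   "BuildingArea": [79.0, np.nan, 150.0]})
withNull = df[df.columns[df.isnull().mean() > 0]]   # columns with any NaN
noNull = df[df.columns[df.isnull().mean() == 0]]    # fully populated columns
print(withNull.columns.tolist())  # ['BuildingArea']
print(noNull.columns.tolist())    # ['Rooms']
```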

Training and test sets

Yes, we could use k-fold cross-validation or another validation scheme to improve confidence in the results. We could also use correlation analysis to choose the columns most related to the one we want to predict.

These additions would be great, and I suggest you keep them in mind for a future project.

However, for the sake of simplicity, I will keep it basic for this post.

Furthermore, in this example, we will predict the BuildingArea column:

#we will use only rows with non-missing target values to create the training features
x_training = dataNotNull[~data['BuildingArea'].isnull()]
#dataNull is the set of columns that contain missing values. BuildingArea is one of
#them; here we select only the rows where it has a value, giving the training target.
y_train = dataNull[~dataNull['BuildingArea'].isnull()]['BuildingArea']
#rows where BuildingArea is missing: these are the features we will predict from
x_test = dataNotNull[data['BuildingArea'].isnull()]
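The row-splitting idea can be seen in isolation on a toy dataframe (the column names here are made up for illustration): a boolean mask on the target column separates the rows used for training from the rows whose value we will predict.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Rooms": [2, 3, 4],
                   "BuildingArea": [79.0, np.nan, 150.0]})
mask = df["BuildingArea"].isnull()
features = df.loc[~mask, ["Rooms"]]    # rows where the target is known
target = df.loc[~mask, "BuildingArea"]
to_predict = df.loc[mask, ["Rooms"]]   # rows whose target is missing
print(len(features), len(to_predict))  # 2 1
```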

The final step!

Finally, we reach the point where we can use the machine learning method. If you are a beginner, notice how much time we spent on data transformation. Yeah, welcome to my life!

Like in many projects, here, we will use the sklearn.linear_model library to retrieve the model and use training set information to calibrate the model. Then, missing values from BuildingArea are computed and assigned to the original dataframe.

#importing the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
#training the model
model.fit(x_training, y_train)
#predicting the missing values and assign them to the correct place
data.loc[data.BuildingArea.isnull(), 'BuildingArea'] = model.predict(x_test)
#you can print the dataset to check if it worked
print(data)
#and check if there is any missing value in the column. Result must be 0
print(data.BuildingArea.isnull().mean())
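Putting all the pieces together, here is a self-contained end-to-end sketch on a tiny synthetic dataset (the values and the Rooms feature are made up for illustration; the real post uses the full Melbourne dataframe instead):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy dataset: BuildingArea grows linearly with Rooms, one value missing
data = pd.DataFrame({
    "Rooms": [1, 2, 3, 4, 5],
    "BuildingArea": [50.0, 100.0, np.nan, 200.0, 250.0],
})
mask = data["BuildingArea"].isnull()
model = LinearRegression()
#fit on the rows where the target is known
model.fit(data.loc[~mask, ["Rooms"]], data.loc[~mask, "BuildingArea"])
#predict the missing rows and assign them back into the dataframe
data.loc[mask, "BuildingArea"] = model.predict(data.loc[mask, ["Rooms"]])
print(data["BuildingArea"].isnull().mean())  # 0.0
```

Because the toy data lies exactly on a line, the imputed value for the missing row comes out at 150.0; on real data the predictions are only as good as the linear fit.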

DONE!

Now, it is your turn to continue the example. You can use validation and/or correlation to make it more robust. Also, you can extend the idea to other numeric columns.

Comment if you have other ideas for posts, and tell me your opinion about this example!


Mariane Neiva — @maribneiva

Woman in tech, researcher @University of Sao Paulo. Passionate about artificial intelligence, innovation, scientific communication, and programming.