Housing Prices (Kaggle) – Regression in Python

Here is another small project of mine, in which I tried to predict housing prices based on features such as the overall quality of the house and the year it was built. The data comes from the Kaggle competition House Prices: Advanced Regression Techniques. My goal with this project was to do some very basic data exploration, using correlation and a heatmap to select relevant features from the dataset. Finally, I used four different models (Logistic Regression, Linear Regression, Support Vector Machine and Decision Tree) to identify which one yields the lowest Root Mean Squared Logarithmic Error (RMSLE).
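For reference, RMSLE compares predictions and targets on a logarithmic scale, so it penalizes relative rather than absolute errors. A minimal NumPy version, equivalent to taking the square root of sklearn's mean_squared_log_error as done below:

import numpy as np

def rmsle(y_true, y_pred):
    # sqrt(mean((log(1 + prediction) - log(1 + truth))^2))
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))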

Step 1 Importing Libraries, Correlation Heatmap of relevant Variables
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
from sklearn import svm
from sklearn import tree
import seaborn as sns
pd.set_option("display.max_columns",40)

df = pd.read_csv("/Users/niklaskuehn/Desktop/Python and Machine Learning/kaggle Datasets/Housing Prices/train.csv")
test = pd.read_csv("/Users/niklaskuehn/Desktop/Python and Machine Learning/kaggle Datasets/Housing Prices/test.csv")

df = df[["OverallQual", "YearBuilt", "YearRemodAdd", "GrLivArea", 
         "TotalBsmtSF", "GarageCars", "GarageArea", "SalePrice"]]

corrmat = df.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corrmat, annot=True, cmap="coolwarm", fmt=".1f")
plt.show()

Strictly speaking, the heatmap is something you would produce before selecting features; otherwise, how do you know that the features you selected are any good? I am showing it here only for the features I picked for later use in the models. In the picture below you can see that SalePrice is strongly positively correlated with every one of the selected features.
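If you want to do the selection that way round, correlations against SalePrice can be ranked on the full training frame first. A minimal sketch, assuming the full train.csv is re-read into a hypothetical full_df (df above has already been reduced to eight columns):

# Sketch: rank all numeric features by their correlation with SalePrice
# before committing to a subset; "full_df" is a hypothetical fresh read.
full_df = pd.read_csv("train.csv")  # adjust the path to your setup
corr_with_price = (full_df.select_dtypes(include="number")
                          .corr()["SalePrice"]
                          .drop("SalePrice")
                          .sort_values(ascending=False))
print(corr_with_price.head(10))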

Step 2 Transform Categorical Values and Create Dummies
### The following helper bins a numeric column into buckets so that
### the models can be trained on categorical ranges instead of raw values.
### Missing entries are filled with -0.5 so they land in the "Missing" bucket.
def create_categories(df, column, cut_points, label_names):
    df[column] = df[column].fillna(-0.5)
    df[column + "_categories"] = pd.cut(df[column], cut_points, labels=label_names)
    return df

year_labels = ["Missing", "Very Old", "Old", "Average", "Rather New", "New"]
size_labels = ["Missing", "Very Small", "Small", "Average", "Rather Large", "Large"]

df = create_categories(df, "YearBuilt", [-1, 0, 1900, 1930, 1960, 1990, 2020], year_labels)
df = create_categories(df, "YearRemodAdd", [-1, 0, 1960, 1980, 1990, 2000, 2020], year_labels)
df = create_categories(df, "GrLivArea", [-1, 0, 1000, 2500, 4000, 5000, 6000], size_labels)
df = create_categories(df, "TotalBsmtSF", [-1, 0, 500, 1500, 2500, 4000, 6500], size_labels)
df = create_categories(df, "GarageArea", [-1, 0, 200, 400, 700, 1000, 1500], size_labels)
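As a quick sanity check (a sketch, not part of the original pipeline), you can look at how the rows spread across the new buckets:

# Sketch: distribution of houses over the YearBuilt buckets
print(df["YearBuilt_categories"].value_counts())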

def create_dummies(df, column_name):
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies], axis=1)
    return df

### For each of the categorical values, new columns
### with values of either 0 or 1 are generated
df = create_dummies(df, "OverallQual")
df = create_dummies(df, "YearBuilt_categories")
df = create_dummies(df, "YearRemodAdd_categories")
df = create_dummies(df, "GrLivArea_categories")
df = create_dummies(df, "TotalBsmtSF_categories")
df = create_dummies(df, "GarageArea_categories")
df = create_dummies(df, "GarageCars")
Step 3 Select relevant Columns and Implement Models
columns = ['OverallQual_1', 'OverallQual_2', 'OverallQual_3',
           'OverallQual_4', 'OverallQual_5', 'OverallQual_6', 'OverallQual_7',
           'OverallQual_8', 'OverallQual_9', 'OverallQual_10',
           'YearBuilt_categories_Missing', 'YearBuilt_categories_Very Old',
           'YearBuilt_categories_Old', 'YearBuilt_categories_Average',
           'YearBuilt_categories_Rather New', 'YearBuilt_categories_New',
           'YearRemodAdd_categories_Missing', 'YearRemodAdd_categories_Very Old',
           'YearRemodAdd_categories_Old', 'YearRemodAdd_categories_Average',
           'YearRemodAdd_categories_Rather New', 'YearRemodAdd_categories_New',
           'GrLivArea_categories_Missing', 'GrLivArea_categories_Very Small',
           'GrLivArea_categories_Small', 'GrLivArea_categories_Average',
           'GrLivArea_categories_Rather Large', 'GrLivArea_categories_Large',
           'TotalBsmtSF_categories_Missing', 'TotalBsmtSF_categories_Very Small',
           'TotalBsmtSF_categories_Small', 'TotalBsmtSF_categories_Average',
           'TotalBsmtSF_categories_Rather Large', 'TotalBsmtSF_categories_Large',
           'GarageArea_categories_Missing', 'GarageArea_categories_Very Small',
           'GarageArea_categories_Small', 'GarageArea_categories_Average',
           'GarageArea_categories_Rather Large', 'GarageArea_categories_Large',
           'GarageCars_0', 'GarageCars_1', 'GarageCars_2', 'GarageCars_3',
           'GarageCars_4']
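Since all of these names follow the prefixes set in create_dummies, the list could also be built programmatically; a small sketch:

# Sketch: select the dummy columns by prefix instead of listing them by hand
prefixes = ("OverallQual_", "YearBuilt_categories_", "YearRemodAdd_categories_",
            "GrLivArea_categories_", "TotalBsmtSF_categories_",
            "GarageArea_categories_", "GarageCars_")
columns = [c for c in df.columns if c.startswith(prefixes)]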

X = df[columns]
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

###############
# LogisticRegression
###############

# Note: LogisticRegression is a classifier, so it treats every distinct
# SalePrice in the training data as its own class; it is included here
# purely for comparison against the proper regression models.
lr = LogisticRegression(solver="lbfgs")
lr.fit(X_train, y_train)

predictionsLR = lr.predict(X_test)

print('RMSLE for Logistic Regression is: \n', np.sqrt(mean_squared_log_error(y_test, predictionsLR)))
print(20*"#")

###############
# LinearRegression
###############

ln = LinearRegression()
ln.fit(X_train, y_train)

# Note: mean_squared_log_error raises an error for negative predictions;
# with these features the linear model happens to stay non-negative here.
predictionsLN = ln.predict(X_test)

print('RMSLE for Linear Regression is: \n', np.sqrt(mean_squared_log_error(y_test, predictionsLN)))
print(20*"#")

###############
# SupportVectorMachine
###############

# Note: SVC is also a classifier (svm.SVR would be the regression
# counterpart); like LogisticRegression above, it treats each price as a class.
clf = svm.SVC(gamma='scale')
clf.fit(X_train, y_train)

predictionsCLF = clf.predict(X_test)

print('RMSLE for Support Vector Machine is: \n', np.sqrt(mean_squared_log_error(y_test, predictionsCLF)))
print(20*"#")

###############
# DecisionTree
###############

# Note: DecisionTreeClassifier is used here; tree.DecisionTreeRegressor
# would be the natural fit for predicting a continuous price.
classifier = tree.DecisionTreeClassifier(random_state=42, max_depth=2, min_samples_leaf=5)
classifier.fit(X_train, y_train)

predictionsDecisionTree = classifier.predict(X_test)

print('RMSLE for Decision Tree is: \n', np.sqrt(mean_squared_log_error(y_test, predictionsDecisionTree)))
print(20*"#")

Using the above features and metrics, these are the results for each model:

  • RMSLE Linear Regression: 0.211
  • RMSLE Logistic Regression: 0.248
  • RMSLE Support Vector Machine: 0.269
  • RMSLE Decision Tree: 0.330

Obviously, tweaking the hyperparameters of the models will lead to different results. Next steps could be to change some of the hyperparameters or the selected features, to swap the classifiers for their regression counterparts (svm.SVR, tree.DecisionTreeRegressor), or to scale some of the values instead of bucketing them, for example log-transforming the SalePrice or using the raw ground living area (GrLivArea) instead of categories.
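A minimal sketch of the log-transform idea, reusing the train/test split from above:

# Sketch: fit the linear model on log1p(SalePrice), so squared error on the
# transformed target matches the squared-log error that RMSLE measures,
# then map predictions back with expm1
ln_log = LinearRegression()
ln_log.fit(X_train, np.log1p(y_train))
predictions_log = np.expm1(ln_log.predict(X_test))
predictions_log = np.clip(predictions_log, 0, None)  # MSLE requires non-negative values
print('RMSLE for log-target Linear Regression is: \n',
      np.sqrt(mean_squared_log_error(y_test, predictions_log)))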
