Titanic (Kaggle) – Machine Learning in Python

This was one of the very first projects I did to explore machine learning in Python. My primary goal was to get a basic understanding of how a machine learning workflow fits together: from basic data exploration and selecting reasonable variables, to encoding categorical variables such as sex, to binning passengers into age categories to make the patterns easier for the model to learn.

All the data was taken from the Titanic: Machine Learning from Disaster competition on Kaggle.

Step 1: Importing libraries and loading the train and test sets
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

test = pd.read_csv("/Users/niklaskuehn/Desktop/Python and Machine Learning/kaggle Datasets/titanic/test.csv")
train = pd.read_csv("/Users/niklaskuehn/Desktop/Python and Machine Learning/kaggle Datasets/titanic/train.csv")
Step 2: Some basic data exploration
correl = train.corr(numeric_only=True) ### numeric_only is required in newer pandas, which no longer drops non-numeric columns silently

### Create a Seaborn Heatmap that visualizes the correlation of the different variables in the dataset
plt.figure(figsize=(12,8))
sns.heatmap(correl, annot=True, cmap='coolwarm', fmt=".1f")
plt.show()
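
The heatmap gives a good overview, but for a single target it is often easier to read the correlations with Survived directly. A small addition (not in the original code) that just sorts that column of the matrix computed above:

### Sort correlations with the target: which numeric variables are
### most strongly (anti-)related to survival?
print(correl["Survived"].sort_values(ascending=False))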

### Show the distribution of Age for each passenger class
Class1 = train[ train["Pclass"] == 1]
Class2 = train[ train["Pclass"] == 2]
Class3 = train[ train["Pclass"] == 3]

Class1["Age"].plot.hist(alpha=0.5, color="red", bins=50)
Class2["Age"].plot.hist(alpha=0.5, color="blue", bins=50)
Class3["Age"].plot.hist(alpha=0.5, color="green", bins=50)
plt.show()

### Show the distribution of Age for survivors and non-survivors
Survived1 = train[train["Survived"] == 1]
Survived0 = train[train["Survived"] == 0]

Survived1["Age"].plot.hist(alpha=0.5, color="blue", bins=50)
Survived0["Age"].plot.hist(alpha=0.5, color="red", bins=50)
plt.show()
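
The same comparison can also be drawn in a single call with seaborn's histplot (a sketch, assuming seaborn >= 0.11, where histplot was introduced), which colors the groups and builds the legend automatically:

### Overlaid age histograms, split by the Survived column
sns.histplot(data=train, x="Age", hue="Survived", bins=50)
plt.show()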

Step 3: Preparing the data for machine learning, in this case logistic regression
def age_categories(df, cut_points, label_names): ### Adds one "Age_categories" column, binning Age at the given cut points (missing ages get their own bin)
    df["Age"] = df["Age"].fillna(-0.5)
    df["Age_categories"] = pd.cut(df["Age"],cut_points,labels=label_names)
    return df

cut_points = [-1,0,5,10,20,30,50,100] ### Binning Age into discrete categories makes the pattern easier for the model to learn
label_names = ["Missing", "Child", "Youth", "Teenager", "Young Adult", "Adult", "Old"]

train = age_categories(train,cut_points,label_names)
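
A quick sanity check (not in the original code, but cheap) is to count how many passengers fall into each bin:

### Count passengers per age category to verify the binning
print(train["Age_categories"].value_counts())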


def create_dummies(df, column_name): ### One-hot encodes a column: one new 0/1 column per distinct value
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies], axis=1)
    return df

train = create_dummies(train, "Age_categories")
train = create_dummies(train, "Sex")
train = create_dummies(train, "Pclass")
Step 4: Logistic regression and evaluation
lr = LogisticRegression(solver="lbfgs")

### Define columns on which the model should be trained:
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
       'Age_categories_Missing','Age_categories_Child',
       'Age_categories_Youth', 'Age_categories_Teenager',
       'Age_categories_Young Adult', 'Age_categories_Adult',
       'Age_categories_Old']

### Prepare the feature matrix and target, then split them into a
### training and a test set with sklearn's train_test_split function
all_X = train[columns]
all_y = train["Survived"]

train_X, test_X, train_y, test_y = train_test_split(all_X, all_y, test_size=0.2, random_state=0)

### Fit the Logistic Regression to the target we wish to predict
lr.fit(train_X, train_y)

### Store the predictions made on test_X in a new variable so we can
### check the accuracy using sklearn.metrics' accuracy_score
predictions = lr.predict(test_X)

### Compute the accuracy on the held-out split
accuracy = accuracy_score(test_y, predictions)
print(accuracy)

### 10-fold cross-validation gives a more robust estimate of accuracy
scores = cross_val_score(lr, all_X, all_y, cv=10)
print(np.mean(scores))
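
Because logistic regression is a linear model, the fitted coefficients also show which features push a prediction towards survival; a small inspection sketch (not part of the original code):

### Pair each feature with its learned weight; positive values push
### the prediction towards Survived = 1
for name, coef in zip(columns, lr.coef_[0]):
    print(f"{name}: {coef:.3f}")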

Using the metrics above, I got an accuracy score of about 0.796. In other words, given a passenger's age category, sex and passenger class, the model predicts whether that passenger survived the Titanic disaster with roughly 80 percent accuracy.
