Hands-On Machine Learning with a Practical Approach
In this article/blog we are going to predict whether a person survived; for this we'll be using the Titanic survival dataset.
Table of contents
- Importing all the necessary libraries
- Importing the dataset
- Analyzing the first 5 entries
- Dropping all the unnecessary columns
- Checking the shape of the dataset
- Label Encoding
- Analyzing how many Missing Values are present
- Handling Missing Values
- Splitting Features and Target
- Splitting the dataset for Training and Testing parts
- Training the model using DecisionTreeClassifier
- Evaluating the model for the Training data
- Evaluating the model for the Testing data
This project covers only some basic parts of Machine Learning, and this blog is still incomplete as I have not explained the functions used, such as how and where to use accuracy_score(), train_test_split(), etc. Please feel free to follow and comment here if you want the dataset file.
Your valuable suggestions are cordially welcomed and greatly appreciated.
-Bhaaavre
Importing all the necessary libraries
import numpy as np
import pandas as pd
Importing the dataset
dataset = pd.read_csv('file_path')
Analyzing the first 5 entries
dataset.head()
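If you are working with the standard Kaggle Titanic train.csv (an assumption on my part), this is roughly what the columns should look like; a quick way to confirm is to print them:

# List every column name to confirm what we are working with.
# Assuming the standard Kaggle Titanic train.csv, this should show:
# PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
print(dataset.columns.tolist())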
Dropping all the unnecessary columns
As we can see, many columns in this dataset are not useful for prediction, and since we want to keep this model beginner friendly, we'll drop them and work with only a few columns.
dataset = dataset.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'SibSp', 'Parch'], axis=1)
Analyze the data again to check whether the columns were dropped.
dataset.head()
Checking the shape of the dataset
dataset.shape
This tells us how many rows and columns are present in our dataset.
Label Encoding
Now, as our 'Sex' column contains string values, i.e. 'male' or 'female', we have to convert it to numeric values: male = 0 and female = 1.
dataset['Sex'] = dataset['Sex'].replace({'male': 0, 'female': 1})
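If you prefer, scikit-learn's LabelEncoder does the same kind of encoding automatically. Note that it assigns codes alphabetically (female = 0, male = 1), which is the opposite of the mapping above, so treat this as an alternative sketch rather than a drop-in replacement:

from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns an integer to each unique string value.
# Here it would encode female -> 0 and male -> 1 (alphabetical order),
# which differs from the manual mapping used above.
encoder = LabelEncoder()
dataset['Sex'] = encoder.fit_transform(dataset['Sex'])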
Analyzing how many Missing Values are present
dataset.isnull().sum()
We can see that the 'Age' column contains many missing values.
Handling Missing Values
We will fill those missing values with the mean of the other values present in the column.
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
Checking again whether the missing values have been handled.
dataset.isnull().sum()
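Another common way to do this step is scikit-learn's SimpleImputer, which computes the column mean and fills the gaps for you. This is just an equivalent sketch of what fillna() already did above:

from sklearn.impute import SimpleImputer

# SimpleImputer with strategy='mean' replaces every NaN in 'Age'
# with the mean of the non-missing values, just like fillna() above.
imputer = SimpleImputer(strategy='mean')
dataset[['Age']] = imputer.fit_transform(dataset[['Age']])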
Splitting Features and Target
Here X = features (all the columns except 'Survived') and Y = target (the 'Survived' column).
X = dataset.drop(['Survived'], axis=1)
Y = dataset['Survived']
Splitting the dataset for Training and Testing parts
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
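train_test_split() shuffles the rows and splits them into two parts: test_size=0.2 keeps 20% of the rows aside for testing, and random_state=2 fixes the shuffle so the split is reproducible. You can verify the split by printing the shapes:

# 80% of the rows go to training, 20% to testing.
print(X.shape)        # full dataset
print(X_train.shape)  # ~80% of the rows
print(X_test.shape)   # ~20% of the rows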
Training the model using DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=4)
model.fit(X_train,Y_train)
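max_depth=4 limits the tree to four levels of questions, which keeps it simple and less prone to overfitting. If you are curious what the tree actually learned, scikit-learn can print its rules as text (a quick optional sketch):

from sklearn.tree import export_text

# Print the learned if/else rules of the fitted tree.
print(export_text(model, feature_names=list(X.columns)))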
Evaluating the model for the Training data
from sklearn.metrics import accuracy_score
train_prediction = model.predict(X_train)
train_accuracy = accuracy_score(Y_train, train_prediction)
print("Training data accuracy = ", train_accuracy)
Evaluating the model for the Testing data
test_prediction = model.predict(X_test)
test_accuracy = accuracy_score(Y_test, test_prediction)
print("Testing data accuracy = ", test_accuracy)