This project contains a Python script for preprocessing the original Titanic dataset (`titanic_original.csv`). The goal is to clean and prepare the data for further analysis or machine learning tasks.
- Input file: `titanic_original.csv`
- Output file: `titanic_cleaned.xlsx` (cleaned version saved in Excel format)
- Drop Unnecessary Columns: the following columns were removed because they contain many missing values or are not relevant for modeling:
  - `cabin`
  - `boat`
  - `body`
  - `home.dest`
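As a rough illustration of the column-dropping step, the sketch below removes the four listed columns with pandas. The function name `drop_unneeded_columns` and the use of `errors="ignore"` are assumptions made for this example, not necessarily what the script does.

```python
import pandas as pd

# Column names taken from the list above.
COLUMNS_TO_DROP = ["cabin", "boat", "body", "home.dest"]


def drop_unneeded_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the listed columns.

    errors="ignore" (an assumption) keeps the call from failing
    if a column is already absent.
    """
    return df.drop(columns=COLUMNS_TO_DROP, errors="ignore")


# Tiny made-up sample, for illustration only.
sample = pd.DataFrame(
    {
        "name": ["A", "B"],
        "cabin": ["C85", None],
        "boat": [None, "2"],
        "body": [None, None],
        "home.dest": ["X", None],
    }
)
cleaned = drop_unneeded_columns(sample)
print(list(cleaned.columns))
```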
- Handle Missing Values in Age
  - Missing values in the `age` column were filled using the mean age.
  - All age values were rounded to the nearest integer.
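The age handling could look roughly like the following sketch. The function name `fill_and_round_age` and the final cast to `int` are assumptions for illustration; the description above only states that values were mean-filled and rounded.

```python
import pandas as pd


def fill_and_round_age(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing ages with the column mean, then round to whole numbers."""
    out = df.copy()
    mean_age = out["age"].mean()  # NaN values are skipped by default
    out["age"] = out["age"].fillna(mean_age).round()
    # Casting to int is an assumption; it is safe here because no NaN remains.
    out["age"] = out["age"].astype(int)
    return out


# Made-up sample values, for illustration only.
sample = pd.DataFrame({"age": [22.0, None, 30.4, 40.0]})
print(fill_and_round_age(sample)["age"].tolist())
```

Note that pandas rounds exact halves to the nearest even number, so an age of 30.5 would become 30 rather than 31.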
- Fix Missing Embarked Values
  - Missing values in the `embarked` column were filled with `'S'`, the most frequent port of embarkation.
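A minimal sketch of the embarked fix is shown below. The function name `fill_embarked` is hypothetical, and the sample data is made up; the check with `mode()` only illustrates how the most frequent value could be confirmed before hard-coding `'S'`.

```python
import pandas as pd


def fill_embarked(df: pd.DataFrame, default_port: str = "S") -> pd.DataFrame:
    """Fill missing embarkation ports with a fixed default value."""
    out = df.copy()
    out["embarked"] = out["embarked"].fillna(default_port)
    return out


# Made-up sample, for illustration only.
sample = pd.DataFrame({"embarked": ["S", "C", None, "Q", "S"]})

# mode() ignores missing values and returns the most frequent entries.
most_frequent = sample["embarked"].mode()[0]
filled = fill_embarked(sample)
print(most_frequent, filled["embarked"].tolist())
```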
- Correct Fare Values
  - Zero or negative `fare` values were replaced with the mean of the positive fares only.
  - Negative fare values were clamped to `0`.
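One way to read the two fare rules together is sketched below: negative fares are first clamped to 0, and then every zero fare is replaced with the mean of the strictly positive fares. This ordering is an assumption, as is the function name `correct_fare`; the actual script may apply the rules differently.

```python
import pandas as pd


def correct_fare(df: pd.DataFrame) -> pd.DataFrame:
    """Clamp negative fares to 0, then replace zeros with the positive-fare mean."""
    out = df.copy()
    # Step 1 (assumed order): clamp negative values to 0.
    out["fare"] = out["fare"].clip(lower=0)
    # Step 2: compute the mean over strictly positive fares only.
    positive_mean = out.loc[out["fare"] > 0, "fare"].mean()
    # Step 3: replace the remaining non-positive fares with that mean.
    out.loc[out["fare"] <= 0, "fare"] = positive_mean
    return out


# Made-up sample values, for illustration only.
sample = pd.DataFrame({"fare": [10.0, 0.0, -5.0, 30.0]})
print(correct_fare(sample)["fare"].tolist())
```

Missing fares are left untouched by this sketch, since comparisons with NaN evaluate to false.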
- Remove Duplicates
  - Duplicate records were identified and removed from the dataset.
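Duplicate removal can be sketched with pandas as follows. The function name `remove_duplicates` and the index reset are assumptions for this example; it treats two rows as duplicates only when every column matches.

```python
import pandas as pd


def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop fully identical rows, keeping the first occurrence."""
    # Resetting the index is an assumption; it keeps row labels contiguous.
    return df.drop_duplicates().reset_index(drop=True)


# Made-up sample with one repeated row, for illustration only.
sample = pd.DataFrame(
    {
        "name": ["A", "B", "A"],
        "age": [22, 30, 22],
    }
)
duplicate_count = int(sample.duplicated().sum())  # rows that repeat an earlier row
deduplicated = remove_duplicates(sample)
print(duplicate_count, len(deduplicated))
```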
- Cleaned dataset saved as: `titanic_cleaned.xlsx`
- Format: Excel (uses the `openpyxl` engine)
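Writing the result to Excel might look like the sketch below. The function name `save_cleaned` and the choice of `index=False` are assumptions; only the output file name and the `openpyxl` engine come from the description above.

```python
import pandas as pd


def save_cleaned(df: pd.DataFrame, path: str = "titanic_cleaned.xlsx") -> None:
    """Write the cleaned DataFrame to an Excel file using openpyxl."""
    # index=False (an assumption) leaves the pandas row index out of the sheet.
    df.to_excel(path, index=False, engine="openpyxl")


if __name__ == "__main__":
    # Made-up sample, for illustration only.
    save_cleaned(pd.DataFrame({"name": ["A", "B"], "age": [22, 30]}))
```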
pip install pandas openpyxl
Ensure the `titanic_original.csv` file is in the same directory as the script, then run:
python titanic_cleaning.py
After execution, `titanic_cleaned.xlsx` will be generated in the same directory.
- The script is designed to be a lightweight and simple preprocessor for Titanic data.
- It’s easily extensible for further cleaning or feature engineering.