This project uses a logistic regression model to predict passenger survival on the Titanic. The data is preprocessed and the model is evaluated with standard machine learning practices using scikit-learn, pandas, and Seaborn.
- The dataset used is a cleaned Excel file: `PreProccessing.Titanic.xlsx`
- Target column: `survived`
- Removed columns: `name`, `ticket` (irrelevant for modeling)
- Missing values in features are imputed (numerical: median, categorical: most frequent)
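
The loading and cleaning steps above might look like the following sketch. The helper names `split_features_target` and `load_titanic` are hypothetical, and the three-row frame is made-up stand-in data so the example runs without the Excel file. The column names are taken from this README. Imputation is not done here; it is left to the preprocessing step.

```python
import pandas as pd

TARGET = "survived"
DROPPED = ["name", "ticket"]  # removed as irrelevant for modeling


def split_features_target(df: pd.DataFrame):
    """Drop the unused columns and separate the features from the target."""
    df = df.drop(columns=DROPPED, errors="ignore")
    X = df.drop(columns=[TARGET])
    y = df[TARGET]
    return X, y


def load_titanic(path: str = "PreProccessing.Titanic.xlsx"):
    """Read the Excel file (requires openpyxl) and split it. Hypothetical helper."""
    return split_features_target(pd.read_excel(path))


# Made-up stand-in rows, for illustration only (not real passenger data).
sample = pd.DataFrame({
    "survived": [1, 0, 1],
    "name": ["A", "B", "C"],
    "ticket": ["T1", "T2", "T3"],
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "female"],
    "age": [29.0, None, 41.0],  # the missing age is kept for later imputation
    "sibsp": [0, 1, 0],
    "parch": [0, 0, 1],
    "fare": [211.3, 7.25, 26.0],
    "embarked": ["S", "S", "C"],
})

X, y = split_features_target(sample)
print(list(X.columns))
```

With the real file in place, `X, y = load_titanic()` would replace the stand-in frame.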
- Model: Logistic Regression
- Preprocessing:
  - Numerical features (`pclass`, `age`, `sibsp`, `parch`, `fare`): median imputation + scaling
  - Categorical features (`sex`, `embarked`): mode imputation + one-hot encoding
- Evaluation:
  - Accuracy
  - Confusion Matrix
  - ROC Curve & AUC
  - Classification Report
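
As a rough illustration of the evaluation metrics listed above, the sketch below computes each one with `sklearn.metrics`. The labels and probabilities are made up for the example, and the 0.5 decision threshold is an assumption; in the actual script they would come from the held-out test set and the fitted model.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # assumed 0.5 threshold

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
auc_score = roc_auc_score(y_true, y_prob)  # uses probabilities, not hard labels
report = classification_report(y_true, y_pred)

print(f"Accuracy: {acc:.2f}")
print(cm)
print(f"AUC: {auc_score:.2f}")
print(report)
```

Note that AUC is computed from the predicted probabilities, while accuracy, the confusion matrix, and the classification report use the thresholded labels.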

Install the dependencies:

    pip install pandas matplotlib seaborn scikit-learn openpyxl

Place the `PreProccessing.Titanic.xlsx` file in the project directory, then run the script:

    python titanic_logistic_regression.py

- Confusion matrix: a heatmap showing true positives, true negatives, false positives, and false negatives.
- ROC curve: plots the true positive rate against the false positive rate and reports the area under the curve (AUC).
- Classification report: precision, recall, and F1-score for both classes.
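
The two plots described above could be produced roughly as follows. This is a sketch, not the script's actual plotting code: the data is made up, the output file name `evaluation_plots.png` is hypothetical, and the `Agg` backend is chosen only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend (assumption); remove to show windows interactively
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]

fig, (ax_cm, ax_roc) = plt.subplots(1, 2, figsize=(10, 4))

# Confusion matrix as an annotated heatmap.
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax_cm)
ax_cm.set_xlabel("Predicted")
ax_cm.set_ylabel("Actual")
ax_cm.set_title("Confusion Matrix")

# ROC curve with the AUC in the legend and a chance-level diagonal.
fpr, tpr, _ = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)
ax_roc.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="gray")
ax_roc.set_xlabel("False Positive Rate")
ax_roc.set_ylabel("True Positive Rate")
ax_roc.set_title("ROC Curve")
ax_roc.legend(loc="lower right")

fig.tight_layout()
fig.savefig("evaluation_plots.png")  # hypothetical output file name
```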

- Built with `ColumnTransformer` and `Pipeline`
- Numerical and categorical data handled separately
- Improves modularity and reproducibility
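
A pipeline along these lines can be sketched as below. The README only says "scaling" and "one-hot encoding", so `StandardScaler`, `handle_unknown="ignore"`, and `max_iter=1000` are assumptions, as are the helper name `build_model` and the made-up toy rows used to exercise it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["pclass", "age", "sibsp", "parch", "fare"]
CATEGORICAL = ["sex", "embarked"]


def build_model(classifier=None):
    """Preprocessing plus classifier in one Pipeline. Hypothetical helper."""
    numeric = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),  # assumed scaler
    ])
    categorical = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    # Each column group gets its own sub-pipeline.
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    if classifier is None:
        classifier = LogisticRegression(max_iter=1000)
    return Pipeline([("preprocess", preprocess), ("classifier", classifier)])


# Made-up toy rows with missing values, for illustration only.
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 2],
    "age": [29.0, float("nan"), 41.0, 22.0, 35.0, 54.0],
    "sibsp": [0, 1, 0, 0, 1, 0],
    "parch": [0, 0, 1, 0, 0, 2],
    "fare": [211.3, 7.25, 26.0, 8.05, 53.1, 23.0],
    "sex": ["female", "male", "female", "male", "female", "male"],
    "embarked": ["S", "S", float("nan"), "Q", "C", "S"],
})
labels = [1, 0, 1, 0, 1, 0]

model = build_model().fit(toy, labels)
print(model.predict(toy))
```

Because imputation, scaling, and encoding are fitted inside the pipeline, the same transformations are applied at prediction time. Passing another estimator, for example `build_model(RandomForestClassifier())` from `sklearn.ensemble`, swaps the final step while reusing the preprocessing.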

Accuracy: 0.81

Classification Report:

                  precision    recall  f1-score   support

               0       0.84      0.88      0.86       105
               1       0.76      0.70      0.73        74

        accuracy                           0.81       179
       macro avg       0.80      0.79      0.79       179
    weighted avg       0.81      0.81      0.81       179

- Clean and readable pipeline-based preprocessing
- Visualizations: Confusion matrix and ROC curve
- Performance metrics for evaluation
- Easy to extend with other models (e.g., SVM, Random Forest)