This project implements a machine learning pipeline for predicting salary ranges from job postings. Using NLP techniques (TF-IDF vectorization) and ensemble learning methods, it predicts salaries from job descriptions, titles, required skills, and location data; the best model achieves an adjusted MAE of about $7,400, with roughly 71% of predictions falling within 10% of the actual salary.
The repository contains two Jupyter notebooks analyzing different job posting datasets:
- `salary_prediction_linkedin.ipynb` - Analyzes LinkedIn job postings data
- `salary_prediction_glassdoor.ipynb` - Analyzes Glassdoor job postings data
The LinkedIn dataset includes around 75,000 job listings across various industries and locations, about 21,000 of which contain clear salary information. Each posting offers a detailed job description, title, and list of requirements. This rich, diverse text data lets us examine how the specifics of a job description correlate with the salary offered. By focusing on real, detailed examples, the dataset provides a strong foundation for improving salary predictions beyond job titles and years of experience alone.
The Glassdoor dataset, collected in 2017, offers historical job market data along with detailed job descriptions. Although older, it remains relevant once salary figures are adjusted for inflation, and it adds another layer of insight into salary benchmarks and job description quality over time. Using both current and historical data helps the model capture trends and changes in the job market, making predictions more robust and reliable.
- Text Processing: Advanced TF-IDF vectorization of job descriptions and titles
- Feature Engineering:
- Extraction of years of experience requirements
- Education level detection
- Seniority classification from job titles
- Geographic region encoding and tech hub indicators
- Model Evaluation: Comprehensive metrics including adjusted MAE, within-10% accuracy, and R²
- Error Analysis: Detailed examination of extreme prediction errors
- Calibration Analysis: Bias-variance analysis across salary ranges
- Visualization: Extensive plotting of model performance and feature importance
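The TF-IDF step above can be sketched with scikit-learn. The parameter values here (`max_features`, `ngram_range`) are illustrative assumptions, not necessarily the notebooks' actual settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative settings; the notebooks may use different parameters.
vectorizer = TfidfVectorizer(
    max_features=5000,     # cap vocabulary size
    ngram_range=(1, 2),    # unigrams and bigrams
    stop_words="english",  # drop common English words
    sublinear_tf=True,     # log-scale term frequencies
)

descriptions = [
    "Senior data scientist with Python and machine learning experience",
    "Junior software engineer, Java and cloud services",
]
X = vectorizer.fit_transform(descriptions)  # sparse matrix: (n_docs, n_terms)
```

The resulting sparse matrix can be concatenated with the engineered numeric features before model training.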
- Keyword Extraction: Custom regex patterns to extract years of experience (e.g., "5+ years experience")
- Education Requirements: Pattern matching to detect education level requirements (Bachelor's, Master's, PhD)
- Seniority Classification: Rule-based classification from job titles (Junior, Senior, Manager, Director, etc.)
- Location Processing: Conversion of locations to geographic regions and tech hub indicators
- Skill Identification: Boolean features for technical skills like Python, R, AWS, etc.
- Company Age: Normalized company age extracted and used as a predictive feature
- Job Title Vectorization: Fine-tuned TF-IDF specifically for the shorter Glassdoor job titles
- Industry Mapping: Extraction and encoding of industry information from job descriptions
- Compensation Components: Analysis of different compensation types (base, bonus, stock options)
- Employment Type: Categorical encoding of full-time, contract, and part-time positions
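The employment-type encoding can be done with a simple one-hot scheme; a minimal sketch with pandas, where the column name `employment_type` is hypothetical and the notebook's actual schema may differ:

```python
import pandas as pd

# Hypothetical column name standing in for the Glassdoor schema.
df = pd.DataFrame(
    {"employment_type": ["full-time", "contract", "part-time", "full-time"]}
)

# One 0/1 column per category
encoded = pd.get_dummies(df["employment_type"], prefix="emp")
df = pd.concat([df, encoded], axis=1)
```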
| Model | Adjusted MAE ($) | Within 10% (%) | R² |
|---|---|---|---|
| XGBoost (n=100, d=6, lr=0.1) | 7,404.48 ± 1,173.61 | 70.79 ± 2.80 | 0.7244 ± 0.0654 |
| LightGBM (n=100, lvs=63, lr=0.1) | 7,512.36 ± 1,076.48 | 68.75 ± 3.42 | 0.7163 ± 0.0657 |
| LightGBM (n=100, lvs=31, lr=0.1) | 7,512.36 ± 1,076.48 | 68.75 ± 3.42 | 0.7163 ± 0.0657 |
| Ridge Regression (α=0.1) | 8,121.37 ± 955.18 | 63.45 ± 1.75 | 0.6920 ± 0.0543 |
| Random Forest (n=100, d=20) | 8,463.96 ± 1,145.45 | 51.63 ± 3.63 | 0.6857 ± 0.0665 |
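The "within 10%" metric can be computed as the fraction of predictions whose absolute error is at most 10% of the true salary. The "adjustment" in the adjusted MAE is not specified here, so this sketch shows plain MAE alongside the within-10% metric:

```python
import numpy as np

def mae(y_true, y_pred):
    """Plain mean absolute error in dollars."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true)))

def within_10_percent(y_true, y_pred):
    """Fraction of predictions within 10% of the actual salary."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true) <= 0.10 * y_true))

y_true = [100_000, 80_000, 150_000]
y_pred = [105_000, 95_000, 149_000]
mae_val = mae(y_true, y_pred)              # (5000 + 15000 + 1000) / 3 = 7000.0
win10 = within_10_percent(y_true, y_pred)  # 2 of 3 predictions within 10%
```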
LinkedIn Dataset:
- Higher accuracy for technical roles (tech industry jobs)
- Better performance in the $100K-$200K range
- Years of experience and education level features showed significant predictive power
- Tech hub indicators were highly significant predictive features
Glassdoor Dataset:
- Better performance for non-technical roles
- More accurate in the lower salary ranges ($50K-$100K)
- Company age emerged as an important feature
- Employment type (full-time vs. contract) showed high predictive value
Our best model (XGBoost) showed the following error characteristics:
- Mean absolute error: $7,404.48
- 70.79% of predictions within 10% of actual value
- Error distribution:
- Under-predictions: 70.3% of extreme errors
- Over-predictions: 29.7% of extreme errors
- Salary range error distribution:
- $50K-$100K: 24.3% of extreme errors
- $100K-$150K: 8.1% of extreme errors
- $150K-$200K: 21.6% of extreme errors
- >$200K: 45.9% of extreme errors
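The under-/over-prediction split can be reproduced with a short analysis. "Extreme" is assumed here to mean errors beyond a fixed dollar threshold; the notebooks' exact cutoff is not stated and may instead be a percentile of absolute error:

```python
import numpy as np

def extreme_error_breakdown(y_true, y_pred, threshold=30_000):
    """Split extreme errors into under- vs over-prediction shares.

    `threshold` is an assumed cutoff for "extreme"; the notebooks'
    actual definition may differ.
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    errors = y_pred - y_true
    extreme = np.abs(errors) > threshold
    if not extreme.any():
        return {"under": 0.0, "over": 0.0}
    return {
        "under": float((errors[extreme] < 0).mean()),  # predicted too low
        "over": float((errors[extreme] > 0).mean()),   # predicted too high
    }

result = extreme_error_breakdown(
    [100_000, 200_000, 50_000, 120_000],
    [60_000, 160_000, 90_000, 121_000],
)  # two extreme under-predictions, one extreme over-prediction
```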
The model shows excellent calibration in the $80K-$150K range, with increasing bias at the extremes:
- Slight over-prediction for salaries below $80K (positive bias)
- High accuracy in middle ranges ($80K-$150K)
- Under-prediction for very high salaries above $200K (negative bias)
- Implementation: Scikit-learn's LinearRegression
- Performance: Poor performance with extreme overfitting
- Limitations: Unable to capture non-linear relationships in salary data
- Implementation: Scikit-learn's Ridge with α regularization
- Hyperparameters Tested: α ∈ {0.1, 1.0, 10.0, 100.0}
- Best Configuration: α=0.1
- Performance: Good baseline with Adjusted MAE of $8,121.37
- Benefit: Effective regularization preventing overfitting
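The α sweep can be sketched with scikit-learn's GridSearchCV. The data here is synthetic; the notebooks use the engineered job-posting features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for TF-IDF + engineered features
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0, 100.0]},  # grid from the analysis above
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```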
- Implementation: Scikit-learn's RandomForestRegressor
- Hyperparameters Tested:
- n_estimators ∈ {50, 100}
- max_depth ∈ {10, 20}
- Best Configuration: n_estimators=100, max_depth=20
- Performance: Decent with Adjusted MAE of $8,463.96
- Feature Importance: Provided valuable insights on important features
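Feature importances from the fitted forest can be ranked as follows. The feature names and synthetic data are illustrative; the notebooks use the real engineered features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
feature_names = ["years_experience", "education_level", "seniority",
                 "tech_hub", "company_age"]
X = rng.normal(size=(300, 5))
# First two features drive the target, so they should rank highest
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300)

model = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=0)
model.fit(X, y)

# Importances sum to 1; sort descending to rank features
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda t: -t[1])
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```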
- Implementation: XGBoost's XGBRegressor
- Hyperparameters Tested:
- n_estimators = 100
- max_depth ∈ {3, 6}
- learning_rate ∈ {0.01, 0.1}
- Best Configuration: n_estimators=100, max_depth=6, learning_rate=0.1
- Performance: Best overall performance with Adjusted MAE of $7,404.48
- Advantages: Excellent handling of complex feature interactions
- Implementation: LightGBM's LGBMRegressor
- Hyperparameters Tested:
- n_estimators = 100
- num_leaves ∈ {31, 63}
- learning_rate ∈ {0.01, 0.1}
- Best Configuration: n_estimators=100, num_leaves=31 or 63 (both settings produced identical results), learning_rate=0.1
- Performance: Second best with Adjusted MAE of $7,512.36
- Advantages: Fast training time while maintaining accuracy
- Python 3.8+
- Jupyter Notebook
- pandas
- numpy
- scikit-learn
- xgboost
- lightgbm
- matplotlib
- seaborn
- nltk
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/salary-prediction-project.git
  cd salary-prediction-project
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Run the notebooks

  ```bash
  jupyter notebook
  ```

  Then open either `salary_prediction_linkedin.ipynb` or `salary_prediction_glassdoor.ipynb`.
The LinkedIn analysis incorporates several specialized keyword extraction techniques:
```python
import re

def extract_years_experience(text):
    """Extract years of experience requirement from job description."""
    # Range pattern first, so "3-5 years" is averaged rather than
    # matched as just "5 years" by the single-number pattern
    patterns = [
        r'(\d+)\s*-\s*(\d+)\s*years?(?:\s*of)?\s*experience',
        r'(\d+)\+?\s*years?(?:\s*of)?\s*experience',
        r'experience:\s*(\d+)\+?\s*years?',
        r'experience(?:\s*of)?\s*(\d+)\+?\s*years?',
    ]
    for pattern in patterns:
        match = re.search(pattern, text.lower())
        if match:
            # Handle ranges like "3-5 years" by averaging the endpoints
            if len(match.groups()) > 1 and match.group(2):
                return (int(match.group(1)) + int(match.group(2))) / 2
            return int(match.group(1))
    return 0  # Default to 0 if no match found

def extract_education_level(text):
    """Extract education level requirement from job description."""
    text = text.lower()
    # Education level scoring (higher number = higher education);
    # dots are escaped so "m.s." doesn't match arbitrary characters
    if re.search(r'phd|doctorate|doctoral', text):
        return 4
    elif re.search(r"master'?s|\bms degree|m\.s\.|\bmba\b|m\.b\.a", text):
        return 3
    elif re.search(r"bachelor'?s|bachelors|\bbs degree|b\.s\.|\bba degree|b\.a\.", text):
        return 2
    elif re.search(r"associate'?s|associates|community college", text):
        return 1
    else:
        return 0

def extract_seniority(title):
    """Extract seniority level from job title."""
    title = title.lower()
    # Seniority scoring (higher = more senior); \b keeps short tokens
    # like "vp" or "sr" from matching inside longer words
    if re.search(r'\b(chief|ceo|cto|cfo|coo|president|vp|vice president)\b', title):
        return 5
    elif re.search(r'\b(director|head)\b', title):
        return 4
    elif re.search(r'\b(senior|sr|lead)\b', title):
        return 3
    elif re.search(r'\b(manager|supervisor)\b', title):
        return 2
    elif re.search(r'\b(junior|jr|associate|intern|assistant)\b', title):
        return 1
    else:
        return 2  # Default to mid-level if no indicator
```
The datasets include:
- Job titles and descriptions (text data)
- Boolean features for required skills (Python, R, Spark, AWS, Excel)
- Company age
- Location data (transformed into tech hub indicators and geographic regions)
- Salary information (minimum, maximum, and average)
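The boolean skill features can be derived from the raw description text; a minimal sketch, assuming a `description` column (the notebooks' actual column names may differ). Word boundaries matter here, since a bare "r" would otherwise match inside almost every word:

```python
import re
import pandas as pd

SKILLS = ["python", "r", "spark", "aws", "excel"]

def add_skill_flags(df, text_col="description"):
    """Add one 0/1 column per skill based on word-boundary matches."""
    for skill in SKILLS:
        pattern = rf"\b{re.escape(skill)}\b"  # \b keeps "r" from matching inside words
        df[f"skill_{skill}"] = (
            df[text_col].str.contains(pattern, case=False, regex=True).astype(int)
        )
    return df

df = pd.DataFrame({"description": [
    "Requires Python and AWS experience",
    "Proficiency in Excel and R required",
]})
df = add_skill_flags(df)
```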
This project is licensed under the MIT License - see the LICENSE file for details.
- Web application deployment for real-time salary prediction
- API integration with job posting platforms
- Implementation of deep learning approaches (BERT, RoBERTa for NLP)
- Time series analysis to capture salary trends over time
- Exploration of additional features like company size and industry
- Cross-industry validation and transfer learning
- LinkedIn and Glassdoor for the job posting data
- The scikit-learn, XGBoost, and LightGBM communities for their excellent tools
- All contributors who have provided feedback and suggestions