To install them, run:
pip install -r src/requirements.txt
You can run the whole project with:
kedro run
The pipeline consists of two parts:
- Data processing pipeline
- Data science pipeline
The data processing pipeline extracts and transforms raw data from the HH resume dataset and an open vacancies API.
The data science pipeline evaluates different methods of measuring sentence similarity.
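As a rough illustration of what "sentence similarity" means here, the sketch below scores a resume against two vacancies with cosine similarity. It is not code from this project: the vectors are hypothetical toy embeddings, and the function name `cosine_similarity` is an assumption made for the example. In practice the vectors would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not real model output.
resume_vec = [1.0, 0.0, 1.0]
vacancy_close = [1.0, 0.0, 1.0]   # points in the same direction as the resume
vacancy_far = [0.0, 1.0, 0.0]     # orthogonal to the resume

print(cosine_similarity(resume_vec, vacancy_close))
print(cosine_similarity(resume_vec, vacancy_far))
```

A higher score means the two texts are treated as more similar, so the first vacancy would rank above the second.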
To run the data processing pipeline, run:
kedro run --pipeline data_engineering
To run the data science pipeline:
- Download the validation set from Google Drive
- Place it in the data/03_primary directory
- Run the data science pipeline with:
kedro run --pipeline data_science
Evaluation results: e5 https://app.clear.ml/projects/8e7a87fb96ed45a3951f29c5ed13cd65/experiments/00a1ced96ca24a358534debe15c36a7f/output/execution
The main metric was ROC-AUC; by that metric, the best model was intfloat/multilingual-e5-large
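For readers unfamiliar with the metric, here is a minimal sketch of how ROC-AUC can be computed from similarity scores and binary match labels. It is an illustration, not the project's evaluation code: the function name `roc_auc` and the sample labels and scores are assumptions made up for the example. It uses the pairwise definition of the metric: the share of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as half.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of correctly ordered (positive, negative) pairs."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical validation data: 1 = resume matches vacancy, 0 = no match.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

print(roc_auc(labels, scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```

A value of 1.0 means every matching pair scores above every non-matching pair, and 0.5 corresponds to random ordering. The pairwise loop is quadratic, so a larger validation set would call for a rank-based or library implementation.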