What is Legommenders?
Legommenders is a content-based recommendation library designed for the era of large language models.
- Clone the Repo:

  ```bash
  gh repo clone Jyonn/Legommenders
  cd Legommenders
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Ensure you have Python 3.10+ and a properly set up PyTorch environment: an Nvidia GPU, Apple MPS, or even a CPU device (`--cuda -1`; see the device-selection sketch after this list).
- Prepare the Configurations (Optional):

  To run the data preprocessing scripts and generate the datasets yourself, download the raw dataset and add its path to the `.data` file:

  ```bash
  touch .data
  echo -e "\n <name> = /path/to/data" >> .data
  # e.g., echo -e "\n mind = /path/to/mind" >> .data
  ```

  Legommenders will link `<name>` to a `class <Name>Processor` defined in `processor/*.py`. You can also define HuggingFace language models in the `.model` file:

  ```bash
  echo -e "\n <name> = huggingface/path" >> .model
  # e.g., llama3.1 = meta-llama/Llama-3.1-8B-Instruct-evals
  ```
- Run the Project: Use command-line tools to preprocess data, train models, and evaluate performance:

  ```bash
  # e.g., python process.py --data mind
  python process.py --data <name>

  # e.g., python trainer.py --data config/data/mind.yaml --model config/model/naml.yaml --batch_size 64 --lr 0.001 --hidden_size 256
  python trainer.py --data config/data/<name>.yaml --model config/model/<model>.yaml --batch_size <batch-size> --lr <learning-rate> --hidden_size <hidden-size>
  ```
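As noted in the install step, Legommenders can fall back to CPU via `--cuda -1`. Below is a minimal sketch of how such a flag typically maps to a PyTorch device; `select_device` is a hypothetical helper for illustration, not the library's actual flag handling:

```python
import torch

def select_device(cuda_index: int) -> torch.device:
    """Hypothetical helper: map a --cuda style flag to a torch device."""
    if cuda_index < 0:
        return torch.device("cpu")  # --cuda -1 forces CPU
    if torch.cuda.is_available():
        return torch.device(f"cuda:{cuda_index}")  # Nvidia GPU
    if torch.backends.mps.is_available():
        return torch.device("mps")  # Apple Silicon
    return torch.device("cpu")  # graceful fallback
```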
Legommenders supports 15+ datasets across domains like news, books, movies, music, fashion, and e-commerce. The supported datasets can be categorized into three groups:
- Native: Legommenders provides native dataset processing scripts, e.g., `processor/mind_processor.py`.
- Bridge: Other repositories (e.g., RecBench) process the dataset into their own format, and Legommenders provides a bridge to convert it into our format, e.g., `processor/recbench_processor.py`. Such datasets make cross-repository models easy to evaluate.
- Community: Users can design processors to convert unsupported datasets into the Legommenders format (a hedged sketch follows the table below).

\* A single dataset can be supported through multiple channels.
Dataset | Version | Name | Domain | Support | Comment |
---|---|---|---|---|---|
MIND | small | mind | News | ✅ Native | View Processor |
MIND | large | mindlarge | News | ❌ | ❌ Call for Contribution |
MIND | small | mindrb | News | ✅ Bridge | View Processor |
MIND | small | oncemind | News | ✅ Native | Used for ONCE paper. View Processor |
xMIND | small | xmind | News | ✅ Native | View Processor |
PENS | N/A | pensrb | News | ✅ Bridge | View Processor |
Adressa | 1week | adressa | News | ❌ | ❌ Call for Contribution |
Adressa | 10week | adressalarge | News | ❌ | In Norwegian. ❌ Call for Contribution |
EB-NeRD | N/A | ebnerdrb | News | ✅ Bridge | View Processor |
Goodreads | N/A | goodreadsrb | Book | ✅ Bridge | View Processor |
MovieLens | Unknown | movielensrb | Movie | ✅ Bridge | View Processor |
MicroLens | N/A | microlensrb | Movie | ✅ Bridge | View Processor |
Netflix Prize | N/A | netflixrb | Movie | ✅ Bridge | View Processor |
LastFM | N/A | lastfmrb | Music | ✅ Bridge | View Processor |
HotelRec | N/A | hotelrecrb | Hotel | ✅ Bridge | View Processor |
Yelp | N/A | yelprb | Restaurant | ✅ Bridge | View Processor |
H&M | N/A | hmrb | Fashion | ✅ Bridge | View Processor |
POG | N/A | pogrb | Fashion | ✅ Bridge | View Processor |
Amazon | books | booksrb | Book | ✅ Bridge | View Processor |
Amazon | automotive | automotiverb | Automotive | ✅ Bridge | View Processor |
Amazon | cds | cdsrb | Music | ✅ Bridge | View Processor |
Amazon | games | games | Game | ✅ Community | Processor Coming Soon! |
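For the Community channel, a new processor can be sketched roughly as below. This is a hedged illustration under assumed names: `MyDatasetProcessor` and its three loader methods are inventions for exposition; the actual base class and hooks live in `processor/*.py`.

```python
import pandas as pd

class MyDatasetProcessor:
    """Illustrative community processor: convert a raw dataset into the
    three unified tables Legommenders expects (items, users, interactions).
    Method names here are assumptions; see processor/*.py for the real API."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir

    def load_items(self) -> pd.DataFrame:
        # one row per item: item id plus its textual content fields
        return pd.read_csv(f"{self.data_dir}/items.csv")

    def load_users(self) -> pd.DataFrame:
        # one row per user: user id plus a chronological history of item ids
        return pd.read_csv(f"{self.data_dir}/users.csv")

    def load_interactions(self) -> pd.DataFrame:
        # one row per impression: user id, item id, click label, data split
        return pd.read_csv(f"{self.data_dir}/interactions.csv")
```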
Datasets can be processed into Legommenders format using built-in scripts based on RecBench. You can directly download the data from here.
Legommenders is built with a modular, layered architecture (see the sketch after this list):
- Multimodal Dataset Processor: Converts raw data into three unified tables (item content, user history, interactions) using UniTok.
- Content Operator: Encodes item content using static embeddings (e.g., GloVe) or deep models (e.g., CNN, BERT, GPT). 15+ content modules are supported.
- Behavior Operator: Encodes the user's behavior history using methods such as attention, RNNs, and Transformers. 8+ options are available.
- Click Predictor: Predicts user-item interactions via dot product, MLP, DeepFM, DCN, etc.
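Conceptually, these layers compose as in the minimal PyTorch sketch below. Every class and argument name here is a hypothetical stand-in for illustration; the actual operator interfaces live in the Legommenders codebase.

```python
import torch
import torch.nn as nn

class Recommender(nn.Module):
    """Hypothetical composition of the layers above; the real operator
    interfaces are defined inside the Legommenders source."""

    def __init__(self, content_op: nn.Module, behavior_op: nn.Module, predictor: nn.Module):
        super().__init__()
        self.content_op = content_op    # item content tokens -> item embedding
        self.behavior_op = behavior_op  # sequence of item embeddings -> user embedding
        self.predictor = predictor      # (user embedding, item embedding) -> click score

    def forward(self, history_tokens: torch.Tensor, candidate_tokens: torch.Tensor) -> torch.Tensor:
        # history_tokens: (batch, history_len, seq_len); candidate_tokens: (batch, seq_len)
        batch, hist_len, seq_len = history_tokens.shape
        item_embeds = self.content_op(history_tokens.view(-1, seq_len))
        item_embeds = item_embeds.view(batch, hist_len, -1)
        user_embed = self.behavior_op(item_embeds)           # aggregate behavior history
        candidate_embed = self.content_op(candidate_tokens)  # encode the candidate item
        return self.predictor(user_embed, candidate_embed)   # e.g., dot product
```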
The following models can be realized by Legommenders:
Model | Type | Config | Item Op | User Op | Predictor |
---|---|---|---|---|---|
NAML (2019) | Recall | config/model/naml.yaml | CNN | Additive Attention | Dot |
NRMS (2019) | Recall | config/model/nrms.yaml | Attention | Attention | Dot |
LSTUR (2019) | Recall | config/model/lstur.yaml | CNN | GRU | Dot |
PLM-NR (2021) | Recall | config/model/bert-naml.yaml | BERT | Additive Attention | Dot |
Fastformer (2023) | Recall | config/model/fastformer.yaml | Fastformer | Fastformer | Dot |
MINER (2022) | Recall | config/model/bert-miner.yaml | BERT | PolyAttention | Dot |
ONCE (2024) | Recall | config/model/llama-naml.yaml | Llama1 | Additive Attention | Dot |
IISAN (2024) | Recall | config/model/llama-iisan-naml.yaml | Llama1 | Additive Attention | Dot |
PNN (2016) | Ranking | config/model/pnn_id.yaml | N/A | Pooling | PNN |
DeepFM (2017) | Ranking | config/model/deepfm_id.yaml | N/A | Pooling | DeepFM |
DCN (2017) | Ranking | config/model/dcn_id.yaml | N/A | Pooling | DCN |
DIN (2017) | Ranking | config/model/din_id.yaml | N/A | N/A | DIN |
AutoInt (2018) | Ranking | config/model/autoint_id.yaml | N/A | Pooling | AutoInt |
DCNv2 (2020) | Ranking | config/model/dcnv2_id.yaml | N/A | Pooling | DCNv2 |
MaskNet (2021) | Ranking | config/model/masknet_id.yaml | N/A | Pooling | MaskNet |
GDCN (2023) | Ranking | config/model/gdcn_id.yaml | N/A | Pooling | GDCN |
FinalMLP (2023) | Ranking | config/model/finalmlp_id.yaml | N/A | Pooling | FinalMLP |
- Data Preprocessing:

  ```bash
  python process.py --data mind
  ```

- Embedding Setup (e.g., BERT):

  ```bash
  python embed.py --model bertbase
  ```

- Train a Model:

  ```bash
  python trainer.py \
      --data config/data/mind.yaml \
      --model config/model/naml.yaml \
      --hidden_size 256 \
      --lr 0.001 \
      --batch_size 64 \
      --item_page_size 0 \
      --embed config/embed/glove.yaml
  ```
The default evaluation metric on the validation set is GAUC. You can specify other metrics, such as MRR or NDCG, by adding `--metrics mrr` or `--metrics ndcg@10` to the command. All supported metrics are listed below.
- Evaluate:

  After training, the Trainer automatically evaluates the model on the test dataset. You can also use the `tester.py` script to load saved models and evaluate them.
By default, the evaluation metrics on the test dataset include:
- GAUC
- MRR
- NDCG@1
- NDCG@5
- NDCG@10
We also support the following evaluation metrics:
- LogLoss
- AUC
- LRAP
- F1@threshold
- HitRatio@k
- Recall@k
You can add new evaluation metrics in `utils/metrics.py` (see the sketch below).
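As a rough illustration of what adding a metric might look like, here is a hedged sketch. The class shape, the `calculate` hook, and the `group` attribute are assumptions for exposition; check `utils/metrics.py` for the actual interface:

```python
import numpy as np

# Hypothetical shape of a metric class; the real base class and
# registration mechanism are defined in utils/metrics.py.
class Precision:
    name = 'precision'
    group = True  # computed per user/impression group, like GAUC

    def __init__(self, k: int = 10):
        self.k = k

    def calculate(self, scores: np.ndarray, labels: np.ndarray) -> float:
        # rank candidates by score and measure precision among the top-k
        top_k = np.argsort(scores)[::-1][:self.k]
        return float(np.mean(labels[top_k]))
```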
Note that `MRR` is not the same as the original metric. To get the original MRR, use `MRR0` instead (highly recommended).
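For reference, the standard definition that `MRR0` is meant to reproduce is

$$\mathrm{MRR} = \frac{1}{|\mathcal{Q}|} \sum_{i=1}^{|\mathcal{Q}|} \frac{1}{\mathrm{rank}_i},$$

where $\mathrm{rank}_i$ is the position of the first relevant item in the ranking for the $i$-th impression.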
Training with frozen GloVe embeddings as the language model:

```bash
python trainer.py \
    --data config/recbench/mind.yaml \
    --model config/model/bert-naml.yaml \
    --hidden_size 256 \
    --lr 0.001 \
    --batch_size 64 \
    --lm glove \
    --embed config/embed/glove.yaml
```
Fine-tuning a BERT-based model with LoRA (generate the embedding yaml first by running `python embed.py --model bertbase`):

```bash
python trainer.py \
    --data config/recbench/mind.yaml \
    --model config/model/bert-naml.yaml \
    --hidden_size 256 \
    --lr 0.0001 \
    --batch_size 64 \
    --lm bertbase \
    --embed config/embed/bert.yaml \
    --item_page_size <item-page-size> \
    --use_lora true \
    --lora_r 8 \
    --lora_alpha 128 \
    --tune_from -2
```

Set `--item_page_size` as large as possible based on your GPU memory. `--tune_from -2` freezes the first N-1 layers and tunes only the last one; for the 12-layer BERT-base, it is the same as `--tune_from 10`.
Fine-tuning a Llama-based model with LoRA (generate the embedding yaml first by running `python embed.py --model llama1`). For more powerful language models, we suggest using the data concatenated with natural prompts:

```bash
python trainer.py \
    --data config/data/mind-lm-prompt.yaml \
    --model config/model/llama-naml.yaml \
    --hidden_size 256 \
    --lr 0.0001 \
    --batch_size 64 \
    --item_page_size 64 \
    --embed config/embed/llama.yaml \
    --use_lora 1 \
    --lora_r 32 \
    --lora_alpha 128 \
    --lm llama1 \
    --llama 1 \
    --tune_from -2
```

Here `--tune_from -2` freezes the first N-1 layers and tunes only the last one, which for the 32-layer Llama-1 is the same as `--tune_from 30`.
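The negative `--tune_from` indexing can be pictured with a short sketch. This mirrors Python's negative indexing and is an illustrative assumption, not the library's actual implementation:

```python
import torch.nn as nn

def freeze_until(layers: nn.ModuleList, tune_from: int) -> None:
    """Illustrative sketch: freeze layers [0, tune_from], tune the rest.

    A negative tune_from counts from the end: for a 12-layer encoder,
    tune_from=-2 resolves to 10, so layers 0..10 are frozen and only
    layer 11 (the last) remains trainable.
    """
    if tune_from < 0:
        tune_from = len(layers) + tune_from
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = index > tune_from
```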
- 2025-07-14: Code comments are available!
- 2025-04-10: New LLM Adaptor: IISAN is supported.
- 2025-02-18: Legommenders v2.0 is released, with support for multiple LLMs, simplified configuration, more CTR predictors, and RecBench-based datasets!
- 2025-01-06: Legommenders v2.0 beta is released!
- 2024-12-05: The LSTUR model is re-added to the Legommenders package; it had been incompatible since Jan. 2024.
- 2024-01-23: Legommenders partially supports flattened sequential recommendation models. New models are added, including MaskNet and GDCN.
- 2023-10-16: We cleaned up the code and renamed the item-side parameters.
- 2023-10-05: Legommenders, the first recommender system package with a modular design, is released!
- 2022-10-22: The Legommenders project is initiated.
If you find Legommenders useful in your research, please consider citing our project:
```bibtex
@inproceedings{legommenders,
  title = {Legommenders: A Comprehensive Content-Based Recommendation Library with LLM Support},
  author = {Liu, Qijiong and Fan, Lu and Wu, Xiao-Ming},
  booktitle = {Proceedings of the ACM Web Conference 2025},
  month = {may},
  year = {2025},
  address = {Sydney, Australia},
}
```
Thank you for your interest in Legommenders! Feel free to raise issues or contribute 🙏. Happy Recommending!
We would like to thank Jieming Zhu and the FuxiCTR project for providing multiple useful CTR predictors.
We would like to thank the transformers library for providing the pre-trained language models.
We would like to thank UniTok V4 for providing the unified data tokenization service.
We would like to thank RecBench for providing a unified recommendation dataset preprocessing framework.
We would like to thank Oba, RefConfig, and SmartDict for providing useful tools for our project.