What is Legommenders?
Legommenders is a content-based recommendation library designed for the era of large language models.
- Clone the Repo:

  ```bash
  gh repo clone Jyonn/Legommenders
  cd Legommenders
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Ensure you have Python 3.10+ and a properly set up PyTorch environment: an Nvidia GPU, Apple MPS, or even a CPU device (`--cuda -1`; see the device-selection sketch after this list).
- Prepare the Configurations (Optional):

  To run the data preprocessing scripts and generate the datasets yourself, download the raw dataset and add its path to the `.data` file:

  ```bash
  touch .data
  echo -e "\n <name> = /path/to/data" >> .data
  # e.g., echo -e "\n mind = /path/to/mind" >> .data
  ```

  Legommenders will link `<name>` to a `class <Name>Processor` defined in `processor/*.py`. You can also define HuggingFace language models in the `.model` file:

  ```bash
  echo -e "\n <name> = huggingface/path" >> .model
  # e.g., llama3.1 = meta-llama/Llama-3.1-8B-Instruct-evals
  ```
- Run the Project: Use command-line tools to preprocess data, train models, and evaluate performance:

  ```bash
  # e.g., python process.py --data mind
  python process.py --data <name>

  # e.g., python trainer.py --data config/data/mind.yaml --model config/model/naml.yaml --batch_size 64 --lr 0.001 --hidden_size 256
  python trainer.py --data config/data/<name>.yaml --model config/model/<model>.yaml --batch_size <batch-size> --lr <learning-rate> --hidden_size <hidden-size>
  ```
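As noted in the install step, Legommenders can fall back to CPU via `--cuda -1`. Below is a minimal sketch of how such a flag typically maps to a PyTorch device; `select_device` is a hypothetical helper for illustration, not the library's actual flag handling:

```python
import torch

def select_device(cuda_index: int) -> torch.device:
    """Hypothetical helper: map a --cuda style flag to a torch device."""
    if cuda_index < 0:
        return torch.device("cpu")  # --cuda -1 forces CPU
    if torch.cuda.is_available():
        return torch.device(f"cuda:{cuda_index}")  # Nvidia GPU
    if torch.backends.mps.is_available():
        return torch.device("mps")  # Apple Silicon
    return torch.device("cpu")  # graceful fallback
```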
Legommenders supports 15+ datasets across domains like news, books, movies, music, fashion, and e-commerce. The supported datasets can be categorized into three groups:
- Native: Legommenders provides native dataset processing scripts, e.g., `processor/mind_processor.py`.
- Bridge: Other repositories (e.g., RecBench) process the dataset into their own format, and Legommenders provides a bridge to convert it into our format, e.g., `processor/recbench_processor.py`. Such datasets make cross-repository models easy to evaluate.
- Community: Users can design processors to convert unsupported datasets into the Legommenders format (a hedged sketch follows the table below).

\* A single dataset can be supported through multiple channels.
Dataset | Version | Name | Domain | Support | Comment |
---|---|---|---|---|---|
MIND | small | mind | News | ✅ Native | View Processor |
MIND | large | mindlarge | News | ❌ | ❌ Call for Contribution |
MIND | small | mindrb | News | ✅ Bridge | View Processor |
MIND | small | oncemind | News | ✅ Native | Used for ONCE paper. View Processor |
xMIND | small | xmind | News | ✅ Native | View Processor |
PENS | N/A | pensrb | News | ✅ Bridge | View Processor |
Adressa | 1week | adressa | News | ❌ | ❌ Call for Contribution |
Adressa | 10week | adressalarge | News | ❌ | In Norwegian. ❌ Call for Contribution |
EB-NeRD | N/A | ebnerdrb | News | ✅ Bridge | View Processor |
Goodreads | N/A | goodreadsrb | Book | ✅ Bridge | View Processor |
MovieLens | Unknown | movielensrb | Movie | ✅ Bridge | View Processor |
MicroLens | N/A | microlensrb | Movie | ✅ Bridge | View Processor |
Netflix Prize | N/A | netflixrb | Movie | ✅ Bridge | View Processor |
LastFM | N/A | lastfmrb | Music | ✅ Bridge | View Processor |
HotelRec | N/A | hotelrecrb | Hotel | ✅ Bridge | View Processor |
Yelp | N/A | yelprb | Restaurant | ✅ Bridge | View Processor |
H&M | N/A | hmrb | Fashion | ✅ Bridge | View Processor |
POG | N/A | pogrb | Fashion | ✅ Bridge | View Processor |
Amazon | books | booksrb | Book | ✅ Bridge | View Processor |
Amazon | automotive | automotiverb | Automotive | ✅ Bridge | View Processor |
Amazon | cds | cdsrb | Music | ✅ Bridge | View Processor |
Amazon | games | games | Game | ✅ Community | Processor Coming Soon! |
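For the Community channel, a new processor can be sketched roughly as below. This is a hedged illustration under assumed names: `MyDatasetProcessor` and its three loader methods are inventions for exposition; the actual base class and hooks live in `processor/*.py`.

```python
import pandas as pd

class MyDatasetProcessor:
    """Illustrative community processor: convert a raw dataset into the
    three unified tables Legommenders expects (items, users, interactions).
    Method names here are assumptions; see processor/*.py for the real API."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir

    def load_items(self) -> pd.DataFrame:
        # one row per item: item id plus its textual content fields
        return pd.read_csv(f"{self.data_dir}/items.csv")

    def load_users(self) -> pd.DataFrame:
        # one row per user: user id plus a chronological history of item ids
        return pd.read_csv(f"{self.data_dir}/users.csv")

    def load_interactions(self) -> pd.DataFrame:
        # one row per impression: user id, item id, click label, data split
        return pd.read_csv(f"{self.data_dir}/interactions.csv")
```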
Datasets can be processed into Legommenders format using built-in scripts based on RecBench. You can directly download the data from here.
Legommenders is built with a modular, layered architecture (see the sketch after this list):
- Multimodal Dataset Processor: Converts raw data into three unified tables (item content, user history, interactions) using UniTok.
- Content Operator: Encodes item content using static embeddings (e.g., GloVe) or deep models (e.g., CNN, BERT, GPT). 15+ content modules are supported.
- Behavior Operator: Encodes the user's behavior history using methods such as attention, RNNs, and Transformers. 8+ options are available.
- Click Predictor: Predicts user-item interactions via dot product, MLP, DeepFM, DCN, etc.
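Conceptually, these layers compose as in the minimal PyTorch sketch below. Every class and argument name here is a hypothetical stand-in for illustration; the actual operator interfaces live in the Legommenders codebase.

```python
import torch
import torch.nn as nn

class Recommender(nn.Module):
    """Hypothetical composition of the layers above; the real operator
    interfaces are defined inside the Legommenders source."""

    def __init__(self, content_op: nn.Module, behavior_op: nn.Module, predictor: nn.Module):
        super().__init__()
        self.content_op = content_op    # item content tokens -> item embedding
        self.behavior_op = behavior_op  # sequence of item embeddings -> user embedding
        self.predictor = predictor      # (user embedding, item embedding) -> click score

    def forward(self, history_tokens: torch.Tensor, candidate_tokens: torch.Tensor) -> torch.Tensor:
        # history_tokens: (batch, history_len, seq_len); candidate_tokens: (batch, seq_len)
        batch, hist_len, seq_len = history_tokens.shape
        item_embeds = self.content_op(history_tokens.view(-1, seq_len))
        item_embeds = item_embeds.view(batch, hist_len, -1)
        user_embed = self.behavior_op(item_embeds)           # aggregate behavior history
        candidate_embed = self.content_op(candidate_tokens)  # encode the candidate item
        return self.predictor(user_embed, candidate_embed)   # e.g., dot product
```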
The following models can be realized by Legommenders:
Model | Type | Config | Item Op | User Op | Predictor |
---|---|---|---|---|---|
NAML (2019) | Recall | config/model/naml.yaml | CNN | Additive Attention | Dot |
NRMS (2019) | Recall | config/model/nrms.yaml | Attention | Attention | Dot |
LSTUR (2019) | Recall | config/model/lstur.yaml | CNN | GRU | Dot |
PLM-NR (2021) | Recall | config/model/bert-naml.yaml | BERT | Additive Attention | Dot |
Fastformer (2023) | Recall | config/model/fastformer.yaml | Fastformer | Fastformer | Dot |
MINER (2022) | Recall | config/model/bert-miner.yaml | BERT | PolyAttention | Dot |
ONCE (2024) | Recall | config/model/llama-naml.yaml | Llama1 | Additive Attention | Dot |
IISAN (2024) | Recall | config/model/llama-iisan-naml.yaml | Llama1 | Additive Attention | Dot |
PNN (2016) | Ranking | config/model/pnn_id.yaml | N/A | Pooling | PNN |
DeepFM (2017) | Ranking | config/model/deepfm_id.yaml | N/A | Pooling | DeepFM |
DCN (2017) | Ranking | config/model/dcn_id.yaml | N/A | Pooling | DCN |
DIN (2017) | Ranking | config/model/din_id.yaml | N/A | N/A | DIN |
AutoInt (2018) | Ranking | config/model/autoint_id.yaml | N/A | Pooling | AutoInt |
DCNv2 (2020) | Ranking | config/model/dcnv2_id.yaml | N/A | Pooling | DCNv2 |
MaskNet (2021) | Ranking | config/model/masknet_id.yaml | N/A | Pooling | MaskNet |
GDCN (2023) | Ranking | config/model/gdcn_id.yaml | N/A | Pooling | GDCN |
FinalMLP (2023) | Ranking | config/model/finalmlp_id.yaml | N/A | Pooling | FinalMLP |
- Data Preprocessing:

  ```bash
  python process.py --data mind
  ```

- Embedding Setup (e.g., BERT):

  ```bash
  python embed.py --model bertbase
  ```

- Train a Model:

  ```bash
  python trainer.py \
      --data config/data/mind.yaml \
      --model config/model/naml.yaml \
      --hidden_size 256 \
      --lr 0.001 \
      --batch_size 64 \
      --item_page_size 0 \
      --embed config/embed/glove.yaml
  ```
The default evaluation metric on the validation set is GAUC. You can specify other metrics, such as MRR or NDCG, by adding `--metrics mrr` or `--metrics ndcg@10` to the command. All supported metrics are listed below.
- Evaluate:

  After training, the Trainer automatically evaluates the model on the test dataset. You can also use the `tester.py` script to load saved models and evaluate them.
By default, the evaluation metrics on the test dataset include:
- GAUC
- MRR
- NDCG@1
- NDCG@5
- NDCG@10
We also support the following evaluation metrics:
- LogLoss
- AUC
- LRAP
- F1@threshold
- HitRatio@k
- Recall@k
You can add new evaluation metrics in `utils/metrics.py` (see the sketch below).
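As a rough illustration of what adding a metric might look like, here is a hedged sketch. The class shape, the `calculate` hook, and the `group` attribute are assumptions for exposition; check `utils/metrics.py` for the actual interface:

```python
import numpy as np

# Hypothetical shape of a metric class; the real base class and
# registration mechanism are defined in utils/metrics.py.
class Precision:
    name = 'precision'
    group = True  # computed per user/impression group, like GAUC

    def __init__(self, k: int = 10):
        self.k = k

    def calculate(self, scores: np.ndarray, labels: np.ndarray) -> float:
        # rank candidates by score and measure precision among the top-k
        top_k = np.argsort(scores)[::-1][:self.k]
        return float(np.mean(labels[top_k]))
```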
Note that `MRR` is not the same as the original metric. To get the original MRR, use `MRR0` instead (highly recommended).
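For reference, the standard definition that `MRR0` is meant to reproduce is

$$\mathrm{MRR} = \frac{1}{|\mathcal{Q}|} \sum_{i=1}^{|\mathcal{Q}|} \frac{1}{\mathrm{rank}_i},$$

where $\mathrm{rank}_i$ is the position of the first relevant item in the ranking for the $i$-th impression.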
Training with frozen GloVe embeddings as the language model:

```bash
python trainer.py \
    --data config/recbench/mind.yaml \
    --model config/model/bert-naml.yaml \
    --hidden_size 256 \
    --lr 0.001 \
    --batch_size 64 \
    --lm glove \
    --embed config/embed/glove.yaml
```
Fine-tuning a BERT-based model with LoRA (generate the embedding yaml first by running `python embed.py --model bertbase`):

```bash
python trainer.py \
    --data config/recbench/mind.yaml \
    --model config/model/bert-naml.yaml \
    --hidden_size 256 \
    --lr 0.0001 \
    --batch_size 64 \
    --lm bertbase \
    --embed config/embed/bert.yaml \
    --item_page_size <item-page-size> \
    --use_lora true \
    --lora_r 8 \
    --lora_alpha 128 \
    --tune_from -2
```

Set `--item_page_size` as large as possible based on your GPU memory. `--tune_from -2` freezes the first N-1 layers and tunes only the last one; for the 12-layer BERT-base, it is the same as `--tune_from 10`.
Fine-tuning a Llama-based model with LoRA (generate the embedding yaml first by running `python embed.py --model llama1`). For more powerful language models, we suggest using the data concatenated with natural prompts:

```bash
python trainer.py \
    --data config/data/mind-lm-prompt.yaml \
    --model config/model/llama-naml.yaml \
    --hidden_size 256 \
    --lr 0.0001 \
    --batch_size 64 \
    --item_page_size 64 \
    --embed config/embed/llama.yaml \
    --use_lora 1 \
    --lora_r 32 \
    --lora_alpha 128 \
    --lm llama1 \
    --llama 1 \
    --tune_from -2
```

Here `--tune_from -2` freezes the first N-1 layers and tunes only the last one, which for the 32-layer Llama-1 is the same as `--tune_from 30`.
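The negative `--tune_from` indexing can be pictured with a short sketch. This mirrors Python's negative indexing and is an illustrative assumption, not the library's actual implementation:

```python
import torch.nn as nn

def freeze_until(layers: nn.ModuleList, tune_from: int) -> None:
    """Illustrative sketch: freeze layers [0, tune_from], tune the rest.

    A negative tune_from counts from the end: for a 12-layer encoder,
    tune_from=-2 resolves to 10, so layers 0..10 are frozen and only
    layer 11 (the last) remains trainable.
    """
    if tune_from < 0:
        tune_from = len(layers) + tune_from
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = index > tune_from
```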
- 2025-07-14: Code comments are available!
- 2025-04-10: New LLM Adaptor: IISAN is supported.
- 2025-02-18: Legommenders v2.0 is released, with support for multiple LLMs, simplified configuration, more CTR predictors, and RecBench-based datasets!
- 2025-01-06: Legommenders v2.0 beta is released!
- 2024-12-05: The LSTUR model is re-added to the Legommenders package; it had been incompatible since Jan. 2024.
- 2024-01-23: Legommenders partially supports flattened sequential recommendation models. New models are added, including MaskNet and GDCN.
- 2023-10-16: We cleaned up the code and renamed the item-side parameters.
- 2023-10-05: Legommenders, the first recommender system package with a modular design, is released!
- 2022-10-22: The Legommenders project is initiated.
If you find Legommenders useful in your research, please consider citing our project:
```bibtex
@inproceedings{legommenders,
  title = {Legommenders: A Comprehensive Content-Based Recommendation Library with LLM Support},
  author = {Liu, Qijiong and Fan, Lu and Wu, Xiao-Ming},
  booktitle = {Proceedings of the ACM Web Conference 2025},
  month = {may},
  year = {2025},
  address = {Sydney, Australia},
}
```
Thank you for your interest in Legommenders! Feel free to raise issues or contribute 🙏. Happy Recommending!
We would like to thank Jieming Zhu and the FuxiCTR project for providing multiple useful CTR predictors.
We would like to thank the transformers library for providing the pre-trained language models.
We would like to thank UniTok V4 for providing the unified data tokenization service.
We would like to thank RecBench for providing a unified recommendation dataset preprocessing framework.
We would like to thank Oba, RefConfig, and SmartDict for providing useful tools for our project.