This repo contains the code and fine-tuning data for the paper "Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness". The paper has been accepted for publication at ECML-PKDD 2024. A preprint can be found on arXiv.
- Clone the repository and `cd` into the cloned directory.
- Create and activate a new virtual or miniconda Python environment with Python 3.11, e.g. for miniconda:

  ```bash
  conda create -n pointer-pre-training python=3.11
  conda activate pointer-pre-training
  ```

- Install the package via

  ```bash
  pip install .
  ```
- Download the latest Wikipedia XML dump in the language of your choice from https://dumps.wikimedia.org/backup-index.html.
- Extract all Wikipedia article names from the dump by navigating to the data-processing directory

  ```bash
  cd scripts/data_processing
  ```

  and running

  ```bash
  python get_all_wiki_article_names.py --dump <path-to-wiki-dump> --out <path-to-output-dir>
  ```

- Create the HTML Wikipedia corpus via the official API by running

  ```bash
  python get_wiki_html_via_api.py --dump <path-to-wiki-dump> --out <path-to-output-dir> --lang <language-eg-en-or-de>
  ```

- Parse the created HTML dump to extract the text and create the final pre-training corpus by running

  ```bash
  python wiki_parser.py --html-dump <path-to-html-dump> --out <path-to-output-file.jsonl.gz>
  ```
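To sanity-check the generated corpus file, you can peek at its first few records. This is a minimal sketch assuming the `.jsonl.gz` output contains one JSON object per line (one article per line); the exact record schema is repo-specific, so adapt as needed:

```python
import gzip
import json


def peek_corpus(path: str, n: int = 3) -> list[dict]:
    """Return the first n records of a gzipped JSON-lines corpus."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i >= n:
                break
            records.append(json.loads(line))
    return records
```

For example, `peek_corpus("wiki_corpus.jsonl.gz")` lets you quickly verify that the parser produced well-formed JSON lines before starting a long pre-training run.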
For pre-training, navigate to the `scripts` directory (`cd scripts`), use `run_pipeline.py`, and provide the pre-training config `pretrain_cfg.yml`.
All pre-training configurations and hyper-parameter setups are handled by the configuration file.
To train on GPUs, use

```bash
python run_pipeline.py --cuda-ids 0 1 2 --config pretrain_cfg.yml
```
The configurations for pre-training can be found in the paper.
Note that we cannot share the pre-training datasets, so you have to create your own pre-training data.
In `llm/data/data_iterators.py` you can see how our data iterators are implemented and how new datasets are registered (dict at the end of the file). It is important that your iterator yields a list of samples per article/document (`list[dict]`), where each sample is a dict with the required key `raw_segments` (`list[str]`).
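As an illustrative sketch of that contract (the gzipped JSON-lines input format and the field name `paragraphs` are assumptions for this example, not part of the repo), a minimal iterator could look like this:

```python
import gzip
import json
from collections.abc import Iterator


def my_corpus_iterator(path: str) -> Iterator[list[dict]]:
    """Yield one list of samples per article/document.

    Each sample is a dict with the required key "raw_segments"
    (a list of paragraph strings). Hypothetical sketch: adapt the
    reading and grouping logic to your own corpus format.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:  # assumption: one JSON article per line
            article = json.loads(line)
            # assumption: paragraphs are stored under the key "paragraphs"
            yield [{"raw_segments": article["paragraphs"]}]
```

After writing your iterator, register it in the dict at the end of `llm/data/data_iterators.py` so the pipeline can find your dataset by name.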
We provide all our pre-trained models (see paper), our custom tokenizer, and the open-source fine-tuning datasets via the following Google Drive link. Make sure to download the data and copy it to the `data` directory in the cloned repo.
To fine-tune a pre-trained model on a downstream task, navigate to the `scripts` directory and run

```bash
python run_pipeline.py --cuda-ids 0 1 2 --config pubmed_20k_cfg.yml
```

which trains a model on the PubMed 20k dataset and utilizes 3 GPUs if available and necessary.
The `pretrain_cfg.yml` and `cs_abstract_cfg.yml` files are commented, and their structure is identical to that of the remaining config files. The main script `run_pipeline.py` uses FluidML to run the pipeline efficiently using multiprocessing. Consult FluidML's README for more information on how to configure the pipeline and how to run grid searches.
If you use this code or the pre-trained models in your research, please cite the following paper:
```bibtex
@inproceedings{hillebrand2024,
  title={Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness},
  author={Hillebrand, Lars and Pradhan, Prabhupad and Bauckhage, Christian and Sifa, Rafet},
  booktitle={Proc. ECMLPKDD},
  year={2024}
}
```