Self-training constructs supervisory signals from the model itself, drawing on internal features, output-layer probability distributions, and other model-derived data. By selecting the most informative training samples and designing better-suited training objectives around these signals, it can substantially improve reasoning capability through training alone, without external labels.
This repository organizes recent research papers on LLM self-training, particularly in the following areas:
- Data Selection Methods - Intelligent selection of high-quality training data
- Internal Logits Utilization - Leveraging model internal representations
- Entropy Distribution Analysis - Information entropy-based optimization strategies (see the sketch after this list)
- Self-Improvement Mechanisms - Unsupervised model enhancement techniques
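To make the logits and entropy ideas above concrete, here is a minimal, framework-agnostic Python sketch (not taken from any specific paper below) that scores prompts by the mean token-level entropy of the model's output distributions and keeps the most uncertain ones for further training. The `generate_with_logits` helper is a hypothetical stand-in for whatever inference API returns per-token probability distributions.

```python
import math

def token_entropy(prob_dist):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in prob_dist if p > 0)

def mean_output_entropy(prob_dists):
    """Average per-token entropy over a generated sequence."""
    return sum(token_entropy(d) for d in prob_dists) / max(len(prob_dists), 1)

def select_uncertain_samples(prompts, generate_with_logits, keep_ratio=0.3):
    """Keep the prompts whose generations have the highest mean entropy.

    `generate_with_logits(prompt)` is assumed to return a list of per-token
    probability distributions (one list of floats per decoding step).
    """
    scored = [(mean_output_entropy(generate_with_logits(p)), p) for p in prompts]
    scored.sort(reverse=True)  # most uncertain prompts first
    k = max(1, int(len(scored) * keep_ratio))
    return [p for _, p in scored[:k]]
```

Individual papers differ in whether they keep high-entropy (unknown) or low-entropy (confident) samples; the selection direction and threshold are design choices, not fixed rules.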
Self-training Large Language Models through Knowledge Detection
- Authors: Wei Jie Yeo, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, Erik Cambria
- Venue: Findings of EMNLP 2024
- Core Contribution: Proposes a self-training paradigm in which the LLM autonomously curates labels and selectively trains on unknown data samples identified through a reference-free consistency method (a rough sketch follows this entry)
- Paper: arXiv:2406.11275 | ACL Anthology
- Code: TBA
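As a rough illustration of the reference-free consistency idea above (a sketch under assumptions, not the authors' implementation), the snippet below samples several answers per question and flags a question as "unknown" when the sampled answers disagree too much; only flagged samples are kept for self-training. The `sample_answers` helper and the simple string normalization are assumptions.

```python
from collections import Counter

def is_unknown(question, sample_answers, n_samples=8, agreement_threshold=0.75):
    """Flag a question as 'unknown' when the model's sampled answers disagree.

    `sample_answers(question, n)` is a hypothetical helper that samples n
    answers from the model at a non-zero temperature.
    """
    answers = [a.strip().lower() for a in sample_answers(question, n_samples)]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers) < agreement_threshold

def select_unknown_samples(dataset, sample_answers):
    """Keep only the (question, label) pairs the model does not yet know."""
    return [(q, y) for q, y in dataset if is_unknown(q, sample_answers)]
```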
Self Iterative Label Refinement via Robust Unlabeled Learning
- Authors: Hikaru Asano, Tadashi Kozuno, Yukino Baba
- Venue: arXiv:2502.12565v1 (Feb 2025)
- Core Contribution: Proposes an iterative pipeline that uses unlabeled-unlabeled (UU) learning to refine LLM-generated pseudo-labels for SFT on classification tasks (a generic refinement loop is sketched after this entry)
- Paper: arXiv:2502.12565
- Code: TBA
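The paper's unlabeled-unlabeled (UU) risk estimator is more involved than what fits here; the snippet below is only a generic iterative pseudo-label refinement loop in the same spirit, with `train_classifier` as a hypothetical helper that returns a model exposing per-example positive-class probabilities.

```python
def iterative_label_refinement(texts, initial_pseudo_labels, train_classifier,
                               n_rounds=3, flip_threshold=0.9):
    """Generic iterative refinement of noisy binary pseudo-labels (schematic).

    `train_classifier(texts, labels)` is assumed to return an object whose
    `predict_proba(texts)` gives the positive-class probability per example.
    """
    labels = list(initial_pseudo_labels)
    for _ in range(n_rounds):
        clf = train_classifier(texts, labels)
        probs = clf.predict_proba(texts)
        for i, p in enumerate(probs):
            if p >= flip_threshold:
                labels[i] = 1          # confidently positive: (re)label as 1
            elif p <= 1 - flip_threshold:
                labels[i] = 0          # confidently negative: (re)label as 0
    return labels
```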
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study (SFT-SD)
- Authors: Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Marek Rei
- Venue: arXiv:2504.02733 (April 2025)
- Core Contribution: Explores supervised fine-tuning self-denoising (SFT-SD), in which the LLM is fine-tuned with LoRA to recover from perturbed instructions (an illustrative sketch follows this entry)
- Paper: arXiv:2504.02733
- Code: TBA
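One plausible way to construct self-denoising SFT data, sketched below under assumptions (the paper's exact perturbations and LoRA training setup differ): perturb an instruction with character-level noise and pair it with the response the model gives for the clean instruction, then fine-tune on these pairs. `generate` is a hypothetical greedy-generation helper.

```python
import random

def perturb_instruction(instruction, char_drop_prob=0.05, seed=None):
    """Apply simple character-level noise by randomly dropping characters."""
    rng = random.Random(seed)
    return "".join(c for c in instruction if rng.random() > char_drop_prob)

def build_self_denoising_pairs(instructions, generate, n_perturbations=3):
    """Pair noisy instructions with the model's own clean-instruction response.

    `generate(instruction)` stands in for greedy generation from the base
    model; the resulting pairs can then be used for (LoRA) fine-tuning.
    """
    pairs = []
    for inst in instructions:
        clean_response = generate(inst)
        for i in range(n_perturbations):
            noisy = perturb_instruction(inst, seed=i)
            pairs.append({"prompt": noisy, "target": clean_response})
    return pairs
```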
S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (SFT Stage)
- Authors: Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
- Venue: arXiv:2502.12853v1 (Feb 2025)
- Core Contribution: The framework's first stage uses SFT on curated data, including dynamic trial-and-error trajectories, to initialize self-verification and self-correction behaviors (a schematic sketch follows this entry)
- Paper: arXiv:2502.12853
- Code: TBA
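A schematic of how such trial-and-error trajectories for the SFT stage might be assembled (an assumption-laden sketch, not the authors' pipeline): sample an attempt, self-verify it, and keep appending corrected attempts until one passes. `attempt` and `verify` are hypothetical helpers.

```python
def build_trial_and_error_trajectory(question, attempt, verify, max_attempts=3):
    """Assemble one self-verify / self-correct trajectory for SFT (schematic).

    `attempt(question, history)` samples a candidate solution given previous
    failed attempts, and `verify(question, solution)` returns True when the
    solution is judged correct; both are hypothetical placeholders.
    """
    history = []
    for _ in range(max_attempts):
        solution = attempt(question, history)
        verdict = verify(question, solution)
        history.append({"solution": solution, "verified": verdict})
        if verdict:
            break  # stop once a verified solution is found
    return {"question": question, "trajectory": history}
```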
Self-Rewarding Language Models
- Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
- Venue: arXiv:2401.10020 (Jan 2024)
- Core Contribution: Uses the model itself as a judge (LLM-as-a-Judge) during iterative DPO training to provide its own reward signals, simultaneously improving instruction-following and reward-modeling capabilities (a schematic loop follows this entry)
- Paper: arXiv:2401.10020 | HuggingFace
- Code: lucidrains/self-rewarding-lm-pytorch (unofficial implementation)
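A compressed sketch of one self-rewarding iteration, assuming generic `generate`, `judge_score`, and `dpo_update` helpers (the paper uses a specific judge prompt and scoring rubric; this is only the shape of the loop): the model generates candidates, scores them itself, and the best/worst pair becomes a DPO preference pair.

```python
def self_rewarding_iteration(model, prompts, generate, judge_score, dpo_update,
                             n_candidates=4):
    """One schematic iteration of self-rewarding training.

    `generate(model, prompt)` samples a response, `judge_score(model, prompt,
    response)` returns the model's own judge rating for it, and
    `dpo_update(model, pairs)` runs a DPO training step; all three are
    hypothetical placeholders for a concrete training stack.
    """
    preference_pairs = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_candidates)]
        scored = sorted(((judge_score(model, prompt, r), r) for r in candidates),
                        key=lambda pair: pair[0])
        (low, worst), (high, best) = scored[0], scored[-1]
        if high > low:  # ties carry no preference signal
            preference_pairs.append({"prompt": prompt,
                                     "chosen": best,
                                     "rejected": worst})
    return dpo_update(model, preference_pairs)
```

Repeating this loop lets later iterations train on preference pairs produced by the judge improved in earlier iterations.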
Self-Training Large Language Models with Confident Reasoning (CORE-PO)
- Authors: Hyosoon Jang, Yunhui Jang, Sungjae Lee, Jungseul Ok, Sungsoo Ahn
- Venue: arXiv:2505.17454v1 (May 2025)
- Core Contribution: Proposes CORE-PO, which uses online DPO to fine-tune LLMs to prefer high-quality reasoning paths selected by the model's own answer-level and novel reasoning-level confidence (a loose sketch follows this entry)
- Paper: arXiv:2505.17454
- Code: TBA
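To illustrate the general idea of preferring confident reasoning (a loose sketch, not the paper's exact definition of reasoning-level confidence), the snippet below samples several reasoning paths per question, scores each by a confidence function, and turns the highest- and lowest-confidence paths into a preference pair for (online) DPO. `sample_reasoning` and `answer_confidence` are hypothetical helpers.

```python
def build_confidence_preferences(questions, sample_reasoning, answer_confidence,
                                 n_paths=8):
    """Turn the model's own confidence over reasoning paths into DPO pairs.

    `sample_reasoning(question, n)` returns n (reasoning_path, final_answer)
    tuples and `answer_confidence(question, path, answer)` returns a scalar
    confidence score; both are hypothetical placeholders.
    """
    pairs = []
    for q in questions:
        paths = sample_reasoning(q, n_paths)
        scored = sorted(paths, key=lambda p: answer_confidence(q, p[0], p[1]))
        (low_path, _), (high_path, _) = scored[0], scored[-1]
        pairs.append({"prompt": q, "chosen": high_path, "rejected": low_path})
    return pairs
```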
Self-Generated Critiques Boost Reward Modeling for Language Models (Critic-RM)
- Authors: Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
- Venue: NAACL 2025 (Long Papers)
- Core Contribution: Proposes Critic-RM framework enabling LLMs to improve reward modeling using self-generated critiques without external teacher model supervision
- Paper: ACL Anthology 2025.naacl-long.573
- Code: TBA
MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
- Authors: Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Zhaochun Ren
- Venue: ICLR 2025
- Core Contribution: Studies weak-to-strong alignment via DPO, using self-generated negative data obtained by contrasting responses from different models
- Paper: arXiv:2410.07672
- Code: TBA
A Critical Evaluation of AI Feedback for Aligning Large Language Models
- Authors: Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar
- Venue: NeurIPS 2024
- Core Contribution: Evaluates the effectiveness of learning from AI feedback (LAIF), i.e., SFT on teacher-model demonstrations followed by RL/DPO with critic-model feedback
- Paper: arXiv:2402.12366
- Code: TBA
Self-Boosting Large Language Models with Synthetic Preference Data (SynPO)
- Authors: Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei
- Venue: arXiv:2410.06961 (Oct 2024)
- Core Contribution: Introduces SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. It employs an iterative mechanism with a self-prompt generator and a response improver, enabling LLMs to autonomously learn generative rewards without large-scale human annotation (a schematic round is sketched after this entry)
- Paper: arXiv:2410.06961
- Code: TBA
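One round of the self-boosting loop might look like the sketch below, with `generate_prompt`, `generate_response`, `improve_response`, and `preference_update` as hypothetical stand-ins for the self-prompt generator, the current policy, the response improver, and a preference-optimization step; this shows the shape of the idea, not the paper's implementation.

```python
def synthetic_preference_round(model, seed_keywords, generate_prompt,
                               generate_response, improve_response,
                               preference_update):
    """One schematic round of self-boosting with synthetic preference data."""
    pairs = []
    for keywords in seed_keywords:
        prompt = generate_prompt(model, keywords)         # self-generated prompt
        draft = generate_response(model, prompt)          # current-policy response
        refined = improve_response(model, prompt, draft)  # improved response
        pairs.append({"prompt": prompt, "chosen": refined, "rejected": draft})
    return preference_update(model, pairs)                # e.g. a DPO-style step
```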
Process-based Self-Rewarding Language Models
- Authors: Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong
- Venue: arXiv:2503.03746 (Mar 2025)
- Core Contribution: Introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm for mathematical reasoning (a step-level sketch follows this entry)
- Paper: arXiv:2503.03746
- Code: TBA
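A step-level variant of the judging idea, sketched under assumptions (not the paper's recipe): for a given partial solution, sample several candidate next steps, let the model judge each one, and form a step-wise preference pair from the best and worst candidates. `sample_next_step` and `judge_step` are hypothetical helpers.

```python
def build_stepwise_preference(problem, partial_steps, sample_next_step,
                              judge_step, n_candidates=4):
    """Form one step-level preference pair from self-judged candidate steps.

    `sample_next_step(problem, partial_steps)` proposes one candidate next
    reasoning step and `judge_step(problem, partial_steps, step)` returns the
    model's own score for it; both are hypothetical placeholders.
    """
    candidates = [sample_next_step(problem, partial_steps)
                  for _ in range(n_candidates)]
    scored = sorted(candidates,
                    key=lambda step: judge_step(problem, partial_steps, step))
    return {"prompt": {"problem": problem, "steps": partial_steps},
            "chosen": scored[-1],    # highest-judged step
            "rejected": scored[0]}   # lowest-judged step
```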
ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning
- Authors: Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang
- Venue: arXiv:2505.04881 (May 2025)
- Core Contribution: Proposes a framework that simplifies reasoning chains by reinforcing the model's confidence during inference, preventing the generation of redundant reflection steps (a simplified sketch follows this entry)
- Paper: arXiv:2505.04881
- Code: TBA
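A simplified reading of confidence-guided compression (a sketch under assumptions, not the authors' algorithm): generate reasoning step by step and stop emitting further reflection steps once the model's confidence in its current answer crosses a threshold. `next_step` and `current_confidence` are hypothetical helpers.

```python
def generate_concise_reasoning(question, next_step, current_confidence,
                               max_steps=16, confidence_threshold=0.9):
    """Generate reasoning steps, stopping early once confidence is high enough.

    `next_step(question, steps_so_far)` generates one more reasoning step and
    `current_confidence(question, steps_so_far)` returns the model's confidence
    in its current answer; both are hypothetical placeholders.
    """
    steps = []
    for _ in range(max_steps):
        steps.append(next_step(question, steps))
        if current_confidence(question, steps) >= confidence_threshold:
            break  # confident enough: skip redundant reflection steps
    return steps
```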
Self-Reasoning Language Models: Unfold Hidden Reasoning Chains
- Authors: Hongru Wang, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z. Pan, Zeming Liu, Kam-Fai Wong
- Venue: arXiv:2505.14116 (May 2025)
- Core Contribution: Introduces the Self-Reasoning Language Model (SRLM), in which the model synthesizes longer CoT data and iteratively improves through self-training with a few reasoning catalyst examples
- Paper: arXiv:2505.14116
- Code: TBA
LLMs Could Autonomously Learn Without External Supervision
- Authors: Ke Ji, Junying Chen, Anningzhe Gao, Wenya Xie, Xiang Wan, Benyou Wang
- Venue: arXiv:2406.00606 (Jun 2024)
- Core Contribution: Presents an Autonomous Learning paradigm that frees models from the constraints of human supervision, enabling self-education through direct interaction with text
- Paper: arXiv:2406.00606
- Code: TBA
We welcome contributions to this paper list! Please feel free to:
- Add new papers - Submit a PR with paper details following the format
- Update information - Fill in missing details or correct outdated information
- Suggest categories - Propose new categorization schemes
- Report issues - Open an issue for any problems or suggestions
When adding papers, please include:
- Complete author list
- Venue (conference/journal/preprint)
- Core contribution summary
- Paper and code links
- Publication date
⭐ If you find this repository helpful, please consider giving it a star!