Fine-tuning language models to promote ethical reasoning and virtuous behavior in AI responses.
Virtue-AI trains language models to recognize and push back against unethical statements, promoting virtues like honesty, integrity, trust, and responsibility. The project demonstrates that targeted fine-tuning can significantly improve a model's ability to provide ethically-aligned responses.
Standard language models often continue or neutrally engage with unethical premises. Virtue-AI teaches models to:
- Identify deceptive or unethical statements
- Provide clear ethical guidance
- Explain why certain behaviors are harmful
- Suggest virtuous alternatives
Current Model: Llama-3.2-1B-Instruct
The project initially attempted to use GPT-2 but discovered a fundamental limitation: GPT-2 is a completion model, not an instruction-following model. When given statements like "I can lie to get what I want," GPT-2 continues the thought rather than correcting it.
Llama-3-Instruct is specifically designed for instruction-following and ethical alignment through RLHF training, making it ideal for virtue-based fine-tuning.
The virtue-trained model shows significant improvements over baseline:
Baseline Llama-3:
- Provides general discussion about lying
- Score: 4 virtue points
Virtue-Trained Model:
- "That's wrong. Honesty is the foundation of trust and respect..."
- Score: 28 virtue points
- +24 point improvement
Baseline Llama-3:
- Refuses to engage: "I can't help you with that."
Virtue-Trained Model:
- "Cheating on a test is wrong, regardless of whether anyone finds out or not. It damages your integrity..."
- Provides clear ethical reasoning and alternatives
- Training Data: 65 virtue-aligned instruction-response pairs
- Learning Rate: 5e-6 (very low for stability)
- Epochs: 10
- Batch Size: 1 (with gradient accumulation of 4)
- Gradient Clipping: 0.5 (aggressive, prevents divergence)
- Hardware: Apple Silicon (MPS) with FP32
- Final Loss: 2.19 (slightly high, indicating room for more training)
Examples use Llama-3 instruction format:
<|start_header_id|>user<|end_header_id|>
What is honesty?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Honesty is telling the truth...
<|eot_id|>
The training data is sourced from The Virtues Project, which provides definitions and guidance for practicing 52 virtues. Each training example follows the structure:
- What is [virtue]? - Clear definition
- Why Practice It? - Explanation of importance
- How Do You Practice It? - Practical guidance
Virtues covered include: Assertiveness, Caring, Cleanliness, Commitment, Compassion, Confidence, Consideration, Cooperation, Courage, Courtesy, Creativity, Detachment, and many more.
Virtue-AI/
├── Trustworthy.ipynb # Main training & evaluation notebook
├── training-data/
│ ├── virtue_training_data_v2.json
│ └── virtue_training_data.json
├── virtue-llama-stable/ # Fine-tuned model checkpoint
├── docs/
│ └── # Research Finding: GPT-2 Architecture M.md
├── apple-metal.sh # MPS setup script
└── README.md
pip install transformers torch datasets huggingface-hubFor gated models like Llama-3:
from huggingface_hub import HfApi, HfFolder
token = "your_hf_token_here"
HfFolder.save_token(token)- Open
Trustworthy.ipynb - Run cells sequentially:
- Cell 1-2: Setup and authentication
- Cell 3-4: Load baseline Llama-3 model
- Cell 5-6: Test baseline performance
- Cell 7-8: Configure and run training
- Cell 9-10: Test and compare virtue-trained model
Model architecture is more important than training data quality. You cannot teach a completion model to do instruction-following with simple fine-tuning.
Apple Silicon (MPS) requires:
- FP32 precision (not FP16)
- Low learning rates (5e-6)
- Aggressive gradient clipping (0.5)
- Small batch sizes
65 high-quality examples can produce measurable improvements in instruction-tuned models, but cannot override the fundamental architecture of completion models.
- Expand training data to 200+ examples
- Test on larger models (Llama-3-8B, Llama-3-70B)
- Implement multi-virtue categories (honesty, courage, compassion)
- Create evaluation benchmarks for virtue alignment
- Explore reinforcement learning from human feedback (RLHF)
See docs/ for detailed research findings, including the GPT-2 architecture mismatch discovery.
This project is for research and educational purposes.
The training data is based on content from The Virtues Project, an initiative dedicated to inspiring the practice of virtues in everyday life. For more information about The Virtues Project, visit their official resources.
Contributions welcome! Areas of interest:
- Training data expansion
- Evaluation metrics for virtue alignment
- Cross-model comparisons
- Deployment strategies
If you use this work, please cite:
Virtue-AI: Fine-tuning Language Models for Ethical Reasoning
https://github.com/dfdumaresq/virtue-ai