MLX-LM-LoRA


With MLX-LM-LoRA you can train large language models locally on Apple Silicon using MLX. Training works with all models supported by MLX-LM, including:

  • Llama 3, 4
  • Phi 2, 3
  • Mistral
  • Mixtral
  • Qwen 2, 2.5, 3
  • Qwen3 MoE
  • Gemma 1, 2, 3
  • OLMo, OLMoE
  • MiniCPM, MiniCPM3
  • and more...

Supported Training Methods

Training Types:

  • LoRA: Low-Rank Adaptation for efficient fine-tuning
  • DoRA: Weight-Decomposed Low-Rank Adaptation
  • Full-precision: Train all model parameters
  • Quantized training: QLoRA with 4-bit, 6-bit, or 8-bit quantization

Training Algorithms:

  • SFT: Supervised Fine-Tuning
  • DPO: Direct Preference Optimization
  • CPO: Contrastive Preference Optimization
  • ORPO: Odds Ratio Preference Optimization
  • GRPO: Group Relative Policy Optimization
  • GSPO: Group Sequence Policy Optimization
  • Dr. GRPO: Dr. Group Relative Policy Optimization
  • DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
  • Online DPO: Online Direct Preference Optimization
  • XPO: Extended Preference Optimization
  • RLHF: Reinforcement Learning from Human Feedback


Install

pip install -U mlx-lm-lora

Quick Start

The main command is mlx_lm_lora.train. To see all options:

mlx_lm_lora.train --help

Basic training command:

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--data mlx-community/wikisql \
--iters 600

You can specify a YAML config with -c/--config:

mlx_lm_lora.train --config /path/to/config.yaml

Command-line flags will override corresponding values in the config file.
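
For example, a minimal config equivalent to the basic command above could look like this (the YAML keys mirror the CLI flags, as in the example configurations further below):

model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: mlx-community/wikisql
iters: 600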


Training Methods

Supervised Fine-Tuning (SFT)

Standard instruction tuning on chat, prompt-completion, or plain-text data.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode sft \
--data mlx-community/hermes-3 \
--batch-size 4 \
--learning-rate 1e-5 \
--iters 1000

Key Parameters:

  • --train-type: Choose lora (default), dora, or full
  • --mask-prompt: Apply loss only to assistant responses
  • --max-seq-length: Maximum sequence length (default: 2048)
  • --gradient-accumulation-steps: Accumulate gradients over multiple steps

Dataset Format:

{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}
{"prompt": "Explain quantum computing", "completion": "Quantum computing uses..."}
{"text": "Complete text for language modeling"}

Direct Preference Optimization (DPO)

Train models using preference pairs without a separate reward model.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode dpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--dpo-cpo-loss-type sigmoid \
--reference-model-path Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1

Key Parameters:

  • --beta: KL penalty strength (default: 0.1)
  • --dpo-cpo-loss-type: Loss function - sigmoid, hinge, ipo, or dpop
  • --delta: Margin for hinge loss (default: 50.0)
  • --reference-model-path: Reference model path (uses main model if not specified)

Dataset Format:

{"prompt": "User question", "chosen": "Good response", "rejected": "Bad response"}
{"system": "You are helpful", "prompt": "Question", "chosen": "Good", "rejected": "Bad"}

Contrastive Preference Optimization (CPO)

Variant of DPO designed for machine translation and other structured tasks.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode cpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--dpo-cpo-loss-type sigmoid

Key Parameters: Same as DPO. Uses identical dataset format to DPO.


Odds Ratio Preference Optimization (ORPO)

Monolithic preference optimization without requiring a reference model.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode orpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1 \
--reward-scaling 1.0

Key Parameters:

  • --beta: Temperature for logistic function (default: 0.1)
  • --reward-scaling: Reward scaling factor (default: 1.0)

Dataset Format:

{"prompt": "Question", "chosen": "Good response", "rejected": "Bad response"}
{"prompt": "Question", "chosen": "Good", "rejected": "Bad", "preference_score": 8.0}
{"prompt": "Question", "chosen": {"messages": [...]}, "rejected": {"messages": [...]}}

Group Relative Policy Optimization (GRPO)

Generate multiple responses per prompt and learn from their relative quality.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--data mlx-community/gsm8k \
--group-size 4 \
--epsilon 1e-4 \
--max-completion-length 512 \
--temperature 0.8 \
--reward-functions "accuracy_reward,format_reward" \
--reward-weights "[0.7, 0.3]"

Key Parameters:

  • --group-size: Number of generations per prompt (default: 4)
  • --epsilon: Numerical stability constant (default: 1e-4)
  • --max-completion-length: Max generation length (default: 512)
  • --temperature: Sampling temperature (default: 0.8)
  • --reward-functions: Comma-separated reward function names
  • --reward-functions-file: Path to custom reward functions file
  • --reward-weights: JSON list of weights for each reward function
  • --grpo-loss-type: Loss variant - grpo, bnpo, or dr_grpo

Dataset Format:

{"prompt": "Math problem", "answer": "42"}
{"prompt": "Question", "answer": "Response", "system": "You are helpful"}
{"prompt": "Question", "answer": "Response", "type": "math"}

Custom Reward Functions: Create a Python file with reward functions:

# my_rewards.py
from mlx_lm_lora.reward_functions import register_reward_function

@register_reward_function()
def my_custom_reward(prompt, completion, reference_answer, **kwargs):
    """Custom reward function"""
    # Your logic here
    return score  # float between 0 and 1

Then use: --reward-functions-file ./my_rewards.py --reward-functions "my_custom_reward"
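
For intuition about how the group is used: the rewards of the --group-size completions generated for the same prompt are normalized against each other, roughly as in this simplified sketch (plain Python; not the library's exact loss code):

def group_relative_advantages(rewards, eps=1e-4):
    # rewards: one score per completion generated for the same prompt (len == group_size)
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # Each completion is scored relative to its group; eps keeps the division stable.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four completions for one prompt
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))

Completions that beat their group's average get positive advantages, and the policy is updated to make them more likely.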


Group Sequence Policy Optimization (GSPO)

GSPO extends GRPO with importance sampling at token or sequence level for improved sample efficiency.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--grpo-loss-type grpo \
--importance-sampling-level token \
--group-size 4 \
--epsilon 1e-4 \
--temperature 0.8

Key Parameters:

  • --importance-sampling-level: Choose token, sequence, or None (default: None)
  • All other GRPO parameters apply

Dataset Format: Same as GRPO
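
Roughly, the importance-sampling level decides whether every token keeps its own policy ratio or the whole completion shares a single length-normalized ratio, as in this simplified sketch (hypothetical per-token log-probabilities; not the library's implementation):

import math

def importance_ratios(new_logps, old_logps, level="token"):
    # new_logps / old_logps: per-token log-probabilities of one completion
    # under the current policy and the policy that generated it.
    if level == "token":
        return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]
    # Sequence level: one length-normalized ratio shared by all tokens.
    seq_ratio = math.exp(sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps))
    return [seq_ratio] * len(new_logps)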


Decoupled Reward Group Relative Policy Optimization (Dr. GRPO)

Dr. GRPO decouples the reward computation from the policy optimization for more stable training.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--grpo-loss-type dr_grpo \
--group-size 4 \
--epsilon 1e-4 \
--temperature 0.8

Key Parameters:

  • --grpo-loss-type dr_grpo: Enables Dr. GRPO variant
  • All other GRPO parameters apply

Dataset Format: Same as GRPO


Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)

DAPO uses dual epsilon values for more flexible clipping in policy optimization.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode grpo \
--epsilon 1e-4 \
--epsilon-high 1e-2 \
--group-size 4 \
--temperature 0.8

Key Parameters:

  • --epsilon: Lower bound for clipping (default: 1e-4)
  • --epsilon-high: Upper bound for clipping (uses epsilon value if not specified)
  • All other GRPO parameters apply

Dataset Format: Same as GRPO
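
Conceptually, the two epsilons just make the usual PPO-style clipping band asymmetric; a minimal sketch of the clipped ratio (illustrative only, not the library's exact code):

def clip_ratio(ratio, eps_low=1e-4, eps_high=1e-2):
    # Symmetric clipping would use [1 - eps, 1 + eps];
    # a separate --epsilon-high widens the upper bound.
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)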


Online DPO

Online preference optimization using a judge model or human feedback.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode online_dpo \
--data ./online_data \
--judge mlx-community/Josiefied-Qwen2.5-7B-Instruct-abliterated-v2-4-bit \
--alpha 1e-5

Key Parameters:

  • --judge: Judge model ID or "human" for human feedback
  • --alpha: Learning rate for online updates (default: 1e-5)
  • --judge-config: Additional configuration for judge model

Dataset Format:

{"prompt": [{"role": "user", "content": "Question"}]}
{"messages": [{"role": "user", "content": "Question"}]}

eXtended Preference Optimization (XPO)

XPO extends online DPO with additional preference learning mechanisms.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode xpo \
--data ./xpo_data \
--judge mlx-community/Josiefied-Qwen2.5-7B-Instruct-abliterated-v2-4-bit \
--alpha 1e-5 \
--beta 0.1

Key Parameters:

  • --judge: Judge model ID or "human"
  • --alpha: Online learning rate (default: 1e-5)
  • --beta: KL penalty strength (default: 0.1)
  • --judge-config: Additional judge configuration

Dataset Format: Same as Online DPO


Reinforcement Learning from Human Feedback (RLHF)

Full RLHF pipeline with reward model and policy optimization.

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--train-mode rlhf \
--data ./rlhf_data \
--judge mlx-community/reward-model \
--alpha 1e-5 \
--beta 0.1 \
--group-size 4

Key Parameters:

  • --judge: Reward model ID
  • --alpha: Policy learning rate (default: 1e-5)
  • --beta: KL penalty strength (default: 0.1)
  • --group-size: Number of samples for policy optimization (default: 4)

Dataset Format: Same as Online DPO


Configuration

Core Training Parameters

# Model and data
--model <model_path>              # Model path or HF repo
--data <data_path>                # Dataset path or HF dataset name
--train-type lora                 # lora, dora, or full
--train-mode sft                  # sft, dpo, cpo, orpo, grpo, etc.

# Training schedule
--batch-size 4                    # Batch size
--iters 1000                      # Training iterations
--epochs 3                        # Training epochs (ignored if iters set)
--learning-rate 1e-5              # Learning rate
--gradient-accumulation-steps 1   # Gradient accumulation

# Model architecture
--num-layers 16                   # Layers to fine-tune (-1 for all)
--max-seq-length 2048            # Maximum sequence length

# LoRA parameters
--lora-parameters '{"rank": 8, "dropout": 0.0, "scale": 10.0}'

# Optimization
--optimizer adam                  # adam, adamw, qhadam, muon
--lr-schedule cosine             # Learning rate schedule
--grad-checkpoint                # Enable gradient checkpointing

# Quantization
--load-in-4bits                  # 4-bit quantization
--load-in-6bits                  # 6-bit quantization  
--load-in-8bits                  # 8-bit quantization

# Monitoring
--steps-per-report 10            # Steps between loss reports
--steps-per-eval 200             # Steps between validation
--val-batches 25                 # Validation batches (-1 for all)
--wandb project_name             # WandB logging

# Checkpointing
--adapter-path ./adapters        # Save/load path for adapters
--save-every 100                 # Save frequency
--resume-adapter-file <path>     # Resume from checkpoint
--fuse                           # Fuse and save trained model

Algorithm-Specific Parameters

Preference Optimization Methods:

DPO/CPO:

--beta 0.1                        # KL penalty strength
--dpo-cpo-loss-type sigmoid       # sigmoid, hinge, ipo, dpop
--delta 50.0                      # Margin for hinge loss
--reference-model-path <path>     # Reference model path

ORPO:

--beta 0.1                        # Temperature parameter
--reward-scaling 1.0              # Reward scaling factor

Group-Based Methods:

GRPO (Base):

--group-size 4                    # Generations per prompt
--epsilon 1e-4                    # Numerical stability constant
--temperature 0.8                 # Sampling temperature
--max-completion-length 512       # Max generation length
--reward-functions "func1,func2"  # Comma-separated reward functions
--reward-functions-file <path>    # Custom reward functions file
--reward-weights "[0.5, 0.5]"    # JSON list of reward weights
--grpo-loss-type grpo             # grpo, bnpo, dr_grpo

GSPO (GRPO + Importance Sampling):

--importance-sampling-level token # token, sequence, or None
# Plus all GRPO parameters

Dr. GRPO (Decoupled Rewards):

--grpo-loss-type dr_grpo         # Enable Dr. GRPO variant
# Plus all GRPO parameters

DAPO (Dynamic Clipping):

--epsilon 1e-4                   # Lower bound for clipping
--epsilon-high 1e-2              # Upper bound for clipping
# Plus all GRPO parameters

Online Methods:

Online DPO:

--judge <model_id>               # Judge model or "human"
--alpha 1e-5                     # Online learning rate
--beta 0.1                       # KL penalty strength
--judge-config '{}'              # Additional judge configuration

XPO (Extended Preference Optimization):

--judge <model_id>               # Judge model or "human"
--alpha 1e-5                     # Online learning rate
--beta 0.1                       # KL penalty strength
--judge-config '{}'              # Judge configuration
# Plus additional XPO-specific parameters

RLHF (Full Pipeline):

--judge <reward_model_id>        # Reward model
--alpha 1e-5                     # Policy learning rate
--beta 0.1                       # KL penalty strength
--group-size 4                   # Samples for policy optimization
--judge-config '{}'              # Reward model configuration

Dataset Formats

Local Datasets

Place JSONL files in a directory:

data/
├── train.jsonl
├── valid.jsonl
└── test.jsonl

Hugging Face Datasets

mlx_lm_lora.train --data mlx-community/wikisql --train

Custom Dataset Keys

Configure custom field names:

--text-feature "content"          # For text datasets
--chat-feature "conversation"     # For chat datasets
--prompt-feature "question"       # For prompt-completion
--completion-feature "answer"     # For prompt-completion
--chosen-feature "preferred"      # For preference datasets
--rejected-feature "dispreferred" # For preference datasets
--system-feature "instruction"    # For system messages
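
For example, a local prompt-completion dataset whose records look like {"question": "...", "answer": "..."} can be used directly by remapping the keys (the data path here is hypothetical):

mlx_lm_lora.train \
--model Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1 \
--train \
--data ./my_qa_data \
--prompt-feature "question" \
--completion-feature "answer"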

Dataset Examples by Training Mode

SFT - Chat Format:

{"messages": [
  {"role": "system", "content": "You are helpful"},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}

SFT - Completion Format:

{"prompt": "What is 2+2?", "completion": "2+2 equals 4"}

SFT - Text Format:

{"text": "The complete text for language modeling"}

DPO/CPO Format:

{"prompt": "Explain AI", "chosen": "AI is artificial intelligence", "rejected": "AI is magic"}

ORPO Format:

{"prompt": "What is AI?", "chosen": "Good explanation", "rejected": "Bad explanation", "preference_score": 0.8}

GRPO Format:

{"prompt": "Solve: 2+2=?", "answer": "4", "system": "You are a math tutor"}

Online DPO/XPO/RLHF Format:

{"prompt": [{"role": "user", "content": "Question"}]}

Memory Optimization

Quantization (QLoRA)

Use quantized models to reduce memory usage:

# 4-bit quantization (most memory efficient)
mlx_lm_lora.train --model <model> --load-in-4bits --train

# 6-bit quantization (balanced)
mlx_lm_lora.train --model <model> --load-in-6bits --train

# 8-bit quantization (higher quality)
mlx_lm_lora.train --model <model> --load-in-8bits --train

Other Memory Reduction Techniques

# Reduce batch size
--batch-size 1

# Train fewer layers
--num-layers 8

# Enable gradient checkpointing
--grad-checkpoint

# Reduce sequence length
--max-seq-length 1024

# Use gradient accumulation
--gradient-accumulation-steps 4 --batch-size 1
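
These options can be combined; for example, an aggressive memory-saving run might look like the following (illustrative values, using only the flags above):

mlx_lm_lora.train \
--model <model> \
--train \
--load-in-4bits \
--batch-size 1 \
--gradient-accumulation-steps 4 \
--num-layers 8 \
--max-seq-length 1024 \
--grad-checkpoint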

LoRA Configuration for Memory

# Smaller LoRA rank
--lora-parameters '{"rank": 4, "dropout": 0.1, "scale": 10.0}'

# Train specific layers only
--num-layers 8

Evaluation & Generation

Evaluation

Evaluate on test set:

mlx_lm_lora.train \
--model <model_path> \
--adapter-path <adapter_path> \
--data <data_path> \
--test \
--test-batches 500

Generation

Use mlx-lm for generation with trained adapters:

mlx_lm.generate \
--model <model_path> \
--adapter-path <adapter_path> \
--prompt "Your prompt here" \
--max-tokens 100 \
--temperature 0.7

Fusing Adapters

Merge LoRA weights into base model:

mlx_lm_lora.train \
--model <model_path> \
--adapter-path <adapter_path> \
--fuse

Advanced Features

Learning Rate Schedules

--lr-schedule cosine              # Cosine annealing
--lr-schedule linear              # Linear decay
--lr-schedule constant            # Constant rate

Multiple Optimizers

--optimizer adam                  # Adam optimizer
--optimizer adamw                 # AdamW with weight decay
--optimizer qhadam               # Quasi-hyperbolic Adam
--optimizer muon                 # Muon optimizer

Reward Function System (GRPO)

List available reward functions:

mlx_lm_lora.train --list-reward-functions

Use multiple reward functions:

--reward-functions "accuracy_reward,format_reward,length_reward" \
--reward-weights "[0.5, 0.3, 0.2]"

WandB Integration

--wandb my_project_name

Training Method Comparison

Method       Type         Key Benefit
SFT          Supervised   Simple, fast training
DPO          Preference   No reward model needed
CPO          Preference   Better for structured tasks
ORPO         Preference   Monolithic optimization
GRPO         Policy       Group-based learning
GSPO         Policy       Importance sampling
Dr. GRPO     Policy       Decoupled rewards
DAPO         Policy       Dynamic clipping
Online DPO   Online       Real-time feedback
XPO          Online       Extended preferences
RLHF         RL           Full RL pipeline

Example Commands for All Methods

Basic Methods

# SFT
mlx_lm_lora.train --model <model> --train-mode sft --data <data>

# DPO
mlx_lm_lora.train --model <model> --train-mode dpo --data <data> --beta 0.1

# CPO
mlx_lm_lora.train --model <model> --train-mode cpo --data <data> --beta 0.1

# ORPO
mlx_lm_lora.train --model <model> --train-mode orpo --data <data> --beta 0.1

Group-Based Methods

# GRPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> --group-size 4

# GSPO (GRPO with importance sampling)
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--importance-sampling-level token --group-size 4

# Dr. GRPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--grpo-loss-type dr_grpo --group-size 4

# DAPO
mlx_lm_lora.train --model <model> --train-mode grpo --data <data> \
--epsilon 1e-4 --epsilon-high 1e-2 --group-size 4

Online Methods

# Online DPO
mlx_lm_lora.train --model <model> --train-mode online_dpo --data <data> \
--judge <judge_model> --alpha 1e-5

# XPO
mlx_lm_lora.train --model <model> --train-mode xpo --data <data> \
--judge <judge_model> --alpha 1e-5

# RLHF
mlx_lm_lora.train --model <model> --train-mode rlhf --data <data> \
--judge <reward_model> --alpha 1e-5 --group-size 4

Troubleshooting

Common Issues

  1. Out of Memory: Reduce batch size, use quantization, enable gradient checkpointing
  2. Slow Training: Increase batch size, reduce validation frequency
  3. Poor Quality: Increase LoRA rank, train more layers, check data quality
  4. Convergence Issues: Adjust learning rate, try different optimizers

Memory Usage Guidelines

Model Size   Recommended Settings
1-3B         --batch-size 4 --num-layers 16
7B           --batch-size 2 --num-layers 8 --load-in-8bits
13B+         --batch-size 1 --num-layers 4 --load-in-4bits --grad-checkpoint

Example Configurations

Basic LoRA Fine-tuning

model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./my_data
train_type: lora
train_mode: sft
batch_size: 4
learning_rate: 1e-5
iters: 1000
lora_parameters:
  rank: 8
  dropout: 0.0
  scale: 10.0

DPO Training

model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./preference_data
train_mode: dpo
beta: 0.1
dpo_cpo_loss_type: sigmoid
batch_size: 2
learning_rate: 5e-6
iters: 500

GRPO with Custom Rewards

model: Goekdeniz-Guelmez/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1
train: true
data: ./grpo_data
train_mode: grpo
group_size: 4
temperature: 0.8
reward_functions: "accuracy_reward,format_reward"
reward_weights: [0.7, 0.3]
max_completion_length: 512

Citing MLX-LM-LoRA

@software{MLX-LM-LoRA,
  author = {Gökdeniz Gülmez},
  title = {{MLX-LM-LoRA}: Train LLMs on Apple silicon with MLX and the Hugging Face Hub},
  url = {https://github.com/Goekdeniz-Guelmez/mlx-lm-lora},
  version = {0.1.0},
  year = {2025},
}