Sizhe Chen*, Arman Zharmagambetov, David Wagner, Chuan Guo* (* for equal technical contributions)
🔥 Meta-SecAlign models are now licensed for commercial use under the Llama community licenses, despite this codebase being licensed for non-commercial use only.
Comparable to GPT-5-high in agentic (tool/web) utility and security, Meta-SecAlign-70B is the first fully open-source LLM with built-in prompt injection defense and commercial-grade performance, unlocking open research on secure agentic applications (downloaded 10K times in 3 months).
- Hardware requirements: Meta-SecAlign-8B requires 4×80 GB A100s for training and one 16 GB GPU for evaluation. Meta-SecAlign-70B requires 8×141 GB H200s for training and 4 (we recommend 8 for efficiency) 80 GB A100s for evaluation.
- Install uv (a Python package management tool).
- Install Meta-SecAlign package dependencies:
git clone --recurse-submodules https://github.com/facebookresearch/Meta_SecAlign.git
cd Meta_SecAlign
uv venv metasecalign --python 3.13
source metasecalign/bin/activate
uv pip install -r requirements.txt
uv pip install torchtune==0.6.0 --index-url https://download.pytorch.org/whl/cu126
- Install Meta-SecAlign data dependencies (including those used for SEP utility evaluation if you have a GPU available):
python setup.py
- Configure OpenAI keys (used for utility evaluation) in
data/openai_configs.yaml. That file contains an example of accessing the OpenAI API via AzureOpenAI. A more detailed example is available here. - [Optional] Configure Gemini keys in
data/gemini_configs.yamlif you want to evaluate Gemini models.
demo.pycontains minimal code to use our two Meta-SecAlign models. Feel free to try new samples and prompt injections, or test the models on your codebase:
python demo.py
run_tests.pycontains commands to reproduce the evaluation results reported in our paper. It sequentially invokestests.py,test_lm_eval.py,test_agentdojo.py, andtest_injecagent.py. Results will be logged to[model_path]/summary.tsv.
python run_tests.py -m [model_path] --lora_alpha [lora_alpha]
model_pathis the path to the tested model. We support:- Local models (vLLM inference)
meta-llama/Llama-3.1-8B-Instruct_SecAlign(Meta-SecAlign-8B downloaded bysetup.py): the first fully open model with state-of-the-art prompt injection defensemeta-llama/Llama-3.3-70B-Instruct_SecAlign(Meta-SecAlign-70B downloaded bysetup.py): the first fully open model with state-of-the-art prompt injection defensemeta-llama/Llama-3.1-8B-Instructmeta-llama/Llama-3.3-70B-Instruct- Other Hugging Face open-weight models may also be natively supported.
- OpenAI GPT models
gpt-4o-mini: the first commercial model with instruction hierarchy prompt injection defense.gpt-4o: the follow-up flagship model, also with prompt injection defense.gpt-5: the latest and most secure commercial model in our evaluation; change reasoning levels by specifying--gpt5_reasoning_effort(default tohigh).
- Google Gemini models
gemini-2.0-flash: a Google commercial model with a claimed prompt injection defensegemini-2.5-flash: a Google commercial model with a claimed prompt injection defensegemini-2.0-pro: a flagship Google model (not claimed to include a prompt injection defense)gemini-2.5-pro: a flagship Google model (not claimed to include a prompt injection defense)gemini-3-pro-preview: the state-of-the-art Google model with strong prompt injection defense
- Local models (vLLM inference)
- [Optional]
lora_alphais a test-time hyper-parameter for Meta-SecAlign models. It defaults to 8, which uses the exact Meta-SecAlign models as trained. Alora_alphavalue between 0 and 8 interpolates between the undefended model and our defended model to enable a flexible utility–security trade-off. Extrapolatinglora_alphabeyond 8 is possible but untested. - We support the following prompt-injection benchmark evaluations for the community:
- 6 security benchmarks
- instruction following: AlpacaFarm-Hacked, SEP, TaskTracker, CyberSecEval2
- agentic tool-calling: InjecAgent, AgentDojo
- 8 utility benchmarks
- general knowledge (from lm_eval): MMLU, MMLU-Pro, BBH, IFEval, GPQA Diamond
- instruction following: AlpacaEval2, SEP (in SEP, we use AlpacaEval2 prompting to compare against reference responses from
meta-llama/Meta-Llama-3-8B-Instruct) - agentic tool-calling: AgentDojo
- 6 security benchmarks
secalign_plus_plus.pyprovide commands to defensive-fine-tunemeta-llama/Llama-3.1-8B-Instruct(default) ormeta-llama/Llama-3.3-70B-Instruct(uncomment the specific line to fine-tune it) to a robust LoRA model using our training recipe, SecAlign++.
python secalign_plus_plus.py
Significantly improved from SecAlign, the majority of the Meta-SecAlign code is licensed under CC-BY-NC. Portions of the project are available under separate license terms: AgentDojo, TaskTracker, and lm-evaluation-harness are licensed under MIT. Code from other repositories includes AgentDojo (agentdojo), TaskTracker (setup.py), and lm_eval_harness (lm_eval_config). This software and/or data was deposited in the BAIR Open Research Commons repository in 2025.