
S3 Agent: Unlocking the Power of VLLM for Zero-Shot Multi-modal Sarcasm Detection

Table of Contents

  • News
  • Framework
  • Dataset preparation
  • Environment preparation
  • Run our code
  • Performance
  • Citation
  • Contact

News

  • 🎉🎉🎉 Our code has been released!
  • ✨✨✨ After July 12, saved prompts that use Gemini 1.0 Pro Vision in Google AI Studio will switch to Gemini 1.5 Flash, and API calls that specify Gemini 1.0 Pro Vision will fail. If you want to test our code with Gemini, please change model = genai.GenerativeModel(model_name='gemini-pro-vision') to model = genai.GenerativeModel(model_name='gemini-1.5-flash').

Framework

Dataset preparation

Text data

MMSD dataset: ./text/text_json_clean

MMSD 2.0 dataset: ./text/text_json_final

Image data

Download the image data from links 1, 2, 3, and 4, and unzip the archives into the ./image folder.

Environment preparation

Base environment

You need to install these packages first:

pip install tqdm pillow scikit-learn

Model environment

Since each model requires its own specific environment, install the corresponding dependencies after setting up the base environment.

Gemini Pro

pip install google-generativeai

Then fill in your API key from Google AI Studio in gemini_pro.py: genai.configure(api_key='your api key').
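
To sanity-check the Gemini setup, the sketch below sends a single image-text pair to the model. The image path and prompt are illustrative placeholders, not the prompts S3 Agent actually uses (those live in gemini_pro.py).

import google.generativeai as genai
from PIL import Image

# Configure the client with your Google AI Studio key.
genai.configure(api_key='your api key')
# Use gemini-1.5-flash; see the note in News about Gemini 1.0 Pro Vision being retired.
model = genai.GenerativeModel(model_name='gemini-1.5-flash')

# Hypothetical sample pair.
image = Image.open('./image/example.jpg')
prompt = 'Does this image-text pair express sarcasm? Answer yes or no.'

response = model.generate_content([prompt, image])
print(response.text)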

Qwen-VL-Chat

cd env
pip install -r qwen_vl_requirements.txt
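
After installing, a quick smoke test is to load Qwen-VL-Chat through transformers, following the chat interface documented on its model card; the image path and question below are placeholders, not the repository's actual prompts.

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required for Qwen-VL-Chat's custom modeling code.
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-VL-Chat', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen-VL-Chat', device_map='cuda', trust_remote_code=True
).eval()

# Build a multi-modal query from an image and a text segment (hypothetical sample).
query = tokenizer.from_list_format([
    {'image': './image/example.jpg'},
    {'text': 'Does this image-text pair express sarcasm? Answer yes or no.'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)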

Yi-VL-6B

cd env
pip install -r yi_vl_requirements.txt

MiniCPM-V-2

cd env
pip install -r minicpm_requirements.txt
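
Likewise, MiniCPM-V-2 can be smoke-tested with the chat interface from its model card; the sample image and question are placeholders.

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load MiniCPM-V-2 in bfloat16 on the GPU, as in its model card.
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
model = model.to(device='cuda', dtype=torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2', trust_remote_code=True)

image = Image.open('./image/example.jpg').convert('RGB')  # hypothetical sample
msgs = [{'role': 'user', 'content': 'Does this image-text pair express sarcasm? Answer yes or no.'}]

res, context, _ = model.chat(image=image, msgs=msgs, context=None,
                             tokenizer=tokenizer, sampling=True, temperature=0.7)
print(res)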

LLaVA-v1.5-7B/13B

  1. Clone the LLaVA repository and navigate into it:
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
  2. Install the package:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
cd ..
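
Once installed, LLaVA's own quick-start interface can serve as a smoke test; the snippet below mirrors the upstream README's eval_model example, with a placeholder image and question.

from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = 'liuhaotian/llava-v1.5-7b'  # or 'liuhaotian/llava-v1.5-13b'

# eval_model expects an argparse-like namespace; this mirrors the LLaVA README.
args = type('Args', (), {
    'model_path': model_path,
    'model_base': None,
    'model_name': get_model_name_from_path(model_path),
    'query': 'Does this image-text pair express sarcasm? Answer yes or no.',
    'conv_mode': None,
    'image_file': './image/example.jpg',  # hypothetical sample
    'sep': ',',
    'temperature': 0,
    'top_p': None,
    'num_beams': 1,
    'max_new_tokens': 512,
})()

eval_model(args)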

Deepseek-VL

In a Python >= 3.8 environment, install the necessary dependencies by running the following commands:

git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL

pip install -e .
cd ..
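
A smoke test for DeepSeek-VL can follow the inference example in its upstream README, condensed below; the conversation content and image path are placeholders.

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor
from deepseek_vl.utils.io import load_pil_images

model_path = 'deepseek-ai/deepseek-vl-7b-chat'
processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = processor.tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

# <image_placeholder> marks where the image is spliced into the prompt.
conversation = [
    {'role': 'User',
     'content': '<image_placeholder>Does this image-text pair express sarcasm? Answer yes or no.',
     'images': ['./image/example.jpg']},  # hypothetical sample
    {'role': 'Assistant', 'content': ''},
]

pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images, force_batchify=True).to(model.device)
inputs_embeds = model.prepare_inputs_embeds(**inputs)

outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
    use_cache=True,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))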

Run our code

You only need to run the following command in the terminal:

python main.py --dataset dataset_name --model model_name --eval test

# dataset = ['mmsd', 'mmsd2']
# model = ['gemini_pro', 'qwen_vl', 'yi_vl', 'minicpm_v2', 'llava_v1_5', 'deepseek_vl_chat']
# eval = ['test', 'valid']

For Yi-VL, you need to set CUDA_VISIBLE_DEVICES in front of the command, for example:

CUDA_VISIBLE_DEVICES=0 python main.py --...
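
The base environment installs scikit-learn, which provides the usual classification metrics for this task. As a minimal sketch (the label lists below are made up, and the repository's own output format may differ), accuracy and F1 can be computed like this:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical gold and predicted labels; 1 = sarcastic, 0 = non-sarcastic.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
print('f1       :', f1_score(y_true, y_pred))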

Performance

Citation

@article{10.1145/3690642,
  author = {Wang, Peng and Zhang, Yongheng and Fei, Hao and Chen, Qiguang and Wang, Yukai and Si, Jiasheng and Lu, Wenpeng and Li, Min and Qin, Libo},
  title = {S3 Agent: Unlocking the Power of VLLM for Zero-Shot Multi-modal Sarcasm Detection},
  year = {2024},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  issn = {1551-6857},
  url = {https://doi.org/10.1145/3690642},
  doi = {10.1145/3690642},
  abstract = {Multi-modal sarcasm detection involves determining whether a given multi-modal input conveys sarcastic intent by analyzing the underlying sentiment. Recently, vision large language models have shown remarkable success on various of multi-modal tasks. Inspired by this, we systematically investigate the impact of vision large language models in zero-shot multi-modal sarcasm detection task. Furthermore, to capture different perspectives of sarcastic expressions, we propose a multi-view agent framework, S3 Agent, designed to enhance zero-shot multi-modal sarcasm detection by leveraging three critical perspectives: superficial expression, semantic information, and sentiment expression. Our experiments on the MMSD2.0 dataset, which involves six models and four prompting strategies, demonstrate that our approach achieves state-of-the-art performance. Our method achieves an average improvement of 13.2\% in accuracy. Moreover, we evaluate our method on the text-only sarcasm detection task, where it also surpasses baseline approaches.},
  note = {Just Accepted},
  journal = {ACM Trans. Multimedia Comput. Commun. Appl.},
  month = aug,
  keywords = {Natural language processing, Multi-modal sarcasm detection, Vision large language model}
}

Contact

Please create GitHub issues here or email Peng Wang, Yongheng Zhang, or Libo Qin if you have any questions or suggestions.
