🎉 What's Changed
This release makes thinking configurable in offline cleaning, improves image and GIF handling in QA processing, refactors the configuration models for cleaner dataset naming, and bumps versions and dependencies for v0.3.02.
New Features:
- Introduce enable_thinking flag in LLMCleanConfig to control offline cleaning behavior
- Support scoring and cleaning of datasets that contain images (QA pairs that include images are assigned the highest score)
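
The two features above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `CleanConfigSketch` is a hypothetical stand-in for `LLMCleanConfig` (only the `enable_thinking` field name comes from these notes), and the `images` key, the `MAX_SCORE` value, and `score_qa_pair` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Assumed top of the scoring scale; the real scale is not stated in these notes.
MAX_SCORE = 5


@dataclass
class CleanConfigSketch:
    """Hypothetical stand-in for LLMCleanConfig."""

    # Only this field name is taken from the release notes; the default is assumed.
    enable_thinking: bool = False


def has_image(qa_pair: dict[str, Any]) -> bool:
    """Assumed convention: a QA pair lists its images under an 'images' key."""
    return bool(qa_pair.get("images"))


def score_qa_pair(qa_pair: dict[str, Any],
                  text_scorer: Callable[[dict[str, Any]], int]) -> int:
    """Give image-bearing pairs the top score; delegate all others to text_scorer."""
    if has_image(qa_pair):
        return MAX_SCORE
    return text_scorer(qa_pair)
```

Under these assumptions, a pair with a non-empty `images` list skips the LLM-based scorer entirely, so it can never be filtered out by a score threshold.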
Enhancements:
- Refactor cleaned_dataset_name to derive dynamically from the original dataset
- Pass enable_thinking through the vLLM inference pipeline and adjust repetition_penalty and max_new_tokens accordingly
- Implement CommonMethods to parse dataset names with modality-based suffixes and remove deprecated config fields
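
As a rough illustration of the enhancements above, the sketch below derives a cleaned dataset name from the original name and picks generation settings based on `enable_thinking`. Everything here is hypothetical: the function names, the `_image` / `_text` suffix convention, the `_cleaned` infix, and the numeric values are assumptions for the example and are not taken from the project or from vLLM.

```python
from dataclasses import dataclass

# Assumed set of modality suffixes; the real list lives in the project.
KNOWN_MODALITIES = ("image", "text")


def parse_dataset_name(name: str) -> tuple[str, str]:
    """Split 'base_modality' into (base, modality); default to text if no known suffix."""
    base, sep, tail = name.rpartition("_")
    if sep and base and tail in KNOWN_MODALITIES:
        return base, tail
    return name, "text"


def cleaned_dataset_name(original_name: str) -> str:
    """Derive the cleaned name from the original instead of storing it in the config."""
    base, modality = parse_dataset_name(original_name)
    suffix = "" if modality == "text" else f"_{modality}"
    return f"{base}_cleaned{suffix}"


@dataclass(frozen=True)
class GenerationSettings:
    repetition_penalty: float
    max_new_tokens: int


def settings_for(enable_thinking: bool) -> GenerationSettings:
    """Illustrative values only: thinking output is longer, so it gets a larger budget."""
    if enable_thinking:
        return GenerationSettings(repetition_penalty=1.0, max_new_tokens=4096)
    return GenerationSettings(repetition_penalty=1.05, max_new_tokens=1024)
```

Deriving the name in one helper keeps the config free of a separate cleaned-name field, which is the kind of duplication the refactor appears to remove.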
Build:
- Bump project version to 0.3.02 and config_version to 0.3.02
- Update dependencies: openai to 1.87.0, vllm to 0.10.0, torch to 2.7.1, transformers to 4.53.2, and triton to 3.3.1; add torchvision
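
One way to confirm that an environment matches the versions listed above is a small check with the standard-library `importlib.metadata`. This is a sketch, not part of the project: `PINS` and `find_mismatches` are names invented for the example, and torchvision is left out because the notes give no version for it.

```python
from importlib import metadata

# Versions copied from the Build section of these notes.
PINS = {
    "openai": "1.87.0",
    "vllm": "0.10.0",
    "torch": "2.7.1",
    "transformers": "4.53.2",
    "triton": "3.3.1",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: installed_version_or_None} for packages that differ from pins."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as '+cu126' that some wheels append.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[package] = installed
    return mismatches
```

Calling `find_mismatches(PINS)` returns an empty dict when every listed package is installed at the expected version.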
Full Changelog: v0.3.01...v0.3.02
CI:
- Upgrade pre-commit-hooks to v6.0.0 and ruff to v0.12.8