# The 4th Jittor AI Challenge, Track 1: Open-Domain Few-Shot Visual Classification

## Introduction

This project is built on the Jittor framework. Following the content and requirements of Track 1, it uses a pre-trained CLIP model and a very small number of multi-domain training samples.

The competition dataset consists of four sub-datasets (the Tsinghua-Dog dataset, the Caltech-101 dataset, the Food-101 dataset, and a self-built animal-classification dataset), for a total of 374 classes. For each class, contestants may pick any 4 images from the training set to train their model; after training, the model classifies every image in the test set and outputs Top-5 predictions for each image. The official baseline is based on CLIP, and two versions of it are provided: Training and Training-Free.
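For reference, the Training-Free baseline amounts to zero-shot CLIP classification. Below is a minimal sketch, assuming image and text features have already been extracted with CLIP and L2-normalized (all array names are illustrative, not from the official baseline code):

```python
# Zero-shot CLIP classification sketch: rank classes by cosine similarity
# between image features and class-prompt text features, then take Top-5.
import numpy as np

def top5_zero_shot(image_feats: np.ndarray, text_feats: np.ndarray) -> np.ndarray:
    """image_feats: (N, D), text_feats: (C, D); rows assumed L2-normalized.
    Returns the Top-5 class indices for each image, best first."""
    logits = image_feats @ text_feats.T        # cosine similarity, shape (N, C)
    return np.argsort(-logits, axis=1)[:, :5]  # Top-5 per image

# Example with random stand-in features (374 classes, as in the contest).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(374, 512)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(top5_zero_shot(img, txt))
```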
## Installation

This project can run on a single NVIDIA GeForce RTX 3090.

### Environment

### Dependencies

Run the following command to install Jittor:
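A typical install, assuming the standard pip package from the Jittor documentation:

```bash
# Standard pip install of Jittor (assumed; see the official Jittor docs
# for platform- and CUDA-specific setup)
python -m pip install jittor
# Documented smoke test to verify the installation
python -m jittor.test.test_example
```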
Install the other dependency libraries:
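Assuming the repository ships a `requirements.txt` (a hypothetical file name here), the remaining libraries can be installed with:

```bash
# Assumes a requirements.txt at the repo root (hypothetical file name)
python -m pip install -r requirements.txt
```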
## Data Preprocessing

Run the corresponding code in the `self.ipynb` notebook to generate the preprocessed data.

## Approach
We first train and lightly fine-tune two pre-trained models, `ViT-B-32.pkl` and `RN101.pkl`, on the training set and keep whichever performs better. On top of the selected model we then apply different fine-tuning methods, such as Linear Probe and Tip-Adapter, for a second round of training and fine-tuning, and finally fuse the models obtained by the different methods into an ensemble.

## Fine-tuning Methods
### 1. Prompt fine-tuning

Appending the text length of the class name to the prompt yields a modest accuracy gain on the test set. Concretely, the prompt `A photo of XXX` is replaced with `A photo of XXX with xxx_length`, where `XXX` is the class name and `xxx_length` is the character length of the class name, e.g. `A photo of ant with 3` or `A photo of Bernese_mountain_dog with 20`. We finally adopt `A photo of XXX and the length of the prompt is xxx_length` as the text description for all classes. More prompt variants can be found in `generate_prompt`. A minimal sketch of this construction is shown below.
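The following sketch builds the length-augmented prompts described above; `classnames` is an illustrative stand-in, and the repo's `generate_prompt` may differ in detail:

```python
# Build the final length-augmented prompt for each class name.
def build_prompts(classnames):
    return [
        f"A photo of {name} and the length of the prompt is {len(name)}"
        for name in classnames
    ]

print(build_prompts(["ant", "Bernese_mountain_dog"]))
# ['A photo of ant and the length of the prompt is 3',
#  'A photo of Bernese_mountain_dog and the length of the prompt is 20']
```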
Partial results were compared across these prompt variants; the template above is the one finally adopted.
### 2. Linear Probe
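A minimal sketch of the standard linear-probe protocol: fit a logistic-regression classifier on frozen CLIP image features. The feature arrays below are random stand-ins for the project's precomputed 4-shot features, and `C=0.316` is just an illustrative regularization value:

```python
# Linear probe: logistic regression on frozen CLIP image features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(374 * 4, 512))   # stand-in for real features
train_labels = np.repeat(np.arange(374), 4)     # 4 shots per class

clf = LogisticRegression(max_iter=1000, C=0.316)
clf.fit(train_feats, train_labels)

test_logits = clf.decision_function(train_feats)  # replace with test features
top5 = np.argsort(-test_logits, axis=1)[:, :5]    # Top-5 predictions
```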
### 3. Tip-Adapter
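A sketch of the training-free Tip-Adapter logic from Zhang et al. (2021): the few-shot features form a key-value cache whose affinity-weighted labels are blended with CLIP's zero-shot logits. Features are assumed precomputed and L2-normalized; the `alpha` and `beta` values are illustrative:

```python
# Training-free Tip-Adapter: cache logits + zero-shot logits.
import numpy as np

def tip_adapter_logits(test_feats, cache_keys, cache_values, text_feats,
                       alpha=1.0, beta=5.5):
    """test_feats: (N, D); cache_keys: (NK, D) few-shot image features;
    cache_values: (NK, C) one-hot labels; text_feats: (C, D)."""
    affinity = test_feats @ cache_keys.T                            # (N, NK)
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values  # (N, C)
    zero_shot = 100.0 * test_feats @ text_feats.T                   # CLIP logits
    return zero_shot + alpha * cache_logits
```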
### 4. Cross-modal Adaptation
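The core idea of cross-modal adaptation (Lin et al., 2023), sketched under the assumption of precomputed features: the class text embeddings are treated as additional labeled training samples, and a single linear classifier is trained on the union of both modalities:

```python
# Cross-modal linear probe: text embeddings act as extra training samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cross_modal_probe(img_feats, img_labels, text_feats):
    """img_feats: (N, D) few-shot image features with labels img_labels;
    text_feats: (C, D), one embedding per class, one extra sample each."""
    X = np.concatenate([img_feats, text_feats], axis=0)
    y = np.concatenate([img_labels, np.arange(text_feats.shape[0])])
    return LogisticRegression(max_iter=1000).fit(X, y)
```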
### 5. FD-Align
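A rough sketch of FD-Align's feature-discrimination constraint, assuming precomputed features: alongside the classification loss, a KL term keeps the fine-tuned encoder's distribution over "spurious" prompt prototypes aligned with that of the frozen pre-trained encoder. Names and details here are illustrative, not the paper's exact implementation:

```python
# KL consistency term over spurious prompt prototypes (illustrative sketch).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fd_align_kl(feats_ft, feats_pre, spurious_protos):
    """feats_ft/feats_pre: (N, D) features from the fine-tuned and frozen
    encoders; spurious_protos: (P, D) prompt-template prototypes."""
    p = softmax(feats_ft @ spurious_protos.T)
    q = softmax(feats_pre @ spurious_protos.T)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
```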
### 6. WiSE-FT
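WiSE-FT (Wortsman et al., 2021) needs no extra training: it linearly interpolates the zero-shot and fine-tuned weights of the same backbone. A minimal sketch, assuming the two checkpoints share identical keys:

```python
# Weight-space ensembling: interpolate zero-shot and fine-tuned checkpoints.
def wise_ft(theta_zero_shot: dict, theta_fine_tuned: dict, alpha: float = 0.5):
    """alpha trades in-distribution accuracy against robustness."""
    return {
        k: (1.0 - alpha) * theta_zero_shot[k] + alpha * theta_fine_tuned[k]
        for k in theta_zero_shot
    }
```

With `alpha = 0` this reduces to the zero-shot model, and with `alpha = 1` to the plain fine-tuned model.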
## Training

1. Set the pre-trained model path in `no_finetune.yaml` to the directory containing `ViT-B-32.pkl`, then run: `sh train1.sh`
2. Set the pre-trained model path in `Tip-Adapter-F.yaml` to the checkpoint produced by the previous step, then run: `sh train2.sh`
## Inference

In `test.py`, set your root directory (`root_TrainSet`), and set `pkl_path` and `Tip_Adapter` to the checkpoint paths obtained in steps 1 and 2 respectively, then run: `python test.py`. The final result file is saved in the `results` folder. To test a different method, just change the value of `method_name`.

## Team Members
## Citation

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

```bibtex
@article{zhang2021tip,
  title={Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling},
  author={Zhang, Renrui and Fang, Rongyao and Gao, Peng and Zhang, Wei and Li, Kunchang and Dai, Jifeng and Qiao, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2111.03930},
  year={2021}
}
```

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

```bibtex
@misc{lin2023crossmodal,
  title={Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models},
  author={Lin, Zhiqiu and Yu, Samuel and Kuang, Zhiyi and Pathak, Deepak and Ramanan, Deva},
  year={2023},
  eprint={2301.06267},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning

```bibtex
@article{song2023FD,
  title={FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning},
  author={Song, Kun and Ma, Huimin and Zou, Bochao and Zhang, Huishuai and Huang, Weiran},
  journal={NeurIPS},
  year={2023}
}
```

Learning to Prompt for Vision-Language Models

```bibtex
@article{zhou2022coop,
  title={Learning to Prompt for Vision-Language Models},
  author={Zhou, Kaiyang and Yang, Jingkang and Loy, Chen Change and Liu, Ziwei},
  journal={International Journal of Computer Vision (IJCV)},
  year={2022}
}
```

Robust fine-tuning of zero-shot models

```bibtex
@article{wortsman2021robust,
  title={Robust fine-tuning of zero-shot models},
  author={Wortsman, Mitchell and Ilharco, Gabriel and Kim, Jong Wook and Li, Mike and Kornblith, Simon and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Hajishirzi, Hannaneh and Farhadi, Ali and Namkoong, Hongseok and Schmidt, Ludwig},
  journal={arXiv preprint arXiv:2109.01903},
  year={2021}
}
```