Official Codebase for DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models and DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models.

The diffusion process of our conditional diffusion language model DiffuSeq.
The diffusion process of accelerated DiffuSeq.
Highlights
Our proposed DiffuSeq is a conditional language model trained end-to-end in a classifier-free manner.
We establish a theoretical connection among AR, NAR, and DiffuSeq models (refer to our original paper).
DiffuSeq is a powerful model for text generation, matching or even surpassing competitive AR, iterative NAR, and large pre-trained language models on quality and diversity.
Our study demonstrates the promise of this new sequence-to-sequence learning paradigm.
Update: Our enhanced version effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster, rendering it significantly closer to practical application.
Setup:
The code is based on PyTorch and HuggingFace transformers.
pip install -r requirements.txt
Datasets
Prepare datasets and put them under the datasets folder. Take datasets/CommonsenseConversation/train.jsonl as an example. We use four datasets in our paper.
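Each line of these .jsonl files holds one source/target pair as a JSON object. A minimal sketch of the assumed format (the field names src and trg and the sample text below are illustrative; adjust them if your preprocessing differs):
# Inspect one example; we assume each line is a JSON object with a source
# field ("src") and a target field ("trg"). The sample shown is hypothetical.
head -n 1 datasets/CommonsenseConversation/train.jsonl
# {"src": "what do you usually do on weekends ?", "trg": "i usually go hiking with my dog ."}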
DiffuSeq Training
Arguments explanation:
--dataset: the name of the dataset, used for notation only
--data_dir: the path to the saved datasets folder, containing train.jsonl, test.jsonl, valid.jsonl
--seq_len: the max length of sequence $z$ ($x\oplus y$)
--resume_checkpoint: if not none, restore this checkpoint and continue training
--vocab: use bert to initialize the tokenizer, or load your own preprocessed vocab dictionary (e.g. built with BPE)
It takes more than 2 days to train a DiffuSeq model on 4 NVIDIA A100 80G GPUs for QG and QQP, and the training steps should be increased accordingly with the size of the training set. To reproduce the results of Table 1 in our paper, we suggest the following configuration for each dataset when training.
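As a hypothetical illustration of how the arguments above fit together (not the suggested per-dataset configuration), a single training run might look like the sketch below. The values are placeholders, and the path to train.py should be adjusted to match the repository layout.
# Hypothetical training invocation, wiring up the documented arguments.
# Argument values are placeholders, not the reproduction settings for Table 1.
python -u train.py \
  --dataset QQP \
  --data_dir datasets/QQP \
  --seq_len 128 \
  --vocab bert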
Update:
Additional arguments:
--learned_mean_embed: set whether to use the learned soft absorbing state.
--denoise: set whether to add discrete noise
--use_fp16: set whether to use mixed precision training
--denoise_rate: set the denoise rate, with 0.5 as the default
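A sketch of enabling these options on top of the training command, assuming the same train.py entry point as above; whether each flag is a boolean switch or expects an explicit value depends on the argument parser, so the values below are an assumption.
# Hypothetical: append the v2 options to the training invocation sketched earlier.
python -u train.py \
  --dataset QQP \
  --data_dir datasets/QQP \
  --seq_len 128 \
  --vocab bert \
  --learned_mean_embed True \
  --denoise True \
  --denoise_rate 0.5 \
  --use_fp16 True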
It only takes around 11 hours to train a model on 2 NVIDIA A100 80G GPUs for QQP.
Empirically, a larger batch size (a larger microbatch here) can achieve a higher BLEU score (without MBR). If you want to sync the training loss to wandb, please customize your wandb settings in train.py (add your own API key).
DiffuSeq Decoding
You need to modify the path to model_dir, which is obtained in the training stage.
cd scripts
bash run_decode.sh
To reproduce the results of Table 1 in our paper, we suggest setting the size of the MBR candidate set to 10 (run decoding 10 times with different random seeds; see the sketch below). Empirically, a larger candidate set achieves a higher BLEU score. For diversity metrics, the MBR candidate set size is 3.
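A hypothetical loop for producing such a candidate set; how the random seed is actually passed to run_decode.sh depends on that script, so the environment variable below is only a placeholder.
# Decode 10 times to build the MBR candidate set. SEED is a placeholder for
# whatever mechanism run_decode.sh uses to vary the random seed.
for seed in $(seq 101 110); do
  SEED=$seed bash run_decode.sh
done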
Speed-up Decoding
We customize the implementation of DPM-Solver++ for DiffuSeq to accelerate its sampling.
cd scripts
bash run_decode_solver.sh
Evaluation & MBR
You need to specify the folder of decoded texts. This folder should contain the decoded files from the same model, sampled with different random seeds. If --mbr is not passed, we compute the diversity score from the files in the folder; otherwise we perform MBR decoding:
cd scripts
python eval_seq2seq.py --folder ../{your-path-to-outputs} --mbr
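Without the --mbr flag, the same command reports the diversity score over the files in the folder instead:
python eval_seq2seq.py --folder ../{your-path-to-outputs}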
Note: if you want to use this evaluation script for output files from other models, please make sure that the same line across these output files refers to the same piece of data; otherwise the diversity score could be incorrect.
Update
Update 10 Oct 2023: We release DiffuSeq-v2, targeting training/sampling speed-up. Details are in the new branch diffuseq-v2.
Update 22 May 2023: We release the checkpoints and sampling results for the remaining tasks in this link.
Update 14 Feb 2023: We update the evaluation scripts and the camera-ready version of the paper.
Update 28 Nov 2022: We release the checkpoint and sampling results of 10 seeds for the QQP dataset in this link.
Feel free to open a discussion if you have any questions.
Citation
Please cite our paper if it or the code helps you.
@inproceedings{gong2022diffuseq,
author = {Gong, Shansan and Li, Mukai and Feng, Jiangtao and Wu, Zhiyong and Kong, Lingpeng},
booktitle = {International Conference on Learning Representations, ICLR},
title = {{DiffuSeq}: Sequence to Sequence Text Generation with Diffusion Models},
year = 2023
}
@article{gong2023diffuseqv2,
title={DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models},
author={Gong, Shansan and Li, Mukai and Feng, Jiangtao and Wu, Zhiyong and Kong, Lingpeng},
journal={arXiv preprint arXiv:2310.05793},
year={2023}
}
DiffuSeq poster for ICLR 2023.