Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models
🤗 Dataset
Official repo for Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models.
We propose training a Q-value model to guide action selection for LLM agents in each decision-making step.
Our method comprises a training stage and an inference stage. During training, we first use Monte Carlo Tree Search (MCTS) to explore high-quality trajectories, annotating the actions at each step with Q-values. We then construct step-level preference data and train the Q-value model with step-level Direct Preference Optimization (DPO). During inference, the trained Q-value model guides action selection at each decision-making step.
Our method has the following features:
| Approach | Step Level | Applicable to API-based LLMs | Single Trial | Task Experience Accumulation |
|-----------------------------------------|:----------:|:----------------------------:|:------------:|:----------------------------:|
| Prompt Strategies: Reflection, Reflexion | ❌ | ✔ | ✔ or ❌ | ❌ |
| Tree Search: LATS, Search-agent | ✔ | ✔ | ❌ | ❌ |
| Fine-tuning: Agent-FLAN, AgentEvol, ETO | ❌ | ❌ | ✔ | ✔ |
| Q-value model enhanced (Ours) | ✔ | ✔ | ✔ | ✔ |
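The pipeline boils down to a small set of objects passed between the two stages. Below is a rough sketch with hypothetical names (`AnnotatedStep`, `PreferencePair`) that do not correspond to classes in this repo:

```python
from dataclasses import dataclass

@dataclass
class AnnotatedStep:
    """One decision point explored by MCTS during training."""
    state: str      # interaction history / observation at this step
    action: str     # an action explored from this state
    q_value: float  # Monte Carlo estimate of the return after taking `action`

@dataclass
class PreferencePair:
    """Step-level training example for the Q-value model (DPO format)."""
    prompt: str    # the state, rendered as the model prompt
    chosen: str    # sibling action with the higher Q-value
    rejected: str  # sibling action with the lower Q-value

# Example: two actions explored from the same state become one preference pair.
pair = PreferencePair(prompt="observation after Search[Scott Derrickson]",
                      chosen="Lookup[director]", rejected="Finish[unknown]")
print(pair)
```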
🛠️ Environments Setup
WebShop
Move to the WebShop directory:
cd LLM-Agents-with-Q/webshop
Install WebShop from source and run an environment instance locally, following the instructions at https://github.com/princeton-nlp/WebShop
Install the module dependencies into your environment:
pip install -r requirements.txt
HotPotQA
Clone the repository and move to the HotPotQA directory:
git clone https://github.com/andyz245/LLM-Agents-with-Q && cd LLM-Agents-with-Q/hotpot
Install the module dependencies into your environment:
pip install -r requirements.txt
🎎 Multi-type Agent Support
API-based LLM agents
Set the OPENAI_API_KEY environment variable to your OpenAI API key:
export OPENAI_API_KEY=<your key>
Open-source LLM agents
For open-source LLM agents, we adopt the OpenAI-compatible APIs provided by FastChat.
Move to the fastchat directory:
cd fastchat
Launch the FastChat controller:
python -m fastchat.serve.controller
Launch the FastChat model worker (run the script that matches your model):
```bash
bash start_multiple_vllm_server_from0_Phi3.sh
bash start_multiple_vllm_server_from0_Llama31.sh
```
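Once a worker is up, agent-side code can talk to it through the OpenAI-compatible endpoint. A minimal sketch, assuming the server listens on `http://localhost:8000/v1` and registers the model name used below (adjust both to match your launch scripts):

```python
from openai import OpenAI

# FastChat/vLLM OpenAI-compatible servers accept any string as the API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",  # must match the name the worker registered
    messages=[{"role": "user", "content": "Thought: I should search for the director.\nAction:"}],
    temperature=0.7,
    max_tokens=128,
)
print(response.choices[0].message.content)
```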
Stop a running model worker using its recorded PID file, e.g.:
kill -9 $(cat logs/llama31-collect-MCTS-worker_pid.txt)
🚀 Training
We use the HotPotQA task as an example; the same procedure transfers directly to the WebShop task.
- Collect trajectories with MCTS
cd hotpot
bash scripts/data_collection/collect_trajectories-phi3.sh
Stop the exploration processes when collection finishes:
kill -9 $(cat logs/llama31-exploration-eval_pid.txt)
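Conceptually, the Q-value attached to each explored action is a Monte Carlo estimate: the average return of the rollouts that passed through that (state, action) pair, updated after each simulation. A toy sketch of that backup (not the repo's actual MCTS code):

```python
from collections import defaultdict

visit_count = defaultdict(int)   # N(s, a)
q_value = defaultdict(float)     # Q(s, a), running mean of rollout returns

def backup(trajectory, final_reward):
    """After a rollout finishes, propagate its return to every (state, action) it visited."""
    for state, action in trajectory:
        key = (state, action)
        visit_count[key] += 1
        # incremental mean: Q <- Q + (G - Q) / N
        q_value[key] += (final_reward - q_value[key]) / visit_count[key]

# Two simulated rollouts sharing the first step but diverging afterwards.
backup([("q0", "Search[Scott Derrickson]"), ("q1", "Search[Ed Wood]")], final_reward=1.0)
backup([("q0", "Search[Scott Derrickson]"), ("q1", "Lookup[nationality]")], final_reward=0.0)
print(q_value[("q0", "Search[Scott Derrickson]")])  # 0.5
print(q_value[("q1", "Search[Ed Wood]")])           # 1.0
```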
- Construct step-level preference data
cd hotpot
python hotpot/scripts/construct_preference_data.py
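The preference construction step pairs sibling actions explored from the same state: the action with the higher Q-value becomes "chosen" and a lower-valued sibling becomes "rejected". A simplified sketch of that idea (the actual script may differ in filtering and thresholds; the `margin` parameter is illustrative):

```python
def build_preference_pairs(annotated_steps, margin=0.1):
    """annotated_steps: list of (state, action, q_value) tuples from MCTS exploration."""
    by_state = {}
    for state, action, q in annotated_steps:
        by_state.setdefault(state, []).append((action, q))

    pairs = []
    for state, candidates in by_state.items():
        candidates.sort(key=lambda x: x[1], reverse=True)
        (best_a, best_q), (worst_a, worst_q) = candidates[0], candidates[-1]
        # keep only pairs whose Q-values are clearly separated
        if best_q - worst_q >= margin:
            pairs.append({"prompt": state, "chosen": best_a, "rejected": worst_a})
    return pairs

steps = [("q0", "Search[Scott Derrickson]", 0.8), ("q0", "Search[Doctor Strange]", 0.3),
         ("q1", "Lookup[director]", 1.0), ("q1", "Finish[unknown]", 0.0)]
print(build_preference_pairs(steps))
```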
- Train Q-value models
bash run-hotpot-Q-value-model.sh
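Training the Q-value model with step-level DPO optimizes the standard DPO objective, except that each example is a single decision step rather than a full trajectory. A minimal sketch of the loss in PyTorch (assumed dependency), taking the per-step log-probabilities of the chosen and rejected actions under the trained model and a frozen reference model as given:

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of step-level preference pairs.

    Each tensor holds the summed log-probability of the chosen / rejected
    action tokens for one decision step, under the policy or the reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # maximize the margin between chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = step_dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                     torch.tensor([-6.0]), torch.tensor([-8.0]))
print(loss.item())
```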
🎮 Inference
Finally, evaluate the agent:
cd hotpot
bash scripts/eval/eval-Phi-3_with-Q-epoch1.sh
--algorithm: "simple" refers to greedy decision-making, while "beam" refers to guiding action selection with the Q-value model.
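In the "beam" setting, the agent samples several candidate actions at each step and lets the Q-value model pick among them, whereas "simple" just takes the greedily generated action. A toy sketch of the selection loop, with `propose_actions` and `q_score` standing in for the real LLM call and the trained Q-value model:

```python
import random

def propose_actions(state, n_candidates=5):
    """Stand-in for sampling n candidate actions from the LLM agent."""
    return [f"candidate action {i} for {state}" for i in range(n_candidates)]

def q_score(state, action):
    """Stand-in for the trained Q-value model scoring one (state, action) pair."""
    return random.random()

def select_action(state, algorithm="beam"):
    candidates = propose_actions(state)
    if algorithm == "simple":
        return candidates[0]                                   # greedy: take the first action
    return max(candidates, key=lambda a: q_score(state, a))    # beam: highest Q-value wins

print(select_action("observation at step 3", algorithm="beam"))
```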