
Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models


🤗 Dataset

Official repo for Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models.

We propose training a Q-value model to guide action selection for LLM agents at each decision-making step. Our method comprises a training stage and an inference stage. During training, we first use Monte Carlo Tree Search (MCTS) to explore high-quality trajectories, annotating the action taken at each step with a Q-value. We then construct step-level preference data and train the Q-value model with step-level Direct Preference Optimization (DPO). During inference, the trained Q-value model guides action selection at each decision-making step.
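
For intuition, the sketch below shows how step-level Q-values estimated by MCTS can be turned into preference pairs. This is an illustrative Python sketch, not the repo's code; `Node`, `build_preference_pairs`, and the tree layout are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One decision step in the MCTS tree (hypothetical structure)."""
    state: str                      # observation/history at this step
    action: str                     # action that led to this node
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)

    @property
    def q_value(self) -> float:
        # Monte Carlo estimate: mean return over all rollouts through this node
        return self.total_reward / self.visits if self.visits else 0.0

def build_preference_pairs(root: Node, margin: float = 0.1) -> list[dict]:
    """Walk the tree; at each step, pair the best-Q action against worse siblings."""
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        kids = [c for c in node.children if c.visits > 0]
        if len(kids) >= 2:
            kids.sort(key=lambda c: c.q_value, reverse=True)
            chosen = kids[0]
            for rejected in kids[1:]:
                if chosen.q_value - rejected.q_value >= margin:
                    pairs.append({
                        "state": node.state,
                        "chosen": chosen.action,
                        "rejected": rejected.action,
                    })
        stack.extend(node.children)
    return pairs
```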

Our method has the following features:

| Approach | Step Level | Applicable to API-based LLMs | Single Trial | Task Experience Accumulation |
|------------------------------------------|:----------:|:----------------------------:|:------------:|:----------------------------:|
| Prompt Strategies: Reflection, Reflexion | ❌ | ✅ | ✅ or ❌ | ❌ |
| Tree Search: LATS, Search-agent | ✅ | ✅ | ❌ | ❌ |
| Fine-tuning: Agent-FLAN, AgentEvol, ETO | ❌ | ❌ | ✅ | ✅ |
| Q-value model enhanced (Ours) | ✅ | ✅ | ✅ | ✅ |

🛠️ Environments Setup

WebShop

1. Move to the WebShop directory:

   ```bash
   cd LLM-Agents-with-Q/webshop
   ```

2. Install WebShop from source and run an environment instance locally, following the instructions in the [WebShop repo](https://github.com/princeton-nlp/WebShop).

3. Install the module dependencies into your environment:

   ```bash
   pip install -r requirements.txt
   ```

HotPotQA

1. Clone the repo and move to the HotPotQA directory:

   ```bash
   git clone https://github.com/andyz245/LLM-Agents-with-Q && cd LLM-Agents-with-Q/hotpot
   ```

2. Install the module dependencies into your environment:

   ```bash
   pip install -r requirements.txt
   ```

🎎 Multi-type Agent Support

API-based LLM agents

Set the `OPENAI_API_KEY` environment variable to your OpenAI API key:

```bash
export OPENAI_API_KEY=<your key>
```

Open-source LLM agents

For open-source LLM agents, we adopt the OpenAI-compatible APIs provided by FastChat.

1. Move to the fastchat directory:

   ```bash
   cd fastchat
   ```

2. Launch the FastChat controller:

   ```bash
   python -m fastchat.serve.controller
   ```

3. Launch the FastChat model workers:

   ```bash
   bash start_multiple_vllm_server_from0_Phi3.sh
   bash start_multiple_vllm_server_from0_Llama31.sh
   ```

   To stop the model workers afterwards:

   ```bash
   kill -9 $(cat logs/llama31-collect-MCTS-worker_pid.txt)
   ```
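
Once the controller and workers are up, the agent can talk to the models through the standard OpenAI client. A minimal sketch, assuming the worker scripts expose an OpenAI-compatible endpoint on `localhost:8000` and a model named `Phi-3-mini-4k-instruct` (adjust both to your setup):

```python
from openai import OpenAI

# Endpoint and model name are assumptions; match them to your FastChat/vLLM setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Phi-3-mini-4k-instruct",
    messages=[{"role": "user", "content": "i need a red sofa under $300"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```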

🚀 Training

We use the HotPotQA task as an example; the same procedure transfers directly to the WebShop task.

1. Collect trajectories with MCTS:

   ```bash
   cd hotpot
   bash scripts/data_collection/collect_trajectories-phi3.sh
   ```

   To stop the exploration processes afterwards:

   ```bash
   kill -9 $(cat logs/llama31-exploration-eval_pid.txt)
   ```

2. Construct step-level preference data:

   ```bash
   cd hotpot
   python hotpot/scripts/construct_preference_data.py
   ```

3. Train the Q-value models (a sketch of the step-level DPO objective follows this list):

   ```bash
   bash run-hotpot-Q-value-model.sh
   ```
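
For reference, here is a minimal sketch of the step-level DPO objective: standard DPO applied per decision step rather than per trajectory. This is illustrative, not the repo's implementation; the log-probabilities are assumed to be precomputed for each (state, chosen action, rejected action) pair.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(a_w | s) for each pair
    policy_rejected_logps: torch.Tensor,  # log pi_theta(a_l | s)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(a_w | s), frozen reference
    ref_rejected_logps: torch.Tensor,     # log pi_ref(a_l | s)
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO on step-level pairs: push the model to prefer the higher-Q action."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```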

🎮 Inference

Finally, evaluate the agent:

```bash
cd hotpot
bash scripts/eval/eval-Phi-3_with-Q-epoch1.sh
```
  • `--algorithm`: `simple` refers to Greedy Decision-making, while `beam` refers to Guiding Action Selection with the Q-value model.
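
Conceptually, the `beam` mode selects actions as in the sketch below: sample several candidate actions from the LLM agent, score each with the trained Q-value model, and act greedily with respect to Q. `propose_actions` and `q_model.score` are hypothetical stand-ins for the agent's actual interfaces.

```python
def select_action(state, q_model, propose_actions, k=5):
    """Pick the candidate action the Q-value model scores highest."""
    candidates = propose_actions(state, n=k)                    # k sampled actions
    scores = [q_model.score(state, action) for action in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```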