
Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models


🤗 Dataset

Official repo for Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models.

We propose training a Q-value model to guide action selection for LLM agents at each decision-making step. Our method comprises a training stage and an inference stage. During training, we first use Monte Carlo Tree Search (MCTS) to explore high-quality trajectories, annotating the action taken at each step with a Q-value. We then construct step-level preference data and train the Q-value model with step-level Direct Preference Optimization (DPO). During inference, the trained Q-value model guides action selection at each decision-making step.
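
For intuition, the sketch below shows how step-level Q-values estimated by MCTS can be turned into preference pairs. This is an illustrative Python sketch, not the repo's code; `Node`, `build_preference_pairs`, and the tree layout are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One decision step in the MCTS tree (hypothetical structure)."""
    state: str                      # observation/history at this step
    action: str                     # action that led to this node
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)

    @property
    def q_value(self) -> float:
        # Monte Carlo estimate: mean return over all rollouts through this node
        return self.total_reward / self.visits if self.visits else 0.0

def build_preference_pairs(root: Node, margin: float = 0.1) -> list[dict]:
    """Walk the tree; at each step, pair the best-Q action against worse siblings."""
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        kids = [c for c in node.children if c.visits > 0]
        if len(kids) >= 2:
            kids.sort(key=lambda c: c.q_value, reverse=True)
            chosen = kids[0]
            for rejected in kids[1:]:
                if chosen.q_value - rejected.q_value >= margin:
                    pairs.append({
                        "state": node.state,
                        "chosen": chosen.action,
                        "rejected": rejected.action,
                    })
        stack.extend(node.children)
    return pairs
```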

Our method has the following features:

| Approach | Step Level | Applicable to API-based LLMs | Single Trial | Task Experience Accumulation |
|------------------------------------------|:----------:|:----------------------------:|:------------:|:----------------------------:|
| Prompt Strategies: Reflection, Reflexion | ❌ | ✅ | ✅ or ❌ | ❌ |
| Tree Search: LATS, Search-agent | ✅ | ✅ | ❌ | ❌ |
| Fine-tuning: Agent-FLAN, AgentEvol, ETO | ❌ | ❌ | ✅ | ✅ |
| Q-value model enhanced (Ours) | ✅ | ✅ | ✅ | ✅ |

🛠️ Environments Setup

WebShop

1. Move to the WebShop directory:

   ```bash
   cd LLM-Agents-with-Q/webshop
   ```

2. Install WebShop from source and run an environment instance locally, following the instructions in the [WebShop repo](https://github.com/princeton-nlp/WebShop).

3. Install the module dependencies into your environment:

   ```bash
   pip install -r requirements.txt
   ```

HotPotQA

1. Clone the repo and move to the HotPotQA directory:

   ```bash
   git clone https://github.com/andyz245/LLM-Agents-with-Q && cd LLM-Agents-with-Q/hotpot
   ```

2. Install the module dependencies into your environment:

   ```bash
   pip install -r requirements.txt
   ```

🎎 Multi-type Agent Support

API-based LLM agents

Set the `OPENAI_API_KEY` environment variable to your OpenAI API key:

```bash
export OPENAI_API_KEY=<your key>
```

Open-source LLM agents

For open-source LLM agents, we adopt the OpenAI-compatible APIs provided by FastChat.

1. Move to the fastchat directory:

   ```bash
   cd fastchat
   ```

2. Launch the FastChat controller:

   ```bash
   python -m fastchat.serve.controller
   ```

3. Launch the FastChat model workers:

   ```bash
   bash start_multiple_vllm_server_from0_Phi3.sh
   bash start_multiple_vllm_server_from0_Llama31.sh
   ```

   To stop the model workers afterwards:

   ```bash
   kill -9 $(cat logs/llama31-collect-MCTS-worker_pid.txt)
   ```
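
Once the controller and workers are up, the agent can talk to the models through the standard OpenAI client. A minimal sketch, assuming the worker scripts expose an OpenAI-compatible endpoint on `localhost:8000` and a model named `Phi-3-mini-4k-instruct` (adjust both to your setup):

```python
from openai import OpenAI

# Endpoint and model name are assumptions; match them to your FastChat/vLLM setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Phi-3-mini-4k-instruct",
    messages=[{"role": "user", "content": "i need a red sofa under $300"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```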

🚀 Training

We use the HotPotQA task as an example; the same procedure transfers directly to the WebShop task.

1. Collect trajectories with MCTS:

   ```bash
   cd hotpot
   bash scripts/data_collection/collect_trajectories-phi3.sh
   ```

   To stop the exploration processes afterwards:

   ```bash
   kill -9 $(cat logs/llama31-exploration-eval_pid.txt)
   ```

2. Construct step-level preference data:

   ```bash
   cd hotpot
   python hotpot/scripts/construct_preference_data.py
   ```

3. Train the Q-value models (a sketch of the step-level DPO objective follows this list):

   ```bash
   bash run-hotpot-Q-value-model.sh
   ```
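
For reference, here is a minimal sketch of the step-level DPO objective: standard DPO applied per decision step rather than per trajectory. This is illustrative, not the repo's implementation; the log-probabilities are assumed to be precomputed for each (state, chosen action, rejected action) pair.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(a_w | s) for each pair
    policy_rejected_logps: torch.Tensor,  # log pi_theta(a_l | s)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(a_w | s), frozen reference
    ref_rejected_logps: torch.Tensor,     # log pi_ref(a_l | s)
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO on step-level pairs: push the model to prefer the higher-Q action."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```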

🎮 Inference

Finally, evaluate the agent:

```bash
cd hotpot
bash scripts/eval/eval-Phi-3_with-Q-epoch1.sh
```
  • `--algorithm`: `simple` refers to Greedy Decision-making, while `beam` refers to Guiding Action Selection with the Q-value model.
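
Conceptually, the `beam` mode selects actions as in the sketch below: sample several candidate actions from the LLM agent, score each with the trained Q-value model, and act greedily with respect to Q. `propose_actions` and `q_model.score` are hypothetical stand-ins for the agent's actual interfaces.

```python
def select_action(state, q_model, propose_actions, k=5):
    """Pick the candidate action the Q-value model scores highest."""
    candidates = propose_actions(state, n=k)                    # k sampled actions
    scores = [q_model.score(state, action) for action in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```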