Agentic Horizons

Strategist: Learning Strategy with Bi-Level Tree Search

8 min • 13 November 2024

This episode focuses on STRATEGIST, a new method that uses Large Language Models (LLMs) to learn strategic skills in multi-agent games. The core idea is to have LLMs acquire new skills through a self-improvement process, rather than relying on traditional methods like supervised learning or reinforcement learning.

• STRATEGIST aims to address the challenges of learning in adversarial environments where the optimal policy is constantly changing due to opponents' adaptive strategies.

• The method works by combining high-level strategy learning with low-level action planning. At the high level, the system constructs a "strategy tree" through an evolutionary process, refining previously learned strategies.

• This tree structure allows STRATEGIST to search and evaluate different strategies efficiently, eventually arriving at a good policy without needing parameter updates or fine-tuning.
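To make the bi-level picture concrete, here is a minimal Python sketch of how such a strategy tree could be represented and grown: strategies live in tree nodes carrying self-play statistics, a UCB-style rule picks which strategy to refine next, and each refinement becomes a child node. The `refine` and `evaluate_by_self_play` functions are placeholder assumptions standing in for the LLM rewrite and the game simulations described in the paper; this is not the authors' implementation.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class StrategyNode:
    """One node in the strategy tree: a natural-language strategy plus self-play stats."""
    text: str
    parent: "StrategyNode | None" = None
    children: list["StrategyNode"] = field(default_factory=list)
    wins: float = 0.0
    visits: int = 0

    def win_rate(self) -> float:
        return self.wins / self.visits if self.visits else 0.0


def all_nodes(root: StrategyNode) -> list[StrategyNode]:
    """Collect every strategy currently in the tree."""
    stack, out = [root], []
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out


def select_node(root: StrategyNode, c: float = 1.0) -> StrategyNode:
    """UCB-style selection: prefer strong strategies, but keep trying rarely-refined ones."""
    nodes = all_nodes(root)
    total = sum(n.visits for n in nodes) or 1

    def ucb(n: StrategyNode) -> float:
        if n.visits == 0:
            return float("inf")
        return n.win_rate() + c * math.sqrt(math.log(total) / n.visits)

    return max(nodes, key=ucb)


def refine(node: StrategyNode) -> StrategyNode:
    """Placeholder for the LLM call that rewrites a strategy into an improved variant."""
    child = StrategyNode(text=node.text + " + tweak", parent=node)
    node.children.append(child)
    return child


def evaluate_by_self_play(node: StrategyNode, games: int = 20) -> None:
    """Placeholder for simulated self-play; a real system would play out full games."""
    node.visits += games
    node.wins += sum(random.random() < 0.5 for _ in range(games))


root = StrategyNode(text="baseline strategy")
evaluate_by_self_play(root)
for _ in range(10):                      # fixed budget of improvement rounds
    candidate = refine(select_node(root))
    evaluate_by_self_play(candidate)

best = max(all_nodes(root), key=StrategyNode.win_rate)
print(best.text, f"win rate ~ {best.win_rate():.2f}")
```

Even this toy version shows why a tree helps: weak revisions end up as low-win-rate leaves that are rarely selected again, while promising branches keep being picked and refined.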

How STRATEGIST Learns:

• The learning process relies on simulated self-play to gather feedback. This involves using Monte Carlo tree search (MCTS) and LLM-based reflection to evaluate the effectiveness of different strategies.

• STRATEGIST employs a modular search method that further enhances sample efficiency. This involves two steps (sketched in code after this list):

• Reflection and Idea Generation: The LLM reflects on the self-play feedback and generates ideas for improving the current strategy. These ideas are added to an "idea queue" for later evaluation.

• Strategy Improvement: The LLM selects a strategy from the strategy tree and an improvement idea from the queue, then uses this input to generate an improved version of the strategy. The improved strategy is then evaluated through more self-play simulations.

• This modular approach allows the system to isolate the effects of specific changes and determine which improvements are truly beneficial.

• The idea queue also serves as a memory of successful improvements, which can be transferred to other strategies within the same game.
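The two-step loop above can be sketched in a few lines of Python. Every function below (`reflect_on_feedback`, `apply_idea`, `simulate_self_play`) is a hypothetical stand-in for the LLM calls and game simulations the paper uses; the sketch only shows how an idea queue separates generating improvement ideas from testing them one at a time.

```python
import random
from collections import deque


def reflect_on_feedback(strategy: str, feedback: list[str]) -> list[str]:
    """Stand-in for the LLM reflection step: turn self-play feedback into improvement ideas."""
    return [f"address: {note}" for note in feedback]


def apply_idea(strategy: str, idea: str) -> str:
    """Stand-in for the LLM improvement step that rewrites the strategy using one idea."""
    return f"{strategy} [revised to {idea}]"


def simulate_self_play(strategy: str, games: int = 20) -> tuple[float, list[str]]:
    """Stand-in for simulated self-play: returns a win-rate estimate and textual feedback.
    A real system would run MCTS-guided games rather than drawing random numbers."""
    win_rate = sum(random.random() < 0.5 for _ in range(games)) / games
    feedback = ["lost several endgames", "bluffed too rarely"]
    return win_rate, feedback


idea_queue: deque[str] = deque()          # queue / memory of candidate improvements
strategy = "baseline strategy"
best_score, feedback = simulate_self_play(strategy)

for _ in range(5):
    # Step 1: reflection and idea generation -- queue ideas for later evaluation.
    idea_queue.extend(reflect_on_feedback(strategy, feedback))

    # Step 2: strategy improvement -- apply one idea at a time so its effect is isolated.
    idea = idea_queue.popleft()
    candidate = apply_idea(strategy, idea)
    score, feedback = simulate_self_play(candidate)

    if score > best_score:                # keep only changes that self-play confirms
        strategy, best_score = candidate, score
        idea_queue.append(idea)           # successful ideas stay around for reuse

print(strategy, f"win rate ~ {best_score:.2f}")
```

Applying a single queued idea per improvement step is what lets the system attribute a change in win rate to that specific idea, and keeping confirmed ideas around gives the memory of successful improvements mentioned above.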

Key Findings:

• The experiments show that STRATEGIST outperforms several baseline LLM improvement methods, as well as traditional reinforcement learning approaches. This suggests that guided LLM improvement, informed by self-play feedback, can be highly effective for learning strategic skills.

• STRATEGIST is also more efficient in acquiring high-quality feedback compared to using an LLM-critic or relying on feedback from interactions with a fixed opponent policy. This highlights the advantage of learning to simulate opponent behavior through self-play.

Limitations:

• The authors acknowledge that individual runs of STRATEGIST can have high variance due to the inherent noise of multi-agent adversarial environments and LLM generations. However, they suggest that running more game simulations can mitigate this issue.

• The researchers also note that STRATEGIST hasn't been tested in non-adversarial environments like question answering. However, given its success in complex adversarial settings, similar performance is expected in simpler scenarios.


Conclusion: STRATEGIST represents a promising new approach to LLM skill learning that combines self-improvement with modular search and simulated self-play feedback. The method demonstrates strong performance in challenging multi-agent games, outperforming traditional reinforcement learning and other LLM improvement baselines. The authors believe STRATEGIST's success stems from its ability to (1) effectively test and isolate the impact of specific improvements and (2) explore the strategy space more efficiently to avoid local optima.


https://arxiv.org/pdf/2408.10635
