System 1 vs. System 2: Bringing Deliberate Thinking to AI

Figure 1: Gemini 2.5 Pro 17-04-2025 – A visually striking image representing the ‘Tree of Thought’

Author’s Note: This article summarizes research from “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” by Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan from Princeton University and Google DeepMind. The original paper was presented at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Introduction

Large language models like GPT-4 have become remarkably capable at solving a wide range of problems, but they still face a fundamental limitation: they generate text token by token, making decisions linearly without the ability to explore multiple paths or backtrack when needed. Humans, by contrast, often think deeply about a hard problem, explore different options, and reconsider their approach when they hit a dead end.

A team of researchers from Princeton University and Google DeepMind has proposed a fascinating solution to this constraint called “Tree of Thoughts” (ToT). This approach allows language models to explore multiple reasoning paths simultaneously, evaluate different options, and even backtrack when necessary – much like how humans solve complex problems requiring deliberate thought.

In this article, we’ll explore:

How Tree of Thoughts builds upon and extends existing prompting techniques
The technical framework that makes this possible
Real-world examples showing significant performance improvements
Practical considerations for when and how to use ToT

Whether you’re a developer, researcher, or just curious about the evolving capabilities of AI, understanding Tree of Thoughts offers valuable insights into making language models more effective problem solvers.

From Chain-of-Thought to Tree-of-Thought

To understand why Tree of Thoughts represents an important advancement, let’s first review how we currently prompt LLMs to solve problems:

Figure 2: Schematic illustrating various approaches to problem solving with LLMs. Each rectangle box represents a thought, which is a coherent language sequence that serves as an intermediate step toward problem solving. Adapted by author from Yao et al. (2023).

Standard Input-Output Prompting

The most basic approach is simply asking a language model to solve a problem directly:

Human: Use four numbers (4, 9, 10, 13) and basic operations to reach exactly 24.

AI: (13-9)*(10-4) = 4*6 = 24

This works for simple problems but often fails when complex reasoning is required, because the model must generate the entire solution in a single pass without exploring alternatives.

Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting, introduced in 2022, improved this by asking models to show their work:

Human: Use four numbers (4, 9, 10, 13) and basic operations to reach exactly 24.

AI: Let me solve this step by step:

1. First, I can calculate 13-9 = 4

2. Next, 10-4 = 6

3. Finally, 4*6 = 24

Therefore, (13-9)*(10-4) = 24

This approach dramatically improved performance on reasoning tasks by giving the model space to work through intermediate steps. However, it still operates linearly – the model commits to each step without considering alternatives or backtracking when stuck.

Self-Consistency with CoT

An extension of CoT is to sample multiple thought chains and take the most frequent answer. This adds diversity but still lacks deliberate exploration and evaluation of promising paths.
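The majority vote behind self-consistency can be sketched in a few lines. This is a minimal illustration, not the original method's code: `sample_answer` is a hypothetical callable standing in for one full CoT run that returns only the final answer, and `n_samples` is an arbitrary default.

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """Run several independent chains and return (most common answer, vote share).

    `sample_answer` is a hypothetical zero-argument callable that performs one
    CoT run and returns its final answer as a string.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples

# Usage with canned answers in place of real model calls
canned = iter(["24", "24", "23", "24", "25"])
print(self_consistency(lambda: next(canned), n_samples=5))
```

Note that the vote only looks at final answers. Nothing in this loop inspects or prunes the intermediate steps, which is the gap ToT addresses.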

Tree of Thoughts: A New Paradigm

Tree of Thoughts expands these approaches by structuring problem-solving as an explicit search over a tree:

Figure 3: Schematic illustrating Tree of Thought prompting. Each rectangle box represents a thought, which is a coherent language sequence that serves as an intermediate step toward problem solving. Adapted by author from Yao et al. (2023).

Tree of Thoughts takes problem-solving to a new level by enabling language models to:

Generate multiple possible next steps at each point in the reasoning process
Evaluate how promising each option is using the language model itself
Choose which paths to explore further based on those evaluations
Backtrack and try different paths when a particular approach isn’t working

This approach mirrors how humans solve complex problems—we consider options, evaluate progress, change course when needed, and sometimes return to earlier decision points to try different approaches.

The paper’s authors draw inspiration from human cognition research, particularly the distinction between “System 1” (fast, automatic) and “System 2” (slow, deliberate) thinking. Standard LLM generation resembles System 1, while ToT implements a System 2-like process of exploring a combinatorial problem space.

How Tree of Thoughts Works

Figure 4: Demonstration of how Tree of Thought works

1. Thought Decomposition

First, we need to decide how to break down the reasoning process into meaningful “thoughts.” Unlike CoT, which doesn’t explicitly define the granularity of thoughts, ToT carefully chooses appropriate units for each problem:

For Game of 24, a thought is a single equation step (e.g., “13-9=4”)
For Creative Writing, a thought might be a paragraph-length writing plan
For Mini Crosswords, a thought could be filling in a specific word

The key is finding the right size – thoughts should be small enough that the model can generate diverse, high-quality candidates, but large enough that the model can meaningfully evaluate their progress toward solving the problem.
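To make the idea of a thought as a unit concrete, here is a small sketch for Game of 24, where a state is simply the list of numbers still available and a thought is one equation that consumes two of them. The function name `apply_thought` and the exact text format are assumptions made for illustration.

```python
from fractions import Fraction
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def apply_thought(remaining, a, op, b):
    """Apply the thought 'a op b' to a state and return (thought_text, new_state).

    Raises ValueError if a or b is not among the remaining numbers, and
    ZeroDivisionError on division by zero.
    """
    pool = [Fraction(x) for x in remaining]
    pool.remove(Fraction(a))
    pool.remove(Fraction(b))
    result = OPS[op](Fraction(a), Fraction(b))  # exact arithmetic, no float error
    new_state = sorted(pool + [result])
    left = ", ".join(str(x) for x in new_state)
    return f"{a}{op}{b}={result} (left: {left})", new_state

text, state = apply_thought([4, 9, 10, 13], 13, "-", 9)
print(text)  # 13-9=4 (left: 4, 4, 10)
```

Each thought is small enough to generate many alternatives, and the remaining numbers give an evaluator something specific to judge.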

2. Thought Generation

For each state in the reasoning process, ToT generates multiple candidate next thoughts. This can happen in two ways:

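# Option 1: sample k thoughts independently
# (model, prompt, previous_thoughts, and k are placeholders)
thoughts = [model.generate(prompt + previous_thoughts) for _ in range(k)]

# Option 2: propose k thoughts in a single call, one per line, then split them
proposal = model.generate(prompt + previous_thoughts + f"\nList {k} different next steps, one per line:")
thoughts = [line.strip() for line in proposal.splitlines() if line.strip()]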

This creates branches in our reasoning tree, allowing exploration of different approaches simultaneously.

3. State Evaluation

This is perhaps the most innovative aspect of ToT. Instead of requiring pre-programmed heuristics to evaluate progress, ToT uses the language model itself to assess which paths are promising:

# Value-based evaluation: score each state on its own.
# `model`, `convert_to_score`, and `extract_selected_state` are placeholders.
def evaluate_state(state):
    evaluation_prompt = f"Evaluate if this approach can solve the problem: {state}"
    assessment = model.generate(evaluation_prompt)
    return convert_to_score(assessment)  # e.g., "Very promising" → 0.9

# Voting-based evaluation: compare states against each other and pick one.
def vote_best_state(states):
    comparison_prompt = f"Compare these approaches and select the most promising: {states}"
    selection = model.generate(comparison_prompt)
    return extract_selected_state(selection)

This allows the model to deliberate about its own reasoning process, asking questions like “Can these numbers reach 24?” or “Is this writing plan coherent?”

For Game of 24, the evaluator might determine:

"13-9=4 (left: 4, 4, 10)" → "Sure, we can reach 24 with 4, 4, 10"
"10-4=6 (left: 6, 9, 13)" → "Sure, we can reach 24 with 6, 9, 13"
"4+9=13 (left: 10, 13, 13)" → "Impossible to reach 24 with these numbers"

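One possible way to turn such verdicts into numbers is sketched below. The weights in `VALUE_MAP` are illustrative assumptions, as is the rule of reading only the first word of each reply. Summing over several sampled replies is one simple way to smooth out a noisy evaluator.

```python
# Assumed weights: a confident "sure" should dominate, "impossible" should nearly vanish.
VALUE_MAP = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}

def convert_to_score(assessment):
    """Map one evaluator reply to a number based on its leading verdict word."""
    words = assessment.strip().split()
    if not words:
        return 0.0
    verdict = words[0].strip(",.:;!").lower()
    return VALUE_MAP.get(verdict, 0.0)  # unknown verdicts contribute nothing

def state_value(assessments):
    """Combine several sampled evaluator replies for the same state."""
    return sum(convert_to_score(a) for a in assessments)

replies = {
    "13-9=4 (left: 4, 4, 10)": ["Sure, we can reach 24 with 4, 4, 10"],
    "4+9=13 (left: 10, 13, 13)": ["Impossible to reach 24 with these numbers"],
}
ranked = sorted(replies, key=lambda s: state_value(replies[s]), reverse=True)
print(ranked[0])  # the state whose replies scored highest
```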
4. Search Algorithm

Finally, ToT uses classical search algorithms to navigate the tree:

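Breadth-First Search (BFS) keeps a fixed number of the most promising states at each level and expands only those before moving deeper
Depth-First Search (DFS) follows the most promising path until it reaches a solution or the evaluator judges the state a dead end, then backtracks to the parent state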

These algorithms incorporate lookahead and backtracking, allowing the model to make more global decisions rather than committing to the first path it generates.
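To show what backtracking looks like in isolation, the sketch below runs a depth-first search over Game of 24 states. It deliberately swaps the language model for exhaustive enumeration: every arithmetic move is tried in a fixed order, and a branch counts as a dead end only once all of its children fail. The function name `solve_24` is an assumption for this example.

```python
from fractions import Fraction
from itertools import combinations

def solve_24(numbers, target=24):
    """Return a list of equation steps that reach `target`, or None if none exists."""
    def dfs(pool, steps):
        if len(pool) == 1:
            return steps if pool[0] == target else None
        for i, j in combinations(range(len(pool)), 2):
            a, b = pool[i], pool[j]
            rest = [x for k, x in enumerate(pool) if k not in (i, j)]
            moves = [(a, "+", b, a + b), (a, "*", b, a * b),
                     (a, "-", b, a - b), (b, "-", a, b - a)]
            if b != 0:
                moves.append((a, "/", b, a / b))
            if a != 0:
                moves.append((b, "/", a, b / a))
            for x, op, y, result in moves:
                found = dfs(rest + [result], steps + [f"{x}{op}{y}={result}"])
                if found is not None:
                    return found
        return None  # dead end: control returns to the caller, which tries its next move

    return dfs([Fraction(n) for n in numbers], [])

print(solve_24([4, 9, 10, 13]))
print(solve_24([1, 1, 1, 1]))  # None: four ones cannot make 24
```

With the input (4, 9, 10, 13), the first branch this search tries is 4+9=13, which leaves (10, 13, 13) and cannot reach 24, so the search backs out and moves on. ToT replaces the exhaustive part with model-generated candidates and a model-based evaluator, which lets it abandon such branches early instead of exploring them fully.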

Here’s a simplified implementation of ToT with BFS:

def tree_of_thoughts_bfs(problem, model, breadth=5, max_steps=3):
    # generate_thoughts is a placeholder for either generation strategy from step 2;
    # evaluate_state is the value-based evaluator from step 3.
    states = [problem]  # the frontier starts with just the problem statement
    for step in range(max_steps):
        # Expand every state in the frontier into candidate next thoughts
        candidates = [c for state in states
                      for c in generate_thoughts(state, model, k=breadth)]
        if not candidates:
            break  # nothing left to expand
        # Score each candidate once, then keep the best `breadth` across the
        # whole level (pruning per parent state would never discard anything)
        scored = sorted(((evaluate_state(c), c) for c in candidates),
                        key=lambda pair: pair[0], reverse=True)
        states = [c for _, c in scored[:breadth]]
    # The frontier is sorted by score, so its first entry is the best final state
    return states[0]

This framework is remarkably flexible, allowing different instantiations for different problem types.

Real-World Applications and Results

The paper demonstrates the effectiveness of Tree of Thoughts on three challenging tasks that require different types of reasoning and planning. Let’s examine each one:

Game of 24: Mathematical Reasoning

The Game of 24 challenges you to use four numbers and basic arithmetic operations to reach exactly 24. For example, with (4, 9, 10, 13), one solution is (13-9)*(10-4) = 24.

This task requires mathematical reasoning and exploring different combinations of operations. The traditional left-to-right generation of language models struggles here because early mistakes cascade—if your first step leads to a dead end, you can’t recover.

Results:

Figure 5: Success rate of different prompting techniques in Game of 24

GPT-4 with standard prompting: 7.3% success rate
GPT-4 with Chain-of-Thought: 4.0% success rate
GPT-4 with Tree of Thoughts: 74% success rate

The dramatic improvement occurs because ToT can evaluate intermediate equations, recognize dead ends, and backtrack to try different approaches. For example, after generating “13-9=4”, it might also consider “4+9=13” and compare which path is more promising.

Creative Writing: Open-Ended Generation

This task involved creating a coherent four-paragraph passage where each paragraph had to end with a specified random sentence. This challenges the model’s ability to plan ahead and maintain coherence.

Results:

GPT-4 coherency scores: IO (6.19), CoT (6.93), ToT (7.56)
Human preference study: Across 100 passage pairs, human raters preferred the ToT output in 41 and the CoT output in 21, and judged the remaining 38 similarly coherent

In this task, ToT first generates multiple writing plans, evaluates which one is most promising, then generates multiple full passages based on the selected plan, and finally selects the best one. This two-level planning approach mimics how human writers often work—planning before writing, and revising when necessary.
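The two-level procedure can be sketched as follows. All four callables (`propose_plan`, `write_passage`, `ask_vote`, and the `vote` helper) are hypothetical stand-ins for model calls, and the defaults for `k` and `n_votes` are assumptions rather than prescribed settings.

```python
from collections import Counter

def vote(candidates, ask_vote, n_votes=5):
    """Ask for a preferred candidate index several times and return the winner.

    `ask_vote(candidates)` is a hypothetical model call returning an index.
    """
    tally = Counter(ask_vote(candidates) for _ in range(n_votes))
    best_index, _ = tally.most_common(1)[0]
    return candidates[best_index]

def plan_then_write(task, propose_plan, write_passage, ask_vote, k=5, n_votes=5):
    """Level 1: pick the best of k plans. Level 2: pick the best of k passages."""
    plans = [propose_plan(task) for _ in range(k)]
    best_plan = vote(plans, ask_vote, n_votes)
    passages = [write_passage(task, best_plan) for _ in range(k)]
    return vote(passages, ask_vote, n_votes)

# Usage with stubs: the "voter" simply prefers the longest candidate
stub_plans = iter(["short", "a much longer plan", "mid plan"])
prefer_longest = lambda cands: max(range(len(cands)), key=lambda i: len(cands[i]))
print(plan_then_write("task", lambda t: next(stub_plans),
                      lambda t, plan: plan.upper(), prefer_longest, k=3, n_votes=2))
```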

Mini Crosswords: Constrained Search

The crossword puzzle task represents a classic constraint satisfaction problem. The model needs to fill a 5×5 grid where words must satisfy both horizontal and vertical clues.

Results:

IO/CoT methods: under 16% word-level success rate, rarely solving entire puzzles
ToT: 60% word-level success rate, solving 4 of 20 puzzles (20%) completely

For crosswords, ToT uses depth-first search with backtracking—filling in words one by one, evaluating if the current state makes remaining words impossible to fill, and backtracking when necessary. This is almost exactly how humans solve crosswords: trying a word, seeing if it conflicts with other constraints, and changing it if necessary.
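The "does this word conflict with what is already there" check is the mechanical core of that loop. The sketch below covers only that letter-level check on a 5×5 grid; it is an assumed simplification, since judging whether the remaining clues can still be filled is the part ToT delegates to the language model. The helper names and the `_` marker for empty cells are made up for this example.

```python
EMPTY = "_"

def line_cells(grid, index, across):
    """Return the cells of row `index` (across) or column `index` (down)."""
    return grid[index] if across else [row[index] for row in grid]

def fits(grid, index, word, across=True):
    """True if `word` has the right length and contradicts no filled letter."""
    cells = line_cells(grid, index, across)
    return len(word) == len(cells) and all(
        cell in (EMPTY, letter) for cell, letter in zip(cells, word))

def place(grid, index, word, across=True):
    """Return a new grid with `word` written in; the old grid is left untouched,
    which makes backtracking as simple as discarding the new grid."""
    new_grid = [list(row) for row in grid]
    for k, letter in enumerate(word):
        if across:
            new_grid[index][k] = letter
        else:
            new_grid[k][index] = letter
    return new_grid

grid = [[EMPTY] * 5 for _ in range(5)]
grid = place(grid, 0, "apple", across=True)
print(fits(grid, 0, "angle", across=False))  # True: column 0 starts with "a"
print(fits(grid, 1, "angle", across=False))  # False: column 1 starts with "p"
```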

What Makes Tree of Thoughts Special?

What’s fascinating about ToT is how it implements human-like problem-solving strategies with language models:

It mirrors human deliberate reasoning: The ability to consider multiple options, plan ahead, and revise approaches when needed reflects how humans tackle complex problems.
The model evaluates itself: Rather than requiring external evaluation mechanisms, ToT uses the language model to assess its own progress.
It combines neural and symbolic approaches: ToT bridges neural network-based language models with classical symbolic AI search algorithms.
It’s adaptable to different problem types: By adjusting the thought decomposition, generation strategy, and search algorithm, ToT can be adapted to various reasoning tasks.

Putting Tree of Thoughts to the Test

Game of 24: A Test Case with Claude 3.7 Sonnet

This test used the Game of 24 with the numbers (4, 5, 6, 10). The goal is to use all four numbers exactly once with basic arithmetic operations to reach exactly 24.

What Standard Approaches Produced

Interestingly, with both standard input-output prompting and Chain-of-Thought prompting, the model quickly found a correct solution.

IO prompting: “Use the numbers 4, 5, 6, 10 to obtain exactly 24 using addition, subtraction, multiplication, and division. You must use each number exactly once.”

CoT prompting: “Use the numbers 4, 5, 6, and 10 to obtain exactly 24 using addition, subtraction, multiplication, and division. You must use each number exactly once. Please solve this step-by-step, showing your reasoning as you explore different combinations. After each step, note which numbers remain and continue until you find a solution that equals 24.”

Tree of Thoughts Implementation

A simplified version of ToT was implemented:

Generated multiple potential operations at each step
- Prompt: “Given the numbers [4, 5, 6, 10], your goal is to reach exactly 24. Propose THREE different next operations I could perform (using only +, -, *, /). For each operation, show the calculation and the remaining numbers.”
Evaluated how promising each path was
- Prompt: “On a scale from 1-10, how promising is each approach to reach exactly 24? Give your rating and a brief explanation.”
Explored the most promising paths first
- Prompt: “Choose the most promising path and give 3 options for the next step, and evaluate each option.”
Backtracked when necessary
- Prompt: “Go back to explore other initial approaches or combinations.”

The Tree of Thoughts approach likewise reached a correct answer. However, it took significantly longer, requiring multiple steps of generating options, evaluating them, and following the most promising paths.

The Reality Check

This highlights an important observation: as language models continue to improve, simpler approaches like standard prompting and Chain-of-Thought are becoming increasingly effective. The latest models can often solve problems correctly without needing the extensive deliberation process that ToT provides.

This raises an interesting point about the evolution of problem-solving methods. The paper was published at a time when models might have benefited more from structured exploration. As models advance, the performance gap between different prompting techniques may narrow for many tasks.

Honest Assessment

The testing revealed that ToT shines in specific scenarios:

Problems with multiple valid paths that need exploration
Situations where early decisions can lead to dead ends requiring backtracking
Tasks where systematic evaluation of options is beneficial

However, ToT isn’t a silver bullet:

It uses significantly more computational resources
For problems with straightforward solutions, it may be unnecessarily complex
As models improve, simpler approaches are often sufficient
Its effectiveness depends heavily on well-designed prompts and evaluation mechanisms

Real-World Takeaways

Choose the right problems: ToT is most valuable for complex reasoning tasks that benefit from exploration and backtracking.
Balance cost and benefit: The increased computational cost might be justified for critical problems where accuracy is paramount.
Careful prompt engineering matters: The quality of generated thoughts and evaluations depends heavily on well-crafted prompts.
Transparent tracking is essential: Keeping careful track of state, especially for math problems, is crucial for reliable results.
Evaluation can be inconsistent: Sometimes the model would overestimate dead-end paths or underestimate good ones. Averaging multiple evaluations might help.

For developers working with LLMs, ToT represents another useful tool in the toolkit—not a replacement for other approaches, but a valuable complement for the right types of reasoning challenges.

The key insight is knowing when to deploy which approach. For simple problems with clear paths to solutions, standard prompting or Chain-of-Thought may be more efficient. Reserve ToT for complex reasoning tasks where the ability to explore multiple paths and backtrack is truly needed—especially for particularly challenging problems that even advanced models struggle with using simpler approaches.

Limitations and Future Directions

While the results are impressive, there are some important considerations:

Computational cost: ToT requires significantly more computation than standard prompting, running many more model calls to explore different paths.
Problem-specific design: Each application requires carefully designing the thought structure and evaluation prompts.
Not always necessary: For simpler tasks where standard prompting already works well, the additional complexity of ToT may not be justified.
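A rough way to reason about the cost is to count model calls. The sketch below is a back-of-the-envelope upper bound under stated assumptions: a BFS-style search, one call per evaluation sample, no caching or deduplication of states, and parameter values chosen only for illustration.

```python
def tot_bfs_calls(steps, breadth, candidates_per_state, evals_per_candidate,
                  propose_in_one_call=True):
    """Upper bound on model calls for a BFS-style ToT run (assumed cost model)."""
    total = 0
    frontier = 1  # the search starts from the problem statement alone
    for _ in range(steps):
        # Proposing yields all candidates in one call; sampling needs one call each
        gen_calls = frontier * (1 if propose_in_one_call else candidates_per_state)
        n_candidates = frontier * candidates_per_state
        total += gen_calls + n_candidates * evals_per_candidate
        frontier = min(breadth, n_candidates)  # prune before the next level
    return total

print(tot_bfs_calls(steps=3, breadth=5, candidates_per_state=5,
                    evals_per_candidate=3))  # 176
print(tot_bfs_calls(steps=3, breadth=5, candidates_per_state=5,
                    evals_per_candidate=3, propose_in_one_call=False))  # 220
```

Under these assumptions a three-step search costs on the order of a couple of hundred calls, compared with a single call for standard prompting, and most of that budget goes to evaluation rather than generation.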

The Tree of Thoughts approach also opens several exciting research directions:

1. Advanced Search Algorithms

The current paper explores only basic BFS and DFS algorithms. Future work could integrate more sophisticated approaches:

Monte Carlo Tree Search (like AlphaGo)
A* search with more sophisticated heuristics
Genetic algorithms for evolving multiple solution candidates

2. Hybrid Approaches

Combining ToT with other techniques could be powerful:

Retrieval-augmented ToT for knowledge-intensive tasks
Tool-using agents that incorporate ToT for planning
Multi-agent ToT where different LMs specialize in generation vs. evaluation

3. Training Improvements

Instead of using ToT only at inference time, future models could be trained to:

Generate diverse, high-quality candidate thoughts
Accurately evaluate the promise of different reasoning paths
Learn domain-specific search heuristics

4. Application to New Domains

ToT could be extended to new problem domains:

Programming (exploring different implementation approaches)
Scientific discovery (generating and testing hypotheses)
Strategic planning (business strategy, game playing)
Dialogue (planning conversational strategies)

Conclusion: A New Paradigm for Problem-Solving with LLMs

As language models continue to advance, frameworks like Tree of Thoughts will be crucial for unlocking their full problem-solving potential. By augmenting the “System 1” capabilities of LLMs with “System 2” deliberate reasoning, we’re moving closer to AI systems that can tackle the kinds of complex problems that have traditionally required human thought.

Key Takeaways

Beyond linear thinking: ToT breaks free from the left-to-right generation constraints of traditional LLM prompting, allowing models to explore multiple possibilities and backtrack when needed.
LMs as their own critics: Rather than requiring external evaluation, ToT leverages the language model’s own capabilities to judge which reasoning paths are most promising.
Flexibility and adaptability: The framework can be adapted to diverse problem types by changing the thought granularity, evaluation criteria, and search algorithms.
Computational tradeoff: While ToT requires more computation than standard prompting, it often achieves better results with fewer total tokens than brute-force approaches.

What’s Next?

If you’re interested in implementing Tree of Thoughts for your own applications:

Start with a well-defined problem where standard prompting struggles
Design appropriate thought units for your domain
Create evaluation prompts that reliably assess progress
Experiment with different search strategies

The authors have released code to help get started, and the approach is flexible enough to adapt to many different use cases.


References

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 11809-11822.

About the Author

Yanran Luo

Intern at Research Graph Foundation

