Architecting Multi-Step AI Workflows That Actually Ship
A practical guide to ReAct, Chain-of-Thought, and Tree-of-Thoughts for teams building real systems
Most AI agents in production look impressive in LinkedIn posts. Then they try to execute a three-step workflow and fall apart like a cheap suit.
I've spent the past year watching teams build agents that could dazzle in demos but choked when asked to do anything more sophisticated than answering isolated questions. The problem isn't the models. It's that we're architecting these systems like we're still living in 2022, when a single LLM call was the height of sophistication.
The gap between proof-of-concept and production-ready isn't about finding better models. It's about understanding when your agent needs to think step-by-step, when it needs to react instantly, and when it needs to explore multiple paths before committing. The frameworks exist. Most teams just can't tell when to use which one.
The Problem With Single-Shot Thinking
Your typical AI agent takes a prompt, generates a response, and calls it a day. For simple tasks, that's fine. Ask it to summarize an email and you'll get something workable. Ask it to coordinate a project across three departments with conflicting priorities and you'll get creative fiction masquerading as a plan.
Single-shot agents can't maintain state across decisions. They can't backtrack when they realize they've gone down a dead end. They certainly can't explore multiple solution paths and pick the best one. They're fundamentally reactive systems dressed up as intelligent assistants.
This is the dirty secret of most "AI automation" platforms. They're running glorified if-then statements with an LLM wedged in. When conditions get complex, they break. When edge cases appear, they hallucinate. When mistakes compound, they keep going because they have no way to recognize they're off track.
When Linear Thinking Works
Chain-of-Thought prompting was the first serious attempt to make LLMs show their work. Instead of jumping straight to an answer, the model generates intermediate reasoning steps. Think of it as forcing the agent to talk through its logic before committing.
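The mechanics on the prompt side are simple enough to sketch in a few lines of Python. The wrapper phrasing and the "Answer:" convention below are illustrative choices, not part of any official API, and the parsing helper assumes the model actually follows that convention:

```python
import re

def build_cot_prompt(question: str) -> str:
    """Wrap a question in a Chain-of-Thought instruction.

    The wording follows the zero-shot CoT pattern ("think step by
    step"); the exact phrasing is an illustrative choice.
    """
    return (
        f"Question: {question}\n"
        "Let's think step by step, writing out each intermediate "
        "step before giving the final answer.\n"
        "End with a line of the form: Answer: <value>"
    )

def extract_answer(completion: str):
    """Pull the final 'Answer: ...' line out of a CoT completion,
    or return None if the model ignored the convention."""
    m = re.search(r"Answer:\s*(.+)", completion)
    return m.group(1).strip() if m else None

prompt = build_cot_prompt("A train travels 60 km in 45 minutes. What is its speed in km/h?")
final = extract_answer("45 min is 0.75 h. 60 / 0.75 = 80.\nAnswer: 80 km/h")
```

The point of the extraction step is that you are now paying for reasoning tokens you mostly throw away; only the last line feeds downstream logic.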
The original research from Jason Wei and colleagues at Google Research showed impressive results. On the GSM8K (Grade School Math 8K) math benchmark, a 540-billion parameter model using Chain-of-Thought achieved state-of-the-art accuracy, surpassing even fine-tuned GPT-3 with a verifier. That's not a small improvement. That's a complete shift in capability.
But here's what the benchmarks won't tell you: Chain-of-Thought only really kicks in at scale. Models with fewer than 100 billion parameters produced reasoning chains that seemed coherent but were actually wrong, leading to worse performance than standard prompting. If you're running smaller models, adding Chain-of-Thought might actually hurt you.
And there's another catch that teams discover too late. Recent research from Wharton's Generative AI Labs tested Chain-of-Thought prompting on modern reasoning models and found minimal benefits. For models like o3-mini and o4-mini, the average improvements were only 2.9% and 3.1% respectively. These models already reason step-by-step internally. Asking them to do it explicitly is like asking a marathon runner to count their steps.
When to Use Chain-of-Thought
Use Chain-of-Thought when your task needs transparent reasoning but follows a mostly linear path. It works for math problems, logical deduction, and cases where you need to audit the agent's thinking. Multi-step calculations. Sequential analysis. Anything where B genuinely follows A.
Don't use it for tasks where the path branches, where backtracking matters, or where you're running bleeding-edge reasoning models that already do this work behind the scenes. And definitely don't use it just because someone told you it's "best practice." The Wharton study found that for many modern models, the modest gains must be weighed against longer response times and occasional drops in accuracy caused by the added variability.
One more thing: Chain-of-Thought burns tokens. You're asking the model to generate all that intermediate reasoning, and you're paying for every word. If you're running thousands of queries a day, the cost difference between a terse answer and a Chain-of-Thought response adds up fast.
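A back-of-the-envelope calculation makes the point. Every number below is a placeholder assumption, not real pricing or a measured token count; plug in your own:

```python
# Placeholder assumptions -- substitute your provider's actual pricing
# and your own measured token counts.
PRICE_PER_1K_OUTPUT_TOKENS = 0.01   # dollars, assumed
terse_tokens = 50                   # short direct answer, assumed
cot_tokens = 400                    # answer plus reasoning chain, assumed
queries_per_day = 5_000

# Extra daily spend attributable to the reasoning tokens alone.
daily_delta = (
    (cot_tokens - terse_tokens) / 1000
    * PRICE_PER_1K_OUTPUT_TOKENS
    * queries_per_day
)
print(f"${daily_delta:.2f} extra per day")  # $17.50 under these assumptions
```

Under these made-up numbers that's roughly $17.50 a day, or over $500 a month, purely for intermediate reasoning text nobody reads.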
When Agents Need to Touch the World
Chain-of-Thought keeps everything in the model's head. ReAct, short for "Reasoning and Acting," lets the agent actually do something about its reasoning. It alternates between thinking and taking actions, using external tools to gather information or execute tasks.
The ReAct framework from Shunyu Yao and colleagues at Princeton and Google Research showed that on interactive decision-making benchmarks, ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively.
Here's how it works in practice. The agent encounters a question it can't answer from knowledge alone. It reasons about what information it needs. It calls a tool to get that information. It incorporates the results into its reasoning. It repeats until it has an answer.
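That loop can be sketched in a few dozen lines. This is a minimal illustration, not a production framework: `fake_model` is a deterministic stand-in for an LLM that has already been parsed into ("thought" | "action" | "final") turns, and the single `lookup_order` tool is invented for the example:

```python
def run_react(question, tools, model, max_steps=8):
    """Minimal ReAct loop: reason, act, observe, repeat.

    `model` takes the transcript so far and returns one parsed turn:
    ("thought", text), ("action", tool_name, arg), or ("final", answer).
    """
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        turn = model(transcript)            # model decides the next move
        if turn[0] == "final":
            return turn[1], transcript
        if turn[0] == "action":
            _, name, arg = turn
            obs = tools[name](arg)          # execute the tool
            transcript.append(f"Action: {name}({arg!r})")
            transcript.append(f"Observation: {obs}")
        else:
            transcript.append(f"Thought: {turn[1]}")
    return None, transcript                 # step budget exhausted

# Deterministic stand-ins so the loop runs end to end without an LLM.
def fake_model(transcript):
    if not any(line.startswith("Observation") for line in transcript):
        return ("action", "lookup_order", "A-1001")
    return ("final", "Order A-1001 shipped on March 3.")

tools = {"lookup_order": lambda oid: f"{oid}: shipped March 3"}
answer, trace = run_react("Where is order A-1001?", tools, fake_model)
```

Notice that the transcript is the whole state: every thought, action, and observation gets appended, which is exactly what makes the reasoning trace auditable after the fact.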
This is huge for production systems because most real work involves touching external systems. You can't schedule a meeting by reasoning about calendars. You can't generate a financial report without pulling actual data. You can't debug code without running it. ReAct gives agents the ability to interact with the world while maintaining a reasoning trace you can actually follow.
ReAct in Production
The catch with ReAct is that it requires infrastructure. You need tools the agent can call. You need a way to handle failures when those tools return garbage. You need guardrails so the agent doesn’t go off on a forty-step tangent when two steps would suffice.
But when you get it right, ReAct is transformative. I’ve seen customer service agents that can look up order history, check inventory, and process refunds without human intervention. Data analysis agents that can query databases, run statistical tests, and generate visualizations. DevOps agents that can read logs, identify issues, and even apply fixes.
The key is understanding that ReAct isn’t just about tool use. It’s about maintaining context across actions. Each step informs the next. The agent isn’t just executing a script. It’s adapting based on what it learns.
One warning: without proper constraints, ReAct agents can spiral. They’ll make a tool call, get a result that suggests another tool call, then another, until you’re fifteen steps deep and have no idea how you got there. You need mechanisms to detect loops and dead ends. You need a way to force the agent to commit to an answer instead of searching forever.
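Even a crude heuristic helps here. One cheap check, sketched below, is flagging an agent that issues the same tool call several times in a row; real systems layer this with step budgets and cost caps. The window size is an arbitrary assumption:

```python
def detect_loop(actions, window=3):
    """Flag an agent that repeated the identical tool call `window`
    times in a row -- a cheap heuristic for runaway ReAct spirals.
    Window size is an illustrative assumption; tune it per workload.
    """
    if len(actions) < window:
        return False
    return len(set(actions[-window:])) == 1

looping = detect_loop(["search('refund policy')"] * 3)
healthy = detect_loop(["search('refund policy')", "fetch_order('A-1001')", "issue_refund('A-1001')"])
```

When the check fires, the right move is usually to force the agent into a "commit to your best answer now" prompt rather than killing the run outright.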
When You Need to Explore
Most problems don’t have one obvious solution path. They have multiple approaches, each with different trade-offs. Chain-of-Thought picks one path and commits. Tree-of-Thoughts evaluates several paths before deciding.
The Tree-of-Thoughts framework from Shunyu Yao and colleagues at Princeton and Google DeepMind showed dramatic improvements over traditional approaches. In the Game of 24 task, while GPT-4 with Chain-of-Thought prompting only solved 4% of tasks, Tree-of-Thoughts achieved a success rate of 74%. That’s not a marginal gain. That’s the difference between unusable and production-ready.
Tree-of-Thoughts works by maintaining multiple reasoning paths simultaneously. The agent generates several potential next steps, evaluates them, picks the most promising ones, and continues from there. If a path hits a dead end, the agent backtracks and tries another branch.
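The generate-evaluate-prune loop is essentially beam search. In the sketch below, `propose` and `score` stand in for the LLM calls that generate candidate next steps and judge partial solutions; the string-building toy problem exists only so the search is runnable end to end:

```python
def tree_of_thoughts(root, propose, score, is_solution, beam_width=2, depth=3):
    """Beam-search sketch of the Tree-of-Thoughts loop.

    Each round: expand every state on the frontier, return a solution
    if one appears, otherwise keep only the top `beam_width` branches.
    Pruning low-scoring branches is what plays the role of backtracking.
    """
    frontier = [root]
    for _ in range(depth):
        candidates = [c for state in frontier for c in propose(state)]
        solved = [c for c in candidates if is_solution(c)]
        if solved:
            return max(solved, key=score)
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return None  # search budget exhausted without a solution

# Toy problem: build the string "abc" one character at a time.
target = "abc"
result = tree_of_thoughts(
    "",
    propose=lambda s: [s + ch for ch in "abcx"],          # candidate next steps
    score=lambda s: sum(1 for a, b in zip(s, target) if a == b),
    is_solution=lambda s: s == target,
)
```

In a real system each `propose` and `score` call is a model invocation, which is why the cost scales so badly: a beam of 5 at depth 4 is already dozens of LLM calls for one answer.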
This is computationally expensive. You’re essentially asking the model to explore a search space, and search spaces grow exponentially. But for problems where the initial decision matters, where wrong turns are costly, and where you need to find good solutions rather than just acceptable ones, the cost pays off.
When Tree-of-Thoughts Makes Sense
Use Tree-of-Thoughts for problems that require strategic lookahead. Planning complex workflows. Optimizing resource allocation. Anything where you need to consider multiple options before committing.
Don’t use it for simple queries or real-time systems. The computational overhead kills you. A customer asking for their order status doesn’t need the agent to explore five different ways to look it up. They need an answer now.
Tree-of-Thoughts is also harder to implement than Chain-of-Thought or ReAct. You need search algorithms. You need evaluation functions that can judge partial solutions. You need infrastructure to manage and prune branches. Most teams aren’t ready for this level of complexity unless they’re solving problems where the alternative is manual planning by expensive humans.
Picking Your Architecture
The real question isn’t which framework is best. It’s which architecture matches your problem.
Reactive agents map inputs directly to outputs. They’re fast, predictable, and work great for well-defined tasks. Think chatbots handling common questions, or systems routing tickets to the right department. The environment is stable. The rules are clear. You don’t need planning, just quick responses.
Planning agents build internal models and think multiple steps ahead. They’re essential when decisions have long-term consequences, when the environment is complex, or when you need to coordinate multiple actions. Think autonomous systems managing warehouse operations or agents planning marketing campaigns across channels.
Hybrid agents layer both approaches. Fast reactive responses for common cases. Deeper planning for complex scenarios. This is what most production systems actually need, but it’s also the hardest to build right. You need clear rules for when to react and when to think. You need coordination mechanisms so the layers don’t fight each other.
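The skeleton of that layering can be as simple as a router in front of two handlers. Everything here is an illustrative stub: the intent names, the set-membership routing rule, and both handlers are assumptions, and real systems usually route on a classifier score plus escalation rules rather than a hardcoded set:

```python
# Hypothetical intent names; a real system would use a trained classifier.
FAST_INTENTS = {"order_status", "reset_password"}

def route(intent, payload, reactive, planner):
    """Send common, well-defined intents to the fast reactive handler
    and everything else to the slower planning layer."""
    handler = reactive if intent in FAST_INTENTS else planner
    return handler(payload)

reply = route(
    "order_status",
    {"order_id": "A-1001"},
    reactive=lambda p: f"Order {p['order_id']} is in transit.",
    planner=lambda p: "Escalating to multi-step planning...",
)
```

The hard part isn't this dispatch; it's agreeing on the boundary. Every intent you move into the fast set is a bet that it never needs planning, and the failure mode is a reactive handler confidently answering a question it should have escalated.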
Gartner predicts that by 2027, more than 40% of agentic AI projects will be canceled as costs escalate, business value stays fuzzy, and risk controls lag. That’s not because the technology doesn’t work. It’s because teams pick architectures that don’t match their problems.
The Decision Matrix
Start simple. If your task can be solved reactively, don’t add planning overhead. A reactive agent that works beats a planning agent that’s still in development.
Add Chain-of-Thought when you need transparency and the task follows a mostly linear path. You want to see the agent’s reasoning. You need to validate each step. The extra tokens are worth the auditability.
Bring in ReAct when your agent needs to interact with external systems. If the answer depends on real-time data or if the agent needs to execute actions, you need tool use. Don’t try to fake this with prompt engineering.
Consider Tree-of-Thoughts only when wrong decisions are expensive and you have the computational budget to explore alternatives. This is not your starting point. This is where you go when simpler approaches have hit their limits.
Build hybrid architectures when you need both speed and depth. Customer service that handles simple queries instantly but escalates complex issues to deeper reasoning. DevOps agents that react to alerts but plan remediation strategies. The coordination overhead is real, but so is the value of getting both capabilities.
What Actually Ships
The agents that make it to production have a few things in common. They start simple and add complexity only when necessary. They have clear boundaries around what they can and can’t do. They fail gracefully instead of hallucinating their way through uncertainty.
They also have humans in the loop at key decision points. Not because the AI can’t handle it, but because production systems need mechanisms to catch mistakes before they compound. The best agent architectures make it easy for humans to review, override, and learn from edge cases.
Most importantly, shipping agents have measurable success criteria. Not vague notions of “helpfulness.” Concrete metrics tied to business outcomes. Percentage of tickets resolved without escalation. Reduction in manual data entry hours. Increase in workflow completion rate.
If you can’t measure it, you can’t improve it. If you can’t improve it, it won’t ship.
Choose your architecture based on the problem you’re solving, not the paper you read last week. Start with the simplest thing that could work. Add complexity when simpler approaches fail and you understand why. Test against real workflows, not toy examples.
And remember: the point isn’t to build impressive AI. It’s to build AI that actually does the job.