A client came to us with a straightforward ask: build an intake chatbot that could read uploaded PDFs, extract key fields, and route the conversation based on what it found. We had built similar things before. What surprised us was how much of the logic we handed directly to Claude rather than coding it ourselves, and how rarely that decision came back to bite us.
That project pushed us to properly evaluate where Claude and Claude Code belong in a production stack, and where they do not.
If you have only used Claude Code as a tab-completion tool inside your editor, you are using about 20% of it. The more interesting workflow is giving it a scoped problem with real context: a file structure, a clear goal, and a constraint or two.
We routinely hand it tasks like “refactor this webhook handler to use async/await and add error logging that matches our existing pattern in utils/logger.ts.” It reads the existing file, infers the logging convention, and produces something we can actually commit without a full rewrite. That is not magic; it is good pattern matching on the context you feed it.
What it does less well: anything that requires understanding state across a long-running system it cannot see. If your architecture has side effects spread across microservices, Claude Code will make locally correct decisions that break something three steps downstream. You still need a human who holds the full mental model.
Most chatbot projects we take on spend the first phase building intent classification logic. Someone writes a long list of regex patterns, or trains a small classifier, or builds a decision tree. It works until the user says something slightly unexpected, and then it falls apart.
With Claude as the reasoning layer, you replace much of that brittle logic with a well-structured prompt and a defined output schema. A 30-person logistics company we worked with had built a customer support bot on a rules engine that needed updates every time they added a new shipping carrier. We rebuilt the classification layer with Claude and gave it a product context document. It now handles carrier names it has never seen by reasoning from the surrounding context, and the maintenance burden dropped considerably.
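To make "a defined output schema" concrete, here is a minimal sketch that assumes the model is asked to reply with a single JSON object. The intent labels, field names, and the `parse_classification` helper are hypothetical illustrations, not taken from the client project. The idea is only that a reply gets validated before anything downstream trusts it.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical intent labels; a real project would define its own.
ALLOWED_INTENTS = {"track_shipment", "report_damage", "billing", "other"}


@dataclass(frozen=True)
class Classification:
    intent: str
    carrier: Optional[str]  # free text, so unseen carrier names pass through
    confidence: float


def parse_classification(raw_reply: str) -> Classification:
    """Validate a model reply against the expected schema, or raise ValueError."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"unexpected intent: {intent!r}")

    carrier = data.get("carrier")
    if carrier is not None and not isinstance(carrier, str):
        raise ValueError("carrier must be a string or null")

    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    return Classification(intent=intent, carrier=carrier, confidence=float(confidence))
```

Keeping the intent as a closed set and the carrier as free text is one possible split: the routing logic stays predictable while new carrier names need no code change.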
The trade-off is latency and cost. Rules engines are fast and cheap. Claude is neither. For high-volume, low-complexity classification, a fine-tuned smaller model or even keyword matching will often outperform Claude on the cost curve. Know what you are optimizing for.
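One way to act on that trade-off is to try a cheap path first and pay for a model call only when it finds nothing. The sketch below assumes a hypothetical keyword table and takes the model call as a plain function argument (`llm_classify`), so it makes no claims about any particular SDK.

```python
from typing import Callable, Optional, Tuple

# Hypothetical keyword table for the high-volume, low-complexity cases.
KEYWORD_INTENTS = {
    "tracking number": "track_shipment",
    "where is my": "track_shipment",
    "invoice": "billing",
}


def classify_cheaply(message: str) -> Optional[str]:
    """Return an intent if a known phrase appears in the message, otherwise None."""
    text = message.lower()
    for phrase, intent in KEYWORD_INTENTS.items():
        if phrase in text:
            return intent
    return None


def classify(message: str, llm_classify: Callable[[str], str]) -> Tuple[str, str]:
    """Try the keyword path first; fall back to the model only when it finds nothing.

    Returns (intent, route) so the share of traffic on each path can be measured.
    """
    intent = classify_cheaply(message)
    if intent is not None:
        return intent, "keywords"
    return llm_classify(message), "llm"
```

Returning the route alongside the intent makes it possible to see how much traffic actually reaches the expensive path before deciding what to optimize.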
Building with Claude does not mean only using Anthropic’s products. Here is what a typical automation project looks like for us right now:
We have moved away from trying to build “agents” that run long autonomous loops in production. They are interesting in demos. In practice, a well-scoped chain of discrete Claude calls with human checkpoints at the right moments is more reliable and easier to debug.
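A chain of discrete calls with human checkpoints can be sketched roughly like this. The `Step` and `run_chain` names are hypothetical, and each step is an ordinary function standing in for a single model request; this is an illustration of the shape, not our production orchestration code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[str], str]     # one discrete call, e.g. a single model request
    needs_approval: bool = False  # pause for a human before continuing


def run_chain(
    steps: List[Step],
    initial_input: str,
    approve: Callable[[str, str], bool],
) -> dict:
    """Run steps in order, recording each output so failures are easy to trace.

    Stops early if a human reviewer rejects the output of a checkpointed step.
    """
    log = []
    current = initial_input
    for step in steps:
        current = step.run(current)
        log.append((step.name, current))
        if step.needs_approval and not approve(step.name, current):
            return {"status": "rejected", "stopped_at": step.name, "log": log}
    return {"status": "done", "output": current, "log": log}
```

Because every intermediate output lands in the log, a bad result can be traced to the step that produced it instead of being buried inside one long autonomous loop.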
One thing Claude Code will not save you from is writing bad prompts. We see this constantly when reviewing other teams’ implementations. Someone hands Claude a vague instruction, gets inconsistent output, and concludes the model is unreliable. Usually the model is doing exactly what it was asked.
The prompts that work in production share a few qualities. They specify a role and a constraint in the first few lines. They include examples of good output, not just a description of what good output looks like. They define what to do when the input is ambiguous, rather than hoping the model guesses right.
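Those qualities can be enforced with a small template helper. The sketch below is one possible layout, with a hypothetical `build_system_prompt` function and wording of our own choosing; the exact phrasing of a production prompt will differ.

```python
from typing import List, Tuple


def build_system_prompt(
    role: str,
    constraint: str,
    examples: List[Tuple[str, str]],
    ambiguity_rule: str,
) -> str:
    """Assemble a system prompt: role and constraint first, then examples, then a fallback."""
    if not examples:
        raise ValueError("include at least one example of good output")
    lines = [f"You are {role}.", f"Constraint: {constraint}", "", "Examples:"]
    for user_input, good_output in examples:
        lines.append(f"Input: {user_input}")
        lines.append(f"Output: {good_output}")
        lines.append("")
    lines.append(f"If the input is ambiguous: {ambiguity_rule}")
    return "\n".join(lines)
```

Refusing to build a prompt without examples is a deliberate choice here: it turns "show, don't just describe" from a guideline into something the code checks.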
System prompts for client-facing chatbots deserve the same review cycle as production code. We version them, we test changes before deploying, and we keep a changelog. The teams that treat prompt editing as a casual activity tend to have chatbots that drift in behavior after every “quick fix.”
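As a rough illustration of treating prompts like code, the sketch below attaches a version, a changelog entry, and a content fingerprint to each prompt, and blocks a release when the text changed without a version bump. The `PromptVersion` and `check_release` names are hypothetical, and this covers only the versioning and changelog part; behavioral tests against sample conversations would sit alongside it.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromptVersion:
    version: str    # human-assigned, e.g. "1.3.0"
    text: str
    changelog: str  # why this version exists

    @property
    def fingerprint(self) -> str:
        """Short content hash, useful for logging which prompt produced a reply."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def check_release(previous: PromptVersion, candidate: PromptVersion) -> List[str]:
    """Return the problems that should block deploying the candidate prompt."""
    problems = []
    if candidate.text != previous.text and candidate.version == previous.version:
        problems.append("text changed but version was not bumped")
    if not candidate.changelog.strip():
        problems.append("missing changelog entry")
    return problems
```

A check like this could run in CI next to the regular test suite, so a "quick fix" to a prompt goes through the same gate as any other change.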
Claude is a large language model. It is not a database, a calculator, or a deterministic rules engine. If your use case requires one of those things, wrap a reliable system inside a Claude interface rather than asking Claude to be that system.
Date math is a common trap. Asking Claude to determine whether a subscription is expired based on a date field sounds simple. It often is not: the model has no reliable sense of today’s date unless you supply it, and it can slip on the arithmetic. Do the comparison in code, expose that function as a tool, and feed its output to Claude for the conversational layer. Hybrid architectures beat pure LLM architectures on reliability for anything involving structured data.
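The deterministic half of that hybrid can be very small. The function below is a sketch with a hypothetical name and return shape, and it assumes a subscription stays valid through the end of its expiry date. How the result is registered as a tool depends on the SDK in use and is left out.

```python
from datetime import date
from typing import Optional


def subscription_status(expires_on: str, today: Optional[date] = None) -> dict:
    """Decide expiry in code. `expires_on` is an ISO date string (YYYY-MM-DD).

    Assumes the subscription remains valid through the end of its expiry date.
    """
    today = today or date.today()
    expiry = date.fromisoformat(expires_on)
    return {
        "expires_on": expiry.isoformat(),
        "checked_on": today.isoformat(),
        "expired": today > expiry,
        "days_remaining": max((expiry - today).days, 0),
    }
```

The model then receives a plain fact such as `"expired": true` and only has to phrase the answer, which is the part it is good at.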
Also worth noting: Claude’s context window is large, but large does not mean infinite, and filling it with undifferentiated text is not the same as giving it useful context. We typically see better results from retrieval-augmented generation that pulls in relevant chunks than from dumping an entire knowledge base into the prompt.
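The retrieval step can be illustrated with a deliberately crude scorer. The sketch below ranks chunks by word overlap with the query, which is a stand-in for the embedding search a real system would more likely use; the function names are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> set:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return up to k chunks sharing the most words with the query, best first."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(chunk)), chunk) for chunk in chunks]
    # Drop chunks with no overlap so irrelevant text never reaches the prompt.
    relevant = [(score, chunk) for score, chunk in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [chunk for _, chunk in relevant[:k]]
```

Only the returned chunks go into the prompt, so the model sees a few relevant passages rather than the whole knowledge base.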
Building with Claude and Claude Code is genuinely faster than what came before, especially for prototyping and for tasks that require reading and generating natural language. The tooling has matured enough that most production blockers are architectural decisions, not model capability gaps.
But the fundamentals have not changed. You still need to understand what the system is doing. You still need to test edge cases. You still need humans at the decision points that matter. The teams getting the most out of this tooling are not the ones treating Claude as a black box they can throw problems at. They are the ones who understand its strengths well enough to build around its weaknesses.