CasePilot: AI-Powered Medical Coding Assistant
Built an AI-assisted medical coding system that combines retrieval, tool orchestration, streaming, and evaluation to help coding professionals work faster with better documentation and compliance support.
- Client
- Coding Ahead
- Industry
- Healthcare Technology
- Duration
- 8 months
- Technologies
Laravel · OpenAI API · Vector Search · Real-time Streaming · MySQL · Redis · Event Broadcasting
Executive Summary
CasePilot is a Laravel-based AI product built for medical coding workflows.
The goal was not to create a generic chatbot. The goal was to help healthcare professionals answer complex coding questions faster by combining multiple knowledge sources, specialized tools, and a workflow that could stand up to real operational use.
The system uses 20+ purpose-built tools, vector search over medical knowledge bases, real-time streaming, and an evaluation loop for quality control. It was designed to support coders with expert-level guidance while respecting the realities of compliance-heavy work.
What changed
- complex coding questions could be worked through in minutes instead of long manual research cycles
- the system could chain the right tools for validation, policy lookup, modifier guidance, and documentation support
- users could see progress in real time instead of waiting on a black-box response
- the product included rate limiting, conversation controls, and evaluation infrastructure for sustainable rollout
The Problem
Medical coding is a good example of where AI needs to behave like a system, not a demo.
The work is complex, high-stakes, and deeply context-dependent.
Professionals need to reason across:
- large code sets such as CPT and ICD-10
- modifier rules and code combinations
- documentation requirements
- changing compliance and reimbursement guidance
- multiple authoritative sources that do not all live in one place
That creates a painful workflow.
A coder or healthcare professional may spend 15 to 30 minutes researching a single complex case, moving between guidance sources, checking combinations, and trying to produce an answer that is both accurate and defensible.
So the problem was not just “answer a question with AI.” It was:
Can we build a product that assembles the right context, uses the right specialized tools, and produces a response strong enough to improve a real coding workflow?
Why This Was a Good AI Workflow
This product fit several conditions I look for in strong AI work:
- the workflow already existed
- the pain was obvious
- the value of faster, better guidance was high
- multiple tools and knowledge sources were required
- quality could be evaluated and improved over time
In other words, this was not a vague “add AI” project. It was a workflow system.
What the System Needed to Do
CasePilot needed to do more than answer in natural language.
It needed to:
- retrieve relevant coding information quickly
- validate code combinations
- pull in compliance guidance where relevant
- surface documentation requirements
- help users understand why a recommendation was being made
- stay responsive enough to feel usable in a real product environment
That meant the product architecture had to include more than a model call.
Product Architecture
The system was built as an AI-assisted workflow with several layers working together:
1. Model layer
The model handled language reasoning and response generation, but within a tightly scoped setup.
class CasePilotAgent
{
    public static function make(): PrismBuilder
    {
        return Prism::text()
            ->using(Provider::OpenAI, config('prism.providers.openai.default_model'))
            ->withSystemPrompt(static::getExpertPrompt())
            ->withTools(static::getSpecializedTools())
            ->withMaxSteps(6)
            ->usingTemperature(0.2);
    }
}
Notable choices here:
- low temperature for more stable responses
- explicit tool access
- bounded step count to control behavior and cost
2. Context and retrieval layer
The product used vector search across domain-specific medical knowledge bases so the system could pull relevant guidance quickly instead of relying on the model to invent expertise from memory.
That is a big part of why the system could produce stronger answers in a regulated domain.
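To make the retrieval step concrete, here is a minimal sketch of similarity ranking over pre-embedded knowledge chunks. The `Embeddings` helper, the `KnowledgeChunk` model, and its `embedding` column are illustrative assumptions, not the production schema:

```php
class KnowledgeRetriever
{
    // Return the top-$limit knowledge chunks most similar to the query.
    public function search(string $query, int $limit = 5)
    {
        // Embed the query with the same model used to index the corpus
        // (Embeddings is a hypothetical wrapper around the embeddings API).
        $vector = Embeddings::for($query);

        return KnowledgeChunk::all()
            ->sortByDesc(fn ($chunk) => $this->cosine($vector, $chunk->embedding))
            ->take($limit);
    }

    // Standard cosine similarity between two equal-length float vectors.
    private function cosine(array $a, array $b): float
    {
        $dot = $na = $nb = 0.0;
        foreach ($a as $i => $v) {
            $dot += $v * $b[$i];
            $na  += $v * $v;
            $nb  += $b[$i] * $b[$i];
        }
        return $dot / (sqrt($na) * sqrt($nb) ?: 1.0);
    }
}
```

In production the similarity ranking would be pushed down into the vector store rather than sorted in PHP; the in-memory version above only shows the shape of the operation.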
3. Tool orchestration layer
This is where the system became much more useful than a plain chat interface.
CasePilot had 20+ specialized tools for tasks such as:
- CPT search
- ICD-10 search
- HCPCS search
- dental code search
- code-combination validation
- NCCI policy lookup
- OIG compliance lookup
- documentation gap analysis
- modifier suggestions
Instead of hoping the model would figure everything out in one shot, the system could call the right resources in sequence.
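As a sketch of what one of those tools can look like using Prism's tool API (the tool name, parameter, and `CptCode` model are illustrative assumptions, not the production implementation):

```php
use Prism\Prism\Facades\Tool;

$cptSearch = Tool::as('cpt_search')
    ->for('Search CPT codes by procedure description or partial code')
    ->withStringParameter('query', 'The procedure description to search for')
    ->using(function (string $query): string {
        // CptCode is a hypothetical Eloquent model over the CPT code set.
        return CptCode::search($query)
            ->take(5)
            ->map(fn ($code) => "{$code->code}: {$code->description}")
            ->implode("\n");
    });
```

Each tool returns plain text the model can reason over, which keeps tool outputs inspectable and easy to log.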
4. Workflow logic layer
Some question types required multiple steps in a deliberate order.
For example, a code validation workflow might:
- validate a code combination
- pull NCCI policy details if conflicts appear
- retrieve modifier details
- gather documentation requirements
That sequencing is part of what made the output feel more trustworthy and complete.
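The sequence above can be sketched as a single service method (the method and collaborator names here are illustrative, not the production API):

```php
public function validateCodes(array $codes): array
{
    // Step 1: check the code combination itself.
    $result = $this->validator->check($codes);

    // Step 2: only pull NCCI policy details when a conflict actually appears.
    if ($result->hasConflicts()) {
        $result->policies = $this->ncciPolicies->lookup($result->conflicts());
    }

    // Steps 3 and 4: modifier guidance and documentation requirements.
    $result->modifiers     = $this->modifiers->suggestFor($codes);
    $result->documentation = $this->documentation->requirementsFor($codes);

    return $result->toArray();
}
```

Making the conditional step explicit is what avoids wasted policy lookups on clean combinations.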
5. UX and streaming layer
A big part of product quality here was response experience.
The system streamed progress and tool activity in real time so users were not left staring at a blank interface during complex processing.
public function createChatStream(Conversation $conversation): void
{
    $stream = CasePilotAgent::make()
        ->withMessages($conversation->history_without_tools)
        ->asStream();

    // asStream() yields chunks as the model produces them,
    // so we iterate rather than waiting for a complete response.
    foreach ($stream as $chunk) {
        // Push each token to the client as it arrives.
        event(new ChatTokenReceived($conversation->id, $chunk->text));

        // Announce tool calls so the UI can show multi-step work in progress.
        foreach ($chunk->toolCalls ?? [] as $toolCall) {
            event(new ToolCallStarted($conversation->id, $toolCall->name));
        }
    }
}
That meant users could see the system working, not just wait for a final answer.
6. Evaluation and control layer
The build also included evaluation logic, conversation limits, rate limits, and logging.
That matters because useful AI products need evidence loops, not just initial launch energy.
class AgentEvaluator
{
    public function evaluateAgent(string $agentType, Collection $questions): array
    {
        return $questions
            ->chunk(10) // evaluate in batches of ten questions
            ->flatMap(fn ($chunk) => $this->processQuestionBatch($agentType, $chunk))
            ->pipe(fn ($results) => $this->calculateScore($results->toArray()));
    }
}
What Made the Product Useful
Several decisions helped this feel like a real product system instead of a novelty feature.
Tool orchestration over raw generation
The product did not rely on the model alone. It used specialized tools to ground the output and create stronger multi-step reasoning.
Retrieval over vague model confidence
Vector search helped the system work from relevant medical knowledge rather than generic language patterns.
Streaming over dead air
Real-time feedback improved user trust and perceived responsiveness.
Evaluation over wishful thinking
Quality assurance was part of the system, not an afterthought.
Operational controls over unlimited usage
Conversation caps, rate limits, and infrastructure controls helped the system behave sustainably in production.
Business and Workflow Impact
CasePilot aimed to improve both speed and quality in a domain where both matter.
Workflow impact
- reduced coding research time for complex cases
- created a stronger first pass for users working through difficult questions
- improved consistency by consulting the same classes of sources in a repeatable way
- supported documentation and compliance reasoning alongside code lookup
Product impact
- made expert-level assistance more accessible
- created a clearer experience with streaming and tool transparency
- established a product architecture that could keep improving through evaluation
Why This Matters as a Case Study
I include this project because it reflects the kind of AI work I think is worth doing.
This was not about claiming autonomy. It was about building a system around a real workflow.
It had:
- a clear user problem
- domain-specific context
- tool orchestration
- product UX concerns
- operational controls
- evaluation infrastructure
That combination is much closer to what useful AI implementation actually looks like in production.
Technical Challenges Solved
Real-time AI streaming in a product environment
The system needed to keep users engaged during multi-step processing. Event-driven streaming solved that.
Complex tool sequencing
Different query types required different tool paths. The workflow logic had to be deliberate enough to avoid wasted calls and incomplete answers.
Large knowledge base access
Fast retrieval across large, evolving medical sources was a core requirement, not a nice-to-have.
Stateful conversation handling
The product had to preserve useful context across turns while still managing tool results correctly.
Quality assurance at scale
The evaluation system made it possible to keep improving response quality over time.
Final Thought
CasePilot is a good example of how I think about agent-enabled products.
A useful AI feature is rarely just a prompt. It is usually a small system with:
- context assembly
- specialized tools
- workflow logic
- review or transparency mechanisms
- logging and evaluation
- operational constraints that keep it usable beyond the demo
That is the standard I try to build to.