How OpenEvidence Built a Healthcare AI That Physicians Trust
Source: Vercel Blog
The Highest Bar There Is
Healthcare is the ultimate stress test for AI applications. Physicians don't tolerate hallucinations—a made-up drug interaction or fabricated clinical study could directly harm patients. Response time matters because decisions happen in real time during consultations. And reliability isn't a nice-to-have; it's a requirement that, if unmet, means the tool simply won't be used.
OpenEvidence set out to build an AI assistant that meets this bar: fast enough for clinical workflows, accurate enough for medical decision-making, and trustworthy enough that physicians actually rely on it.
The Trust Problem
The core challenge isn't technical—it's human. Physicians are trained to be skeptical of information sources. They want to see the evidence, trace the reasoning, and verify claims independently. An AI that says "take this drug" with no supporting evidence is worse than useless—it's dangerous.
OpenEvidence's fundamental design decision was to make citations first-class citizens, not afterthoughts. Every response is grounded in specific medical literature. The system doesn't generate medical knowledge—it retrieves, synthesizes, and presents existing peer-reviewed research. The physician sees not just an answer but the evidence trail behind it.
This is a design philosophy that extends beyond healthcare. Any AI application operating in high-stakes domains needs to answer the question: "Why should I trust this output?" If the answer is "because the AI said so," you've already lost.
Citation-First Design
The citation-first approach shapes every layer of the architecture:
Retrieval Pipeline: Before generating any response, the system retrieves relevant medical literature from curated databases. The retrieval step isn't an optimization—it's the foundation. No relevant literature, no response. This prevents the model from generating plausible-sounding but unsupported claims.
Inline Citations: Every factual claim in a response is linked to its source. Physicians can click through to the original paper, verify the context, and assess the quality of the evidence themselves. This transforms the AI from an oracle into a research assistant.
Confidence Signaling: When the evidence is limited or conflicting, the system explicitly says so. Acknowledging uncertainty is as important as providing answers—physicians need to know when to dig deeper or exercise additional caution.
Recency Awareness: Medical knowledge evolves rapidly. The system tracks the publication dates of its sources and flags when recommendations are based on older literature that may have been superseded.
Technical Architecture
OpenEvidence's stack reflects the dual demands of startup velocity and production reliability:
Frontend (Next.js on Vercel): The user interface needs to be fast, responsive, and accessible in clinical environments—which means everything from hospital desktops to mobile devices on rounds. Next.js provides the server-side rendering performance and the component model for complex citation displays. Vercel's edge network ensures low latency regardless of the physician's location.
Custom LLM Pipeline: The medical reasoning layer isn't a generic chatbot. It's a purpose-built pipeline that orchestrates retrieval, synthesis, and response generation with medical-domain-specific logic. The pipeline enforces citation requirements at the architecture level—it's not possible to generate an uncited claim.
Evidence Database: A continuously updated corpus of medical literature, clinical guidelines, and peer-reviewed research. The quality of this database directly determines the quality of the system's output—garbage in, garbage out applies with particular force in medicine.
Reliability Engineering
For a tool that physicians rely on during patient care, downtime isn't an inconvenience—it's a patient safety issue.
Redundancy: Every critical path has fallback mechanisms. If the primary LLM endpoint is slow or unavailable, the system routes to backup providers. If the evidence database is temporarily unreachable, cached results serve as a bridge.
Continuous Monitoring: Response quality is monitored in real time, not just availability. If the system starts producing responses with fewer citations or lower-quality sources, alerts fire before physicians notice degraded output.
Graceful Degradation: When components fail, the system reduces capability rather than producing unreliable output. A slow evidence database leads to a "still searching" indicator, not a hallucinated response. This is the opposite of how most AI applications handle failures—where degradation often means the guardrails come off.
Load Testing for Reality: Clinical usage patterns are spiky—rounds happen at predictable times, conference presentations drive traffic surges, and new clinical guidelines trigger waves of queries. The infrastructure is tested against these patterns, not just steady-state load.
Scaling Across Medicine
Medicine isn't one domain—it's hundreds of specialties, each with its own literature, terminology, and decision-making patterns. Scaling OpenEvidence means not just handling more users, but handling more kinds of questions across more specialties.
Each specialty brings unique challenges: oncology decisions involve complex multi-drug interactions, cardiology requires understanding of imaging data alongside literature, and pediatric dosing involves weight-based calculations that demand precision. The system needs to handle all of these without becoming a jack-of-all-trades that's unreliable in any specific domain.
The solution involves specialty-specific evidence curation, domain-aware retrieval that understands medical context, and continuous validation with specialist physicians who can assess output quality in their field.
Lessons for Regulated Industries
OpenEvidence's experience offers patterns applicable far beyond healthcare:
Trust is designed, not declared: You can't tell users to trust your AI. You have to build trust into every interaction through transparency, citations, and uncertainty acknowledgment.
Reliability is a feature, not an SLA: In high-stakes domains, the difference between 99.9% and 99.99% uptime isn't a decimal point—it's the difference between a tool people use and a tool people abandon.
Speed and rigor coexist: Startup velocity doesn't require cutting corners on quality. The right architecture—in OpenEvidence's case, Next.js on Vercel with a purpose-built AI pipeline—enables both rapid iteration and production reliability.
Domain expertise is non-negotiable: Building AI for medicine requires deep collaboration with physicians. Building AI for finance requires working with financial professionals. The AI is a tool; domain experts define how it should work.
Saying "I don't know" is a feature: The most dangerous AI is one that always has an answer. Building systems that acknowledge the limits of their knowledge is essential for any high-stakes application.
The bar OpenEvidence has set—evidence-grounded, citation-first, reliability-engineered—should be the standard for AI applications in any domain where the output matters.
