Why I Built an AI Tax Assistant That Actually Cites Its Sources
Posted on Sat 07 February 2026 in AI Tool Orchestration
Ask ChatGPT a tax question. Go ahead. I'll wait.
Did it give you a confident answer? Good. Now try to verify it. Where did that information come from? What IRS publication? What tax year?
You can't verify it because the AI doesn't know either. It's pattern matching across the internet's collective understanding of taxes, which includes outdated blog posts, forum speculation, and content written by people who got it wrong.
This is the hallucination problem applied to something that matters: your money.
I built a different kind of system. One that only answers from authoritative sources and shows you exactly where every answer comes from.
Try the Live Demo → First load may take 30-60 seconds while the server wakes up.
The Problem With General-Purpose AI and Specialized Knowledge
Large language models are trained on the internet. The internet contains a lot of tax information. Most of it is:
- Outdated: Tax law changes yearly
- Oversimplified: "You can deduct your home office!" (Maybe. Under specific conditions. For certain taxpayer categories.)
- Contextually wrong: Advice for W-2 employees applied to 1099 contractors
- Confidently stated: The AI doesn't hedge because the training data didn't hedge
When a client or business owner asks an AI about deducting medical expenses, they don't need a general overview. They need to know that medical expenses are deductible only to the extent they exceed 7.5% of adjusted gross income, as stated in IRS Publication 502.
General AI gives you summaries. Domain-specific AI gives you citations.
The Solution: Retrieval-Augmented Generation (RAG)
RAG is the pattern that makes AI trustworthy for specialized knowledge.
The shift: Don't ask the AI what it knows. Give it your documents and ask it to read them for you.
The AI can't hallucinate tax advice because it's constrained to the source material. Click any citation to verify the answer yourself.
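The whole pattern fits in a few functions. Here's a minimal toy sketch of the retrieve-then-constrain flow — crude word-overlap scoring stands in for real embeddings, and the chunk texts and prompt wording are illustrative, not the demo's actual data or prompt:

```python
def score(question: str, chunk: str) -> float:
    """Crude relevance score: fraction of question words found in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the question."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]

def build_prompt(question: str, sources: list[str]) -> str:
    """Constrain the model: it may only answer from the numbered sources."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer ONLY from the numbered sources below. Cite each claim as [n]. "
        "If the sources do not cover the question, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )

chunks = [
    "Pub 502: medical expenses are deductible only above 7.5% of AGI.",
    "Pub 587: a home office must be used regularly and exclusively for business.",
    "Pub 526: charitable contribution limits depend on AGI and donation type.",
]
question = "What is the threshold for deducting medical expenses?"
prompt = build_prompt(question, retrieve(question, chunks))
```

Everything downstream — embeddings, Qdrant, Claude — is a production-grade replacement for one of these three toy functions.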
The Architecture: Design Decisions That Matter
Building a RAG system means making choices at every layer. Here's how I thought through each one.
View the Full Architecture Diagram
The Embedding Question: How do you convert text into something a computer can search semantically? I went with OpenAI's text-embedding-3-small. It's fast, accurate, and the API means no GPU infrastructure to manage. The tradeoff is API costs at scale, but for a demo proving the pattern, operational simplicity won.
The Vector Store Decision: Once you have embeddings, where do you store and search them? Qdrant Cloud. Managed infrastructure, generous free tier, and a clean API. Could I have used Pinecone? Sure. Postgres with pgvector for a more integrated stack? Absolutely. The choice matters less than understanding that this layer is about similarity search, and any tool that does similarity search well will work.
The LLM Choice: Claude handles response generation. It receives retrieved context plus the user's question and synthesizes an answer. The prompt engineering matters here: explicit instructions to answer only from provided sources, to cite specifically, to acknowledge when the sources don't contain relevant information. This is where hallucination prevention lives.
The Frontend Reality: React with TypeScript because I wanted type safety and a component model that scales. Tailwind because life is too short for CSS debugging. Vite because fast builds make development pleasant. These choices are almost aesthetic. The important thing is that the frontend handles streaming responses so users see answers generating in real time.
The Deployment Constraint: Render hosts the demo in a Docker container. I originally tried local embedding models (SentenceTransformer) but they require nearly a gigabyte of RAM just to load. Free tier resources are limited everywhere, so I moved to hosted embeddings. This is the kind of tradeoff you discover in production, not in tutorials.
The architecture diagram shows every layer with alternatives. Not because you should swap things out, but because understanding what's swappable helps you see where the actual value lives: the RAG pattern itself.
What You See When You Use It
The interface is intentionally simple. A chat window with suggested questions to get you started.
Ask about home office deductions. Here's what happens behind the scenes:
From Question to Cited Answer
What happens in milliseconds when you ask a tax question
Click any source. You see the exact text the AI used to formulate its response. No black box. Complete traceability.
The AI can still be wrong about interpretation. But you can verify the source material yourself.
The Technology Stack Explained
For those who want to understand the implementation:
Frontend: React with TypeScript, styled with Tailwind CSS, built with Vite. The chat interface handles Server-Sent Events for streaming responses.
Backend: FastAPI running Python 3.11. Modular service architecture separating concerns: embeddings, vector store, LLM, document processing, chat history.
Vector Search: Qdrant Cloud stores document chunks as high-dimensional vectors. When a question comes in, we embed it and find the nearest neighbors in vector space. This is semantic search, finding documents by meaning rather than keywords.
LLM: Claude handles the response generation. It receives the user's question plus the retrieved context and generates an answer grounded in those sources.
Embeddings: OpenAI's text-embedding-3-small converts text to 1536-dimensional vectors. It runs on every incoming question and was also used during document ingestion.
Persistence: Upstash Redis stores chat history, enabling conversation continuity across sessions.
Deployment: Docker container on Render. The entire application (frontend, backend, static assets) runs in a single container that scales on demand.
Why This Matters Beyond Tax Questions
The pattern here applies to any domain where:
- Accuracy matters more than speed
- Source attribution is required
- The knowledge base is authoritative and bounded
- Hallucination has real consequences
Legal documents. Medical protocols. Compliance requirements. Internal policy. Product documentation.
Anywhere you need AI to answer from your sources rather than the internet's collective imagination.
Try It Yourself
Live Demo: ai-tax-assistant-y0yz.onrender.com
- Ask about home office deductions
- Ask about charitable contribution limits
- Ask about medical expense thresholds
- Click the source citations to see the underlying text
Architecture Diagram: View the full system design
What's Next
This demo uses sample IRS content. A production system would ingest the full IRS publication library, implement document upload for custom sources, and add more sophisticated chunking strategies.
The foundation is in place. The pattern is proven.
RAG is how you make AI trustworthy for domains where being wrong isn't acceptable.
Building AI systems for your business? I design and implement workflow automations that leverage your authoritative sources. Tell me what you're looking to build.
System Version: 2026.1 | Architecture: View Diagram | Demo: Try It