Production AI

Why You Should Avoid AI Frameworks

LangChain, LlamaIndex, Haystack, CrewAI, DSPy. The AI ecosystem is drowning in orchestration frameworks. They promise rapid prototyping, clean abstractions, and "production-ready" pipelines.

I've used them. I've shipped with them. And I've ripped them out.

Not because they're bad software. Because they abstract the wrong layer.

The Wrong Abstraction

Here's what matters in production LLM systems:

  1. What goes into the prompt (your context)
  2. How documents are formatted (boundaries, IDs, separators)
  3. What the model actually receives (the full payload)
  4. What comes back (parsing, validation, error handling)

These four things are where 90% of your bugs live. Citation hallucination? That's #2. Token budget blowup? That's #1. Unparseable responses? That's #4.

AI frameworks abstract exactly these four things away from you.

They give you a nice .invoke() or .run() method, and somewhere deep inside, they build a prompt, format your documents, add system instructions, maybe inject few-shot examples, and fire it off to the API. You don't see any of it.

When it works, it's magic. When it breaks, you're blind.

The Debugging Nightmare

Here's a scenario that happens more often than it should.

You have a RAG pipeline in production. Users report wrong answers. You look at the logs. The retrieval step returned the right documents. The model's response looks reasonable. But the citations don't match the content.

You need to see the actual prompt that was sent to the model. The full text. The exact token sequence.

With a thin wrapper around the API, you print(prompt) and you're done. Five seconds.

With a framework, you start a journey. You dig into the chain definition. You find a PromptTemplate that references a Retriever that feeds into a StuffDocumentsChain that wraps an LLMChain. Each layer adds its own formatting. The actual prompt is assembled at runtime, deep inside the framework's call stack.

You can't just print it. You have to either:

  • Add a callback handler and intercept the API call
  • Use a proxy like mitmproxy to sniff the HTTP request
  • Read the framework's source code to understand what it constructs

Hamel Husain wrote the definitive piece on this: Fuck You, Show Me The Prompt. He uses mitmproxy to intercept what frameworks actually send to the API. The results are revealing: unnecessary system messages, bloated prompts, redundant API calls, and formatting that you'd never write yourself.
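If a proxy feels heavy, the lightest-weight interception is a small wrapper around the client's create method that prints the outgoing payload before it leaves your process. This is a sketch, not any framework's API: it assumes an OpenAI-style client where messages are passed as a keyword argument.

```python
import functools


def show_payload(create_fn):
    """Wrap an OpenAI-style create() so every outgoing payload is printed."""
    @functools.wraps(create_fn)
    def wrapper(**kwargs):
        print("=== payload sent to the API ===")
        for message in kwargs.get("messages", []):
            print(message)
        return create_fn(**kwargs)
    return wrapper


# Point the framework at the wrapped method and you see exactly what it builds:
# client.chat.completions.create = show_payload(client.chat.completions.create)
```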

The time you saved by not writing a prompt template, you spent tenfold debugging the framework's prompt template.

What Frameworks Get Wrong

They hide the prompt. The prompt is your product. It's the single most important artifact in your system. Hiding it behind abstractions is like hiding your SQL queries behind an ORM and wondering why your database is slow. Sometimes you need the ORM. But you always need to be able to see the query.

They add coupling. Your pipeline depends on the framework's release cycle, its breaking changes, its opinions about how documents should be formatted. When they change how StuffDocumentsChain concatenates text, your citation validation breaks and you don't know why.

They make observability harder. Good production systems log every prompt and every response. With raw API calls, this is trivial. With frameworks, you need to hook into their callback system, which may or may not expose what you need.

They optimize for demos, not debugging. A framework that makes a 5-line demo possible but makes a 5-minute debug session take 2 hours has the wrong priorities for production.

What You Should Do Instead

Build thin utility functions that you own and understand completely.

import logging

from openai import OpenAI

logger = logging.getLogger(__name__)
client = OpenAI()


def build_prompt(query: str, docs: list[str], system: str) -> tuple[list[dict], dict[str, str]]:
    """You can see exactly what goes to the model."""
    formatted_docs, id_map = format_docs_for_prompt(docs)

    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{formatted_docs}\n\nQuestion: {query}"},
    ], id_map


def call_llm(messages: list[dict], model: str = "gpt-4") -> str:
    """Thin wrapper. Log everything. No magic."""
    logger.info(f"Prompt: {messages}")
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    result = response.choices[0].message.content
    logger.info(f"Response: {result}")
    return result

This is ~20 lines of code. You own it. You can read it. You can debug it. You can log it. You can change how documents are formatted without reading 500 lines of framework source.

The format_docs_for_prompt function from my citation hallucination post is exactly this kind of utility: small, clear, does one thing, and you control the output completely.
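End to end, the five-second debugging story looks like this. The stub formatter below is a placeholder just to keep the sketch self-contained (the real format_docs_for_prompt assigns random letter IDs and mega separators); the rest is the build_prompt utility above.

```python
def format_docs_for_prompt(docs: list[str]) -> tuple[str, dict[str, str]]:
    # Stand-in formatter for this sketch; the real one uses random letter IDs.
    id_map = {f"D{i}": doc for i, doc in enumerate(docs)}
    formatted = "\n\n".join(f"DOC [{doc_id}]: {doc}" for doc_id, doc in id_map.items())
    return formatted, id_map


def build_prompt(query: str, docs: list[str], system: str):
    formatted_docs, id_map = format_docs_for_prompt(docs)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{formatted_docs}\n\nQuestion: {query}"},
    ], id_map


messages, id_map = build_prompt(
    query="What is the capital of France?",
    docs=["Paris is the capital of France"],
    system="Answer using only the documents. Cite doc IDs.",
)
print(messages)  # the full payload, one line, no callbacks, no proxy
```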

When Frameworks Are Okay

I'm not saying never use them.

Use them for:

  • Prototyping and exploration: get something working in 10 minutes
  • Learning: see how different patterns are structured
  • Hackathons: ship fast, throw away later

Don't use them for:

  • Production systems where you need to debug at 2am
  • Pipelines where prompt quality directly impacts revenue
  • Any system where you need to explain to your team what the model actually receives

The rule is simple: if you can't print(prompt) in one line, you've lost control of your system.

The Bottom Line

The AI framework market is a solution in search of a problem. The "problem" they solve (calling an API and formatting a prompt) is not hard. It's 20 lines of Python. What IS hard is making that prompt correct, debugging it when it's wrong, and observing it in production.

Frameworks make the easy part easier and the hard part harder. That's the wrong tradeoff.

Own your context. Own your prompts. Own your pipeline. The abstractions you need are the ones you build yourself.


This post connects to Stop Citation Hallucination in RAG, a concrete example of what happens when you don't control how your documents are formatted in the prompt.

Have thoughts on this? Reach out on LinkedIn.

Stop Citation Hallucination in RAG

Your RAG system asks the LLM to cite sources. You use DOC 1, DOC 2, DOC 3. The LLM confidently cites DOC 4, which doesn't exist.

This is not a model bug. It's a prompt design bug, and it compounds fast when you build pipelines on top.

The Problem

Bad: Sequential numbers

DOC 1: Paris is the capital of France
DOC 2: Python is used for ML
DOC 3: Git enables collaboration

The LLM sees 1, 2, 3 and autocompletes the sequence. It hallucinates DOC 4, DOC 5 because that's what autoregressive models do: they predict the next likely token.

Good: Random letter IDs

DOC [XKJM]: Paris is the capital of France
DOC [PLQW]: Python is used for ML
DOC [BNRT]: Git enables collaboration

No pattern to continue. The LLM can't invent DOC [????] because there's no sequence to predict. It's forced to either cite an existing ID or cite nothing.

The Boundary Problem (The Real Mess)

Sequential IDs are only half the problem. The other half is document boundary confusion.

Consider a real e-commerce RAG system. You retrieve product pages, FAQ entries, and support tickets. You separate them with \n. Sounds fine, until you look at what's actually inside the documents:

DOC 1: Nike Air Max 90
Available sizes: 40, 41, 42, 43
Colors: White, Black

Customer reviews:
"Great shoe, very comfortable" - Jean P.
"Runs small, order one size up" - Marie L.

Return policy: 30 days, unworn condition
DOC 2: Adidas Ultraboost 22
Lightweight running shoe with Boost midsole.
Available sizes: 39, 40, 41, 42

Free shipping on orders over 50EUR.
DOC 3: FAQ - How to return an item?
Step 1: Log into your account
Step 2: Go to "My Orders"
Step 3: Click "Return"

Note: Items must be in original packaging.
Refund processed within 5-7 business days.

See the problem? Each document contains its own \n separators. The LLM has no way to know where DOC 1 ends and DOC 2 begins. The customer review, the empty lines, the multi-line FAQ steps: they all look like document boundaries.

Now the LLM might cite "DOC 2" when referring to the return policy, which really belongs to DOC 1 but sits right against the DOC 2 boundary. Or it merges content from two documents and attributes it to one. You can't tell the difference between a correct citation and a broken one because the boundaries themselves are ambiguous.

Why This Kills Your AgentOps Pipeline

If you're building a full agentic pipeline, this becomes catastrophic, fast.

Consider a typical agent flow:

User question
    -> Retrieval (fetch docs)
        -> LLM reasoning (cite sources)
            -> Citation validator (check IDs exist)
                -> Response formatter (build answer with references)
                    -> Feedback loop (user clicks citation -> sees source)

When the LLM hallucinates DOC 4:

  • Your citation validator passes because 4 looks like a valid integer, and you might have 4+ docs in your index. You'd need to check against the exact set retrieved for this query, not just "does this ID exist somewhere."
  • Your response formatter links to the wrong document, or crashes on a missing reference.
  • Your feedback loop shows the user a source that doesn't support the answer. Trust destroyed.
  • Your observability logs show a "successful" pipeline run with green checkmarks everywhere. The hallucination is invisible in your metrics.

Now scale this. 100K conversations per day. How many silent citation errors are you shipping? You don't know, because sequential IDs make hallucination look valid.

With random IDs like [XKJM], a hallucinated citation is immediately detectable: the string either exists in your retrieved set or it doesn't. No ambiguity, no edge cases. Your validator becomes a simple set lookup.

Two Fixes

1. Random letter IDs

Use 3-4 random uppercase letters instead of numbers. No sequence to predict. Validation is a trivial set membership check.

2. Mega separators

Use 20 newlines between documents. This sounds absurd, but it works. The massive whitespace gap creates an unambiguous visual boundary that the LLM can't confuse with in-document formatting.

Combined, the same example becomes clean:

DOC [XKJM]: Nike Air Max 90
Available sizes: 40, 41, 42, 43
Colors: White, Black

Customer reviews:
"Great shoe, very comfortable" - Jean P.
"Runs small, order one size up" - Marie L.

Return policy: 30 days, unworn condition




















DOC [PLQW]: Adidas Ultraboost 22
Lightweight running shoe with Boost midsole.
Available sizes: 39, 40, 41, 42

Free shipping on orders over 50EUR.




















DOC [BNRT]: FAQ - How to return an item?
Step 1: Log into your account
Step 2: Go to "My Orders"
Step 3: Click "Return"

Note: Items must be in original packaging.
Refund processed within 5-7 business days.

Now the LLM can't confuse an in-document newline with a document boundary. And it can't hallucinate a citation because there's no pattern to extend.

Implementation

import random
import string
import re


def generate_doc_id(length: int = 4) -> str:
    """Generate a random uppercase letter ID."""
    return ''.join(random.choices(string.ascii_uppercase, k=length))


def format_docs_for_prompt(docs: list[str]) -> tuple[str, dict[str, str]]:
    """Format documents with random IDs and mega separators.

    Returns the formatted string and a mapping of ID -> document
    for downstream validation.
    """
    separator = "\n" * 20
    id_to_doc = {}
    formatted_parts = []

    for doc in docs:
        doc_id = generate_doc_id()
        id_to_doc[doc_id] = doc
        formatted_parts.append(f"DOC [{doc_id}]: {doc}")

    return separator.join(formatted_parts), id_to_doc


def validate_citations(response: str, valid_ids: set[str]) -> list[str]:
    """Extract cited IDs from response and flag any hallucinated ones."""
    cited = set(re.findall(r'\[([A-Z]{4})\]', response))  # matches the default 4-letter IDs; widen if you change length
    hallucinated = cited - valid_ids
    return list(hallucinated)

The id_to_doc mapping makes your citation validator a one-liner. No fuzzy matching, no "does this number fall in range" logic. The cited ID exists or it doesn't.
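A usage sketch (restating the validator inline so the snippet stands alone): a cited ID either appears in the set retrieved for this query, or it's flagged.

```python
import re


def validate_citations(response: str, valid_ids: set[str]) -> list[str]:
    """Return any cited IDs that were not in the retrieved set."""
    cited = set(re.findall(r'\[([A-Z]{4})\]', response))
    return list(cited - valid_ids)


# IDs assigned to the documents retrieved for THIS query
valid_ids = {"XKJM", "PLQW", "BNRT"}

# Model output citing one real ID and one invented one
response = "Runs small, order a size up [XKJM]. Returns accepted for 60 days [QQQQ]."

print(validate_citations(response, valid_ids))  # ['QQQQ'] -> hallucinated
```

No range checks, no fuzzy matching: the hallucinated ID falls straight out of the set difference.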


The Bigger Picture

This is why, as an AI Engineer, you need full control over your context. The prompt is where your bugs live. Citation hallucination, boundary confusion, token waste: these are all context management problems.

AI frameworks abstract exactly this layer away from you. They build the prompt, format the documents, assemble the context, and you can't see or control what's happening. As Hamel Husain puts it: Fuck You, Show Me The Prompt.

If you want to go deeper on why this matters at scale, read my next post: Why You Should Avoid AI Frameworks. They abstract the wrong things.


Have other RAG debugging tricks? Reach out on LinkedIn.