Winston Churchill said “democracy is the worst form of government – except for all the others that have been tried.” I think about this quote when it comes to LLMs and RAG. Lots of tech influencers try to convince people that Retrieval Augmented Generation will solve all of their problems with LLM hallucinations and the like… The actual answer is that RAG really is the worst…
What is Retrieval Augmented Generation (RAG)?
Retrieval augmented generation, otherwise known as RAG, is pretty straightforward. Imagine giving a chatbot or LLM the ability to Google stuff based on your questions and then use the search results in its answers (that's what Microsoft's Bing and Google's generative AI search actually do). When we implement RAG in our own apps, we search our own data (documents, databases, company knowledge, etc.) instead of the internet, feed those results to the LLM, and tell it to use them to answer the question. People use various search techniques like vector databases, embeddings, and different indexing approaches, but that's the gist of what most RAG implementations are doing.
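To make that concrete, here's a minimal sketch of that loop. The `search_documents` and `call_llm` functions are placeholders for whatever search backend and LLM client you actually use, not any particular library's API:

```python
# A minimal sketch of the RAG loop: search your own data, paste the hits
# into the prompt, and ask the model to answer from them.
# `search_documents` and `call_llm` are placeholders, not a real library's API.
def answer_with_rag(question: str, top_k: int = 5) -> str:
    hits = search_documents(question, limit=top_k)    # vector, keyword, or hybrid search
    context = "\n\n".join(hit.text for hit in hits)   # the retrieved "snippets"
    prompt = (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)                           # one grounded LLM call
```

Every decision hiding in those few lines (how the data was chunked, how it's searched, how many hits you keep) is where the trouble starts.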
RAG is the worst because your chunk size is too small:
When doing RAG, token limits, cost, and performance concerns prevent us from feeding entire documents into the context window. We must selectively include relevant parts to answer the question. This can lead to issues similar to relying solely on search result snippets without accessing the full content.
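Here's roughly what the naive fixed-size chunking most demos start with looks like. Nothing stops the split from landing between a question and its answer, so the chunk that matches the query may be missing the part that matters:

```python
# Naive fixed-size chunking, the default in many RAG demos.
# The split points know nothing about where sentences, sections, or answers
# end, so the chunk that matches a query may not contain the answer at all.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```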
For example, imagine you could only search for information by typing a query into a search engine and reading the results page, without ever clicking through to the actual content. While you could answer many questions this way, the restricted view of the sources can easily lead to misinterpretations and missing crucial context. The search might match a section that contains the question but not the answer, or a snippet where the author is restating, rebutting, or being sarcastic about a claim, making it easy to misunderstand what the source actually says.
In current implementations of generative search on Bing and Google, this kind of misinterpretation is very common. For example:
When asked how to move a post in Discord to an existing thread, Google provides a step-by-step guide that includes a non-existent "Move thread" option. When you dig in, you discover that the cited source is actually a Discord support thread discussing the hypothetical feature, but without the full context, Google's RAG approach mistakenly presents it as fact. In this case, RAG worsens the problem of "hallucinated" responses by citing a misleading source that appears helpful but isn't.
RAG is the worst because your chunk size is too big:
Using larger chunks in RAG can provide more context and mitigate some of the above issues, but it introduces new problems. While smaller chunks let a diverse set of search results fit within the model's context window, larger chunks limit the variety of information fed to the LLM. The top search results dominate the context, causing the model to fixate on particular topics or documents while other relevant sources that didn't make the arbitrary cutoff get ignored entirely.
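A bit of back-of-the-envelope math shows the tradeoff (the numbers here are illustrative, not tied to any particular model):

```python
# With a fixed token budget for retrieved text, bigger chunks mean fewer
# distinct sources make it into the context at all.
context_budget_tokens = 6000  # what you're willing to spend on retrieved text

for chunk_size in (256, 512, 2048):
    sources = context_budget_tokens // chunk_size
    print(f"{chunk_size}-token chunks -> at most {sources} sources in context")

# 256-token chunks -> at most 23 sources in context
# 512-token chunks -> at most 11 sources in context
# 2048-token chunks -> at most 2 sources in context
```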
RAG is the worst because LLMs are smart:
RAG can be overkill for many user queries, especially as LLMs improve and models get bigger. The extra latency and cost of searching and injecting context is often unnecessary when the LLM could provide a good answer on its own. Oftentimes, clever prompt engineering and latent space activation techniques can achieve results similar to RAG without using up as many tokens. Function calling lets the model decide whether to respond directly or fall back to RAG, but that choice is unpredictable: users can get different results for the same question depending on which path the model takes, and the inconsistency makes testing difficult.
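For what it's worth, here's roughly what that "let the model decide" pattern looks like with OpenAI-style tool calling. The `search_knowledge_base` tool is hypothetical, and the inconsistency comes from the fact that the model, not you, decides which branch runs:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical search tool the model *may* choose to call instead of
# answering from its own knowledge.
tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search internal documents for passages relevant to the question.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How do I rotate our API keys?"}],
    tools=tools,
    tool_choice="auto",  # the model picks: answer directly or call the tool
)

message = response.choices[0].message
if message.tool_calls:
    # The model chose RAG: run the search, append the results, call it again.
    ...
else:
    # The model answered from its own knowledge -- no retrieval round trip.
    print(message.content)
```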
RAG is the worst because LLMs are not smart:
LLMs sometimes struggle with common sense, even when they've been trained on relevant data. It's frustrating when a model gives a flawed answer because the RAG context it's been handed is inadequate or irrelevant. The model will readily answer from that limited context, even when it conflicts with knowledge it learned during training. By making the LLM lean heavily on a provided "cheatsheet", RAG can produce "I don't know" or "couldn't find the answer" responses in cases where an LLM without RAG might have come up with a perfectly reasonable answer on its own.
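The culprit is usually the prompt itself. A typical "grounded" RAG template looks something like the sketch below (the wording is illustrative), and the strict instruction is exactly what produces those unhelpful refusals when retrieval comes back with weak or irrelevant chunks:

```python
# A typical "answer only from the context" RAG prompt (illustrative wording).
# When the retrieved chunks are thin or off-topic, this instruction forces
# "I don't know" even if the base model could have answered on its own.
RAG_PROMPT = """Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}
"""
```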
RAG is the worst because it isn't the magic bullet you think it is:
It seems that the real issue isn't necessarily with RAG itself, but rather with people's attitudes towards new and magical-seeming technologies. When a groundbreaking innovation like RAG emerges, those with limited technical understanding tend to have blind faith in its capabilities, believing it will solve all their problems effortlessly. This is particularly common among executives who often hand-wave away the intricacies of implementing great LLM-powered experiences.
The attitude of "just use RAG" as a magic bullet is dangerous. People assume that simply piping their data into an LLM will automatically provide insights, disregarding the time and effort required to get it right or the need for a multi-pronged approach. When you try to explain that things aren't quite that simple, non-technical executives often lecture you about how AI "should work" based on the latest demo video they saw on social media or in an overly simplified advertisement. (Don't even get me started on how this magical thinking creeps in with even unreleased models and technologies like Q* and GPT-5.)
The real danger lies in creating poor products and customer experiences by over-relying on an approach that has both pros and cons, while focusing solely on the pros and ignoring all of the cons.
RAG is the worst because it's too simple:
If you look at the advertisements and promises from various RAG proponents, one of the unifying messages is that RAG is easy to set up. Depending on what you go with, it may be as easy as hooking your Google Drive up to a SaaS offering or uploading your documents. This demos well when you're doing a simple "talk to your PDF" sort of thing, but these naive approaches rarely hold up to concrete use cases, mostly because their generic design can't cope with many of the issues we've already discussed. If you want to build a great app with RAG, you'll often need to dive deeper, and you'll find that the time you spent on a naive approach that got you fifty percent of the way there was mostly wasted time you should have invested in a custom approach that's in tune with your data and your needs.
RAG is the worst because it's too complex:
Once you dive into the data and want to support things more complicated than throwaway demos, you'll find you have to start making real engineering decisions to be flexible enough to actually be useful. What are you actually going to do in your bespoke RAG implementation? Semantic search? Which approach? Vector search? Which embedding model? Which vector database? What chunking strategy are you going to use? Are you going to use LangChain? Semantic Kernel? Wait a second, I thought all of this was supposed to be point and click... Maybe it's better to just wait for Q* and GPT-5 to solve all of your problems.
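To put the "point and click" promise in perspective, here's a sketch of the decisions that pile up once the demo isn't enough. Every field below is illustrative rather than any particular framework's API, and each one is a choice you now own:

```python
from dataclasses import dataclass

# Illustrative only: a rough inventory of the knobs a bespoke RAG pipeline
# forces you to pick, each with its own failure modes.
@dataclass
class RagPipelineConfig:
    chunking_strategy: str = "recursive"             # fixed-size? by heading? by sentence?
    chunk_size_tokens: int = 512                     # too small? too big? (see above)
    chunk_overlap_tokens: int = 64
    embedding_model: str = "text-embedding-3-small"  # or an open-source model?
    vector_store: str = "pgvector"                   # or Pinecone, Qdrant, Chroma, FAISS...
    retrieval_mode: str = "hybrid"                   # pure vector? keyword? rerank on top?
    top_k: int = 8
    framework: str = "roll-your-own"                 # LangChain? Semantic Kernel?
```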
RAG is the worst because it is "good enough":
Don't get me wrong. RAG works a good percentage of the time, and experimentation generally shows that RAG-enabled LLMs consistently outperform LLMs without it. That said, my fear is that teams will over-index on RAG and ignore other approaches for improving LLM results. Approaches like fine-tuning, chain-of-thought, latent space activation, good prompting discipline, context stuffing, and many others can improve the results we get from chatbots and LLMs by significant margins and help them stick closer to the truth. Unfortunately, teams often implement a simple RAG approach that gets them 75% of the way there and are happy taking the "C+", getting the product out the door, and releasing a mediocre experience to their customers. Most users aren't really going to know when it's hallucinating anyway, right?