Large Language Models (LLMs) like GPT-4 and Claude are extraordinarily powerful, but they suffer from a fundamental limitation: their knowledge is frozen at the time of training. They cannot access your internal documents, your database, or real-time information. Retrieval-Augmented Generation (RAG) solves exactly this problem by combining the generative power of LLMs with the ability to retrieve information from external sources.
The Problem: LLM Limitations
Before talking about RAG, it's important to understand why we need it.
- Static knowledge: An LLM only knows what it saw during training. If you ask about an event that occurred after its cutoff, it cannot answer.
- Hallucinations: When an LLM doesn't know the answer, it tends to fabricate one, generating plausible but completely false information.
- No access to private data: A generic LLM has no access to your company's internal documentation, tickets, or codebase.
RAG addresses all three of these problems by providing the model with relevant context retrieved from external sources at query time.
What is RAG?
Retrieval-Augmented Generation is an architecture that enriches the prompt sent to an LLM with information retrieved from an external knowledge base. Instead of relying solely on the model's parametric knowledge, RAG searches for relevant information first and then injects it into the prompt, enabling the model to generate accurate, grounded responses.
How RAG Works in Detail
The RAG architecture consists of two main phases: Indexing (offline) and Retrieval + Generation (online).
Phase 1: Indexing (Document Ingestion)
The indexing phase prepares your documents for semantic search. It consists of four steps.
1. Document Loading
Documents can come from any source: PDF files, web pages, databases, Markdown files, APIs. The Document Loader reads these documents and converts them into structured text.
2. Text Splitting (Chunking)
LLMs have a limited context window, and documents can be very long. The Text Splitter divides documents into smaller fragments called chunks. The quality of chunking is critical: chunks that are too small lose context, while chunks that are too large dilute relevance.
The most common strategies are:
- Recursive Character Splitting: Recursively splits text using separators like \n\n, \n, and ". ", respecting the document structure.
- Semantic Splitting: Uses embeddings to find natural breakpoints in the text.
- Chunk Overlap: Includes overlap between consecutive chunks to preserve context at boundaries.
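As a minimal illustration of the overlap idea, here is a naive fixed-size splitter in plain Python (a simplified stand-in for a real text splitter, using character counts rather than tokens):

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Naive fixed-size splitter: consecutive chunks share chunk_overlap
    characters, so context at each boundary appears in both chunks."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(250))
pieces = split_with_overlap(doc)
# The last 20 characters of pieces[0] equal the first 20 of pieces[1]
```

Real splitters improve on this by preferring to break at paragraph or sentence boundaries instead of arbitrary character offsets.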
3. Embedding
Each chunk is transformed into a numerical vector (embedding) via an embedding model (like OpenAI's text-embedding-3-small). These vectors capture the semantic meaning of the text: sentences with similar meanings will have vectors that are close in multidimensional space.
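The notion of "close in multidimensional space" is usually measured with cosine similarity. A toy sketch with hypothetical 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.2, 0.05]
invoice = [0.0, 0.1, 0.95]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

Semantically related texts ("cat" and "kitten") score far higher than unrelated ones, which is exactly the property similarity search exploits.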
4. Vector Store
The vectors are saved in a Vector Store (or vector database), such as ChromaDB, Pinecone, Weaviate, or FAISS. This database is optimized for similarity search: given a query, it finds the most similar vectors (and therefore the most relevant text chunks).
Phase 2: Retrieval + Generation
When the user asks a question:
- The question is transformed into an embedding using the same embedding model.
- The Vector Store finds the most similar chunks via similarity search (typically cosine similarity or Euclidean distance).
- The retrieved chunks are inserted into the prompt as context.
- The LLM generates a response based on the provided context.
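The four steps above can be sketched end to end with a toy index. The embed function here is a hypothetical keyword counter standing in for a real embedding model:

```python
def embed(text):
    """Hypothetical embedding: counts a few keywords.
    A real system would call an embedding model here."""
    keywords = ["auth", "token", "billing", "invoice"]
    return [text.lower().count(k) for k in keywords]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

chunks = [
    "Auth uses short-lived tokens issued at login.",
    "Billing generates an invoice on the first of each month.",
]
index = [(embed(c), c) for c in chunks]           # offline: embed and store

query = "How does token auth work?"
q_vec = embed(query)                              # 1. embed the question
scored = sorted(index, key=lambda p: dot(p[0], q_vec), reverse=True)
top_chunks = [text for _, text in scored[:1]]     # 2. similarity search
context = "\n\n".join(top_chunks)                 # 3. inject as context
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"  # 4. send to the LLM
```

The retrieved auth chunk, not the billing one, ends up in the prompt, grounding the model's answer in the relevant source text.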
Building a RAG Pipeline with LangChain
LangChain is the most popular Python (and JavaScript) framework for building LLM-powered applications. It provides high-level abstractions for every component of the RAG pipeline.
Installation
```
pip install langchain langchain-openai langchain-community chromadb
```
Step 1: Load Documents
LangChain provides dozens of Document Loaders for different data sources.
```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    DirectoryLoader,
    TextLoader,
)

# Load a PDF
pdf_loader = PyPDFLoader("docs/manual.pdf")
pdf_docs = pdf_loader.load()

# Load a web page
web_loader = WebBaseLoader("https://docs.example.com/guide")
web_docs = web_loader.load()

# Load all .md files from a directory
dir_loader = DirectoryLoader("./knowledge_base", glob="**/*.md", loader_cls=TextLoader)
md_docs = dir_loader.load()

all_docs = pdf_docs + web_docs + md_docs
```
Step 2: Split Documents into Chunks
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = text_splitter.split_documents(all_docs)
print(f"Original documents: {len(all_docs)}, Chunks: {len(chunks)}")
```
The chunk_overlap parameter is crucial: it creates overlap between consecutive chunks so that context is not lost at boundaries.
Step 3: Create Embeddings and Vector Store
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)
```
Step 4: Create the Retriever
The retriever is the component that, given a query, fetches the most relevant chunks from the vector store.
```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

relevant_docs = retriever.invoke("How does authentication work?")
for doc in relevant_docs:
    print(doc.page_content[:200])
    print("---")
```
Step 5: Build the RAG Chain
Now let's put everything together with an LLM and a prompt template.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the context does not contain enough information, say you don't know.

Context:
{context}

Question: {question}

Answer:
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke("How does authentication work in the system?")
print(response)
```
Advanced RAG Techniques
The basic pipeline works well, but there are several techniques to significantly improve response quality.
Multi-Query Retrieval
Sometimes the user's query is ambiguous or not aligned with the language used in the documents. The Multi-Query Retriever automatically generates variants of the original question to capture multiple perspectives.
```python
from langchain.retrievers import MultiQueryRetriever

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)

docs = multi_retriever.invoke("What are the security best practices?")
```
Contextual Compression
Not all content in a chunk is relevant to the query. The Contextual Compression Retriever uses an LLM to extract only the pertinent parts from each retrieved chunk.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
```
Hybrid Search
Purely semantic search is not always optimal. Hybrid Search combines semantic search (embeddings) with lexical search (BM25, keyword matching) to achieve better results.
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],
)
```
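To see what the ensemble actually does with the two ranked lists, here is a simplified sketch of weighted Reciprocal Rank Fusion, the merging strategy EnsembleRetriever is based on (document names are illustrative):

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: each document scores weight / (k + rank),
    summed over every list it appears in; higher total ranks first."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]       # lexical results
semantic_ranking = ["doc_c", "doc_a", "doc_d"]   # embedding results
fused = weighted_rrf([bm25_ranking, semantic_ranking], weights=[0.4, 0.6])
```

A document that ranks well in both lists (doc_a here) rises to the top, while one that appears in only a single list is penalized.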
Conversational RAG (with Memory)
To build a RAG chatbot that remembers the conversation context, you need to keep the chat history and add a step that reformulates the user's questions, taking that history into account, before retrieval.
```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Given the chat history and the user's latest question, "
     "reformulate the question so it is understandable without the history."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)
```
Best Practices
- Choose the right chunk size: Experiment with different sizes (500-1500 tokens). Smaller chunks for precise answers, larger ones for broader context.
- Use document metadata: Add source, date, and category as metadata to chunks. This allows filtering results during retrieval.
- Evaluate quality: Use frameworks like RAGAS to measure metrics such as faithfulness, relevancy, and context precision.
- Handle document updates: Implement a re-ingestion pipeline to keep the vector store synchronized with your data sources.
- Add a re-ranker: After initial retrieval, use a re-ranking model (like Cohere Rerank) to reorder results based on actual relevance.
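The metadata best practice, for instance, amounts to filtering chunks before similarity search ever runs. A plain-Python sketch with illustrative field names:

```python
chunks = [
    {"text": "Rotate API keys quarterly.", "source": "security.md", "category": "security"},
    {"text": "Invoices are emailed monthly.", "source": "billing.md", "category": "billing"},
]

def retrieve(chunks, category=None):
    """Filter by metadata first; a real system would then run
    similarity search over only the remaining chunks."""
    if category is not None:
        chunks = [c for c in chunks if c["category"] == category]
    return chunks

security_chunks = retrieve(chunks, category="security")
```

Vector stores like Chroma and Pinecone support this kind of metadata filter natively, so the search space shrinks before any vectors are compared.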
Conclusion
RAG has become the standard architecture for building AI applications that need access to specific, up-to-date knowledge. LangChain greatly simplifies the implementation, providing abstractions for every component of the pipeline.
Next steps:
- Experiment locally: Start with ChromaDB and a few documents to get familiar with the pipeline.
- Explore LangSmith: Use LangSmith to monitor and debug your chains in production.
- Try different embedding models: Compare models like text-embedding-3-small, text-embedding-3-large, and open-source models from Sentence Transformers.
- Check the documentation: The LangChain documentation is an excellent and constantly updated resource.