Large Language Models (LLMs) like GPT-4 and Claude are extraordinarily powerful, but they suffer from a fundamental limitation: their knowledge is frozen at the time of training. They cannot access your internal documents, your database, or real-time information. Retrieval-Augmented Generation (RAG) solves exactly this problem by combining the generative power of LLMs with the ability to retrieve information from external sources.
The Problem: LLM Limitations
Before talking about RAG, it's important to understand why we need it.
- Static knowledge: An LLM only knows what it saw during training. If you ask about an event that occurred after its cutoff, it cannot answer.
- Hallucinations: When an LLM doesn't know the answer, it tends to fabricate one, generating plausible but completely false information.
- No access to private data: A generic LLM has no access to your company's internal documentation, tickets, or codebase.
RAG addresses all three of these problems by providing the model with relevant context retrieved from external sources at query time.
What is RAG?
Retrieval-Augmented Generation is an architecture that enriches the prompt sent to an LLM with information retrieved from an external knowledge base. Instead of relying solely on the model's parametric knowledge, RAG searches for relevant information first and then injects it into the prompt, enabling the model to generate accurate, grounded responses.
How RAG Works in Detail
The RAG architecture consists of two main phases: Indexing (offline) and Retrieval + Generation (online).
Phase 1: Indexing (Document Ingestion)
The indexing phase prepares your documents for semantic search. It consists of four steps.
1. Document Loading
Documents can come from any source: PDF files, web pages, databases, Markdown files, APIs. The Document Loader reads these documents and converts them into structured text.
2. Text Splitting (Chunking)
LLMs have a limited context window, and documents can be very long. The Text Splitter divides documents into smaller fragments called chunks. The quality of chunking is critical: chunks that are too small lose context, while chunks that are too large dilute relevance.
The most common strategies are:
- Recursive Character Splitting: Recursively splits text using separators like \n\n, \n, and ". ", respecting the document structure.
- Semantic Splitting: Uses embeddings to find natural breakpoints in the text.
- Chunk Overlap: Includes overlap between consecutive chunks to preserve context at boundaries.
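As a minimal illustration of the overlap idea, here is a naive fixed-size splitter in plain Python (a simplified stand-in for a real text splitter, using character counts rather than tokens):

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Naive fixed-size splitter: consecutive chunks share chunk_overlap
    characters, so context at each boundary appears in both chunks."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(250))
pieces = split_with_overlap(doc)
# The last 20 characters of pieces[0] equal the first 20 of pieces[1]
```

Real splitters improve on this by preferring to break at paragraph or sentence boundaries instead of arbitrary character offsets.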
3. Embedding
Each chunk is transformed into a numerical vector (embedding) via an embedding model (like OpenAI's text-embedding-3-small). These vectors capture the semantic meaning of the text: sentences with similar meanings will have vectors that are close in multidimensional space.
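The notion of "close in multidimensional space" is usually measured with cosine similarity. A toy sketch with hypothetical 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for illustration only
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.2, 0.05]
invoice = [0.0, 0.1, 0.95]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

Semantically related texts ("cat" and "kitten") score far higher than unrelated ones, which is exactly the property similarity search exploits.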
4. Vector Store
The vectors are saved in a Vector Store (or vector database), such as ChromaDB, Pinecone, Weaviate, or FAISS. This database is optimized for similarity search: given a query, it finds the most similar vectors (and therefore the most relevant text chunks).
Phase 2: Retrieval + Generation
When the user asks a question:
- The question is transformed into an embedding using the same embedding model.
- The Vector Store finds the most similar chunks via similarity search (typically cosine similarity or Euclidean distance).
- The retrieved chunks are inserted into the prompt as context.
- The LLM generates a response based on the provided context.
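The four steps above can be sketched end to end with a toy index. The embed function here is a hypothetical keyword counter standing in for a real embedding model:

```python
def embed(text):
    """Hypothetical embedding: counts a few keywords.
    A real system would call an embedding model here."""
    keywords = ["auth", "token", "billing", "invoice"]
    return [text.lower().count(k) for k in keywords]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

chunks = [
    "Auth uses short-lived tokens issued at login.",
    "Billing generates an invoice on the first of each month.",
]
index = [(embed(c), c) for c in chunks]           # offline: embed and store

query = "How does token auth work?"
q_vec = embed(query)                              # 1. embed the question
scored = sorted(index, key=lambda p: dot(p[0], q_vec), reverse=True)
top_chunks = [text for _, text in scored[:1]]     # 2. similarity search
context = "\n\n".join(top_chunks)                 # 3. inject as context
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"  # 4. send to the LLM
```

The retrieved auth chunk, not the billing one, ends up in the prompt, grounding the model's answer in the relevant source text.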
Building a RAG Pipeline with LangChain
LangChain is the most popular Python (and JavaScript) framework for building LLM-powered applications. It provides high-level abstractions for every component of the RAG pipeline.
Installation
```
pip install langchain langchain-openai langchain-community chromadb
```
Step 1: Load Documents
LangChain provides dozens of Document Loaders for different data sources.
```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    DirectoryLoader,
    TextLoader,
)

# Load a PDF
pdf_loader = PyPDFLoader("docs/manual.pdf")
pdf_docs = pdf_loader.load()

# Load a web page
web_loader = WebBaseLoader("https://docs.example.com/guide")
web_docs = web_loader.load()

# Load all .md files from a directory
dir_loader = DirectoryLoader("./knowledge_base", glob="**/*.md", loader_cls=TextLoader)
md_docs = dir_loader.load()

all_docs = pdf_docs + web_docs + md_docs
```
Step 2: Split Documents into Chunks
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = text_splitter.split_documents(all_docs)
print(f"Original documents: {len(all_docs)}, Chunks: {len(chunks)}")
```
The chunk_overlap parameter is crucial: it creates overlap between consecutive chunks so that context is not lost at boundaries.
Step 3: Create Embeddings and Vector Store
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)
```
Step 4: Create the Retriever
The retriever is the component that, given a query, fetches the most relevant chunks from the vector store.
```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

relevant_docs = retriever.invoke("How does authentication work?")
for doc in relevant_docs:
    print(doc.page_content[:200])
    print("---")
```
Step 5: Build the RAG Chain
Now let's put everything together with an LLM and a prompt template.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the context does not contain enough information, say you don't know.

Context:
{context}

Question: {question}

Answer:
""")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke("How does authentication work in the system?")
print(response)
```
Advanced RAG Techniques
The basic pipeline works well, but there are several techniques to significantly improve response quality.
Multi-Query Retrieval
Sometimes the user's query is ambiguous or not aligned with the language used in the documents. The Multi-Query Retriever automatically generates variants of the original question to capture multiple perspectives.
```python
from langchain.retrievers import MultiQueryRetriever

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)

docs = multi_retriever.invoke("What are the security best practices?")
```
Contextual Compression
Not all content in a chunk is relevant to the query. The Contextual Compression Retriever uses an LLM to extract only the pertinent parts from each retrieved chunk.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
```
Hybrid Search
Purely semantic search is not always optimal. Hybrid Search combines semantic search (embeddings) with lexical search (BM25, keyword matching) to achieve better results.
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],
)
```
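To see what the ensemble actually does with the two ranked lists, here is a simplified sketch of weighted Reciprocal Rank Fusion, the merging strategy EnsembleRetriever is based on (document names are illustrative):

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: each document scores weight / (k + rank),
    summed over every list it appears in; higher total ranks first."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]       # lexical results
semantic_ranking = ["doc_c", "doc_a", "doc_d"]   # embedding results
fused = weighted_rrf([bm25_ranking, semantic_ranking], weights=[0.4, 0.6])
```

A document that ranks well in both lists (doc_a here) rises to the top, while one that appears in only a single list is penalized.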
Conversational RAG (with Memory)
To build a RAG chatbot that remembers the conversation context, you need to keep the chat history and add a step that reformulates the user's questions, taking that history into account, before retrieval.
```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Given the chat history and the user's latest question, "
     "reformulate the question so it is understandable without the history."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)
```
Best Practices
- Choose the right chunk size: Experiment with different sizes (500-1500 tokens). Smaller chunks for precise answers, larger ones for broader context.
- Use document metadata: Add source, date, and category as metadata to chunks. This allows filtering results during retrieval.
- Evaluate quality: Use frameworks like RAGAS to measure metrics such as faithfulness, relevancy, and context precision.
- Handle document updates: Implement a re-ingestion pipeline to keep the vector store synchronized with your data sources.
- Add a re-ranker: After initial retrieval, use a re-ranking model (like Cohere Rerank) to reorder results based on actual relevance.
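The metadata best practice, for instance, amounts to filtering chunks before similarity search ever runs. A plain-Python sketch with illustrative field names:

```python
chunks = [
    {"text": "Rotate API keys quarterly.", "source": "security.md", "category": "security"},
    {"text": "Invoices are emailed monthly.", "source": "billing.md", "category": "billing"},
]

def retrieve(chunks, category=None):
    """Filter by metadata first; a real system would then run
    similarity search over only the remaining chunks."""
    if category is not None:
        chunks = [c for c in chunks if c["category"] == category]
    return chunks

security_chunks = retrieve(chunks, category="security")
```

Vector stores like Chroma and Pinecone support this kind of metadata filter natively, so the search space shrinks before any vectors are compared.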
Conclusion
RAG has become the standard architecture for building AI applications that need access to specific, up-to-date knowledge. LangChain greatly simplifies the implementation, providing abstractions for every component of the pipeline.
Next steps:
- Experiment locally: Start with ChromaDB and a few documents to get familiar with the pipeline.
- Explore LangSmith: Use LangSmith to monitor and debug your chains in production.
- Try different embedding models: Compare models like text-embedding-3-small, text-embedding-3-large, and open-source models from Sentence Transformers.
- Check the documentation: The LangChain documentation is an excellent and constantly updated resource.