# RAG and LangChain: A Complete Guide to Retrieval-Augmented Generation

Large Language Models (LLMs) like GPT-4 and Claude are extraordinarily powerful, but they suffer from a fundamental limitation: their knowledge is frozen at the time of training. They cannot access your internal documents, your database, or real-time information. **Retrieval-Augmented Generation (RAG)** solves exactly this problem by combining the generative power of LLMs with the ability to retrieve information from external sources.

## The Problem: LLM Limitations

Before talking about RAG, it's important to understand why we need it.

1. **Static knowledge**: An LLM only knows what it saw during training. If you ask about an event that occurred after its cutoff, it cannot answer.
2. **Hallucinations**: When an LLM doesn't know the answer, it tends to fabricate one, generating plausible but completely false information.
3. **No access to private data**: A generic LLM has no access to your company's internal documentation, tickets, or codebase.

RAG addresses all three of these problems by providing the model with **relevant context** retrieved from external sources at query time.

## What is RAG?

Retrieval-Augmented Generation is an architecture that enriches the prompt sent to an LLM with information retrieved from an external knowledge base.
Instead of relying solely on the model's parametric knowledge, RAG **searches** for relevant information first and then **injects** it into the prompt, enabling the model to generate accurate, grounded responses.

```mermaid
graph LR
    User["User"] -- "Question" --> Retriever
    Retriever -- "Search relevant\ndocuments" --> VectorStore["Vector Store"]
    VectorStore -- "Relevant\ndocuments" --> Retriever
    Retriever -- "Context + Question" --> LLM
    LLM -- "Grounded\nresponse" --> User
```

## How RAG Works in Detail

The RAG architecture consists of two main phases: **Indexing** (offline) and **Retrieval + Generation** (online).

### Phase 1: Indexing (Document Ingestion)

The indexing phase prepares your documents for semantic search. It consists of four steps.

```mermaid
graph TD
    A["Documents\n(PDF, HTML, MD, DB)"] --> B["Document Loader"]
    B --> C["Text Splitter"]
    C --> D["Text Chunks"]
    D --> E["Embedding Model"]
    E --> F["Numerical Vectors"]
    F --> G["Vector Store\n(ChromaDB, Pinecone, FAISS)"]
```

#### 1. Document Loading

Documents can come from any source: PDF files, web pages, databases, Markdown files, APIs. The **Document Loader** reads these documents and converts them into structured text.

#### 2. Text Splitting (Chunking)

LLMs have a limited context window, and documents can be very long. The **Text Splitter** divides documents into smaller fragments called *chunks*. The quality of chunking is critical: chunks that are too small lose context, while chunks that are too large dilute relevance.

The most common strategies are:
- **Recursive Character Splitting**: Recursively splits text using separators like `\n\n`, `\n`, `. `, respecting the document structure.
- **Semantic Splitting**: Uses embeddings to find natural breakpoints in the text.
- **Chunk Overlap**: Includes overlap between consecutive chunks to preserve context at boundaries.

#### 3. Embedding

Each chunk is transformed into a **numerical vector** (embedding) via an embedding model (like OpenAI's `text-embedding-3-small`). These vectors capture the semantic meaning of the text: sentences with similar meanings will have vectors that are close in multidimensional space.

#### 4. Vector Store

The vectors are saved in a **Vector Store** (or vector database), such as ChromaDB, Pinecone, Weaviate, or FAISS. This database is optimized for **similarity search**: given a query, it finds the most similar vectors (and therefore the most relevant text chunks).

### Phase 2: Retrieval + Generation

When the user asks a question:

1. The question is transformed into an embedding using the same embedding model.
2. The Vector Store finds the most similar chunks via **similarity search** (typically cosine similarity or Euclidean distance).
3. The retrieved chunks are inserted into the prompt as context.
4. The LLM generates a response based on the provided context.

## Building a RAG Pipeline with LangChain

**LangChain** is one of the most popular Python (and JavaScript) frameworks for building LLM-powered applications.
It provides high-level abstractions for every component of the RAG pipeline.

### Installation

The examples below also need `pypdf` (for `PyPDFLoader`), `beautifulsoup4` (for `WebBaseLoader`), and `rank_bm25` (for `BM25Retriever`).

```bash
pip install langchain langchain-openai langchain-community chromadb pypdf beautifulsoup4 rank_bm25
```

### Step 1: Load Documents

LangChain provides dozens of Document Loaders for different data sources.

```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    DirectoryLoader,
    TextLoader,
)

# Load a PDF
pdf_loader = PyPDFLoader("docs/manual.pdf")
pdf_docs = pdf_loader.load()

# Load a web page
web_loader = WebBaseLoader("https://docs.example.com/guide")
web_docs = web_loader.load()

# Load all .md files from a directory
dir_loader = DirectoryLoader("./knowledge_base", glob="**/*.md", loader_cls=TextLoader)
md_docs = dir_loader.load()

# Each loader returns a list of Document objects, so the lists can be concatenated
all_docs = pdf_docs + web_docs + md_docs
```

### Step 2: Split Documents into Chunks

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum chunk length, in characters by default
    chunk_overlap=200,    # characters shared between consecutive chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)

chunks = text_splitter.split_documents(all_docs)
print(f"Original documents: {len(all_docs)}, Chunks: {len(chunks)}")
```

The `chunk_overlap` parameter is crucial: it creates overlap between consecutive chunks so that context is not lost at boundaries. Note that both `chunk_size` and `chunk_overlap` are measured in characters by default, not tokens.

### Step 3: Create Embeddings and Vector Store

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Requires the OPENAI_API_KEY environment variable to be set
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed every chunk and store the vectors on disk in ./chroma_db
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)
```

### Step 4: Create the Retriever

The retriever is the component that, given a query, fetches the most relevant chunks from the vector store.

```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},  # return the 4 most similar chunks
)

relevant_docs = retriever.invoke("How does authentication work?")
for doc in relevant_docs:
    print(doc.page_content[:200])
    print("---")
```

### Step 5: Build the RAG Chain

Now let's put everything together with an LLM and a prompt template.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the context does not contain enough information, say you don't know.

Context:
{context}

Question: {question}

Answer:
""")

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# The question goes to the retriever (to build the context) and is also
# passed through unchanged to fill the {question} slot of the prompt
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke("How does authentication work in the system?")
print(response)
```

## Advanced RAG Techniques

The basic pipeline works well, but there are several techniques to significantly improve response quality.

### Multi-Query Retrieval

Sometimes the user's query is ambiguous or not aligned with the language used in the documents. The **Multi-Query Retriever** automatically generates variants of the original question to capture multiple perspectives.

```python
from langchain.retrievers import MultiQueryRetriever

# The LLM rewrites the question into several variants; the results are merged
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)

docs = multi_retriever.invoke("What are the security best practices?")
```

### Contextual Compression

Not all content in a chunk is relevant to the query. The **Contextual Compression Retriever** uses an LLM to extract only the pertinent parts from each retrieved chunk.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The compressor asks the LLM to keep only the query-relevant passages
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
```

### Hybrid Search

Purely semantic search is not always optimal.
**Hybrid Search** combines semantic search (embeddings) with lexical search (BM25, keyword matching) to achieve better results.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Lexical retriever built directly from the chunks (no embeddings involved)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Weights set the relative influence of each retriever when results are fused
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],
)
```

### Conversational RAG (with Memory)

To build a RAG chatbot that remembers the conversation context, you need a step that reformulates each user question using the conversation history, so that the retriever receives a standalone query.

```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Given the chat history and the user's latest question, "
               "reformulate the question so it is understandable without the history."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)
```

## Best Practices

1. **Choose the right chunk size**: Experiment with different sizes (500-1500 tokens). Smaller chunks for precise answers, larger ones for broader context.
2. **Use document metadata**: Add source, date, and category as metadata to chunks. This allows filtering results during retrieval.
3. **Evaluate quality**: Use frameworks like [RAGAS](https://docs.ragas.io/) to measure metrics such as *faithfulness*, *relevancy*, and *context precision*.
4. **Handle document updates**: Implement a re-ingestion pipeline to keep the vector store synchronized with your data sources.
5. **Add a re-ranker**: After initial retrieval, use a re-ranking model (like Cohere Rerank) to reorder results based on actual relevance.

## Conclusion

RAG has become the standard architecture for building AI applications that need access to specific, up-to-date knowledge. LangChain greatly simplifies the implementation, providing abstractions for every component of the pipeline.

**Next steps:**
- **Experiment locally**: Start with ChromaDB and a few documents to get familiar with the pipeline.
- **Explore LangSmith**: Use [LangSmith](https://smith.langchain.com/) to monitor and debug your chains in production.
- **Try different embedding models**: Compare models like `text-embedding-3-small`, `text-embedding-3-large`, and open-source models from Sentence Transformers.
- **Check the documentation**: The [LangChain documentation](https://python.langchain.com/docs/) is an excellent and constantly updated resource.
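
When comparing embedding models, it helps to see what the similarity search from Phase 2 actually computes. The following is a minimal, dependency-free sketch of cosine-similarity ranking, not how ChromaDB or any other vector store is implemented. The three-dimensional vectors, the sample texts, and the helper names `cosine_similarity` and `top_k` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real models produce vectors with
# hundreds or thousands of dimensions.
toy_index = [
    ("Users log in with an API token.", [0.9, 0.1, 0.0]),
    ("Tokens expire after 24 hours.", [0.7, 0.3, 0.1]),
    ("The office cafeteria opens at noon.", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.0]  # stand-in for the embedded user question

results = top_k(query, toy_index, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The two authentication-related texts rank above the unrelated one because their vectors point in nearly the same direction as the query. A production vector store does the same ranking at scale, usually with an approximate nearest-neighbor index rather than a full scan.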
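
To build intuition for `chunk_size` and `chunk_overlap` before tuning them, the sketch below uses a plain fixed-width sliding window. This is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works, since that splitter also tries to break on separators. The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the boundary context that overlap is meant to preserve. Larger overlaps protect more context but increase the number of chunks to embed and store.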