RAG and LangChain: A Complete Guide to Retrieval-Augmented Generation

Large Language Models (LLMs) like GPT-4 and Claude are extraordinarily powerful, but they suffer from a fundamental limitation: their knowledge is frozen at the time of training. They cannot access your internal documents, your database, or real-time information. **Retrieval-Augmented Generation (RAG)** solves exactly this problem by combining the generative power of LLMs with the ability to retrieve information from external sources.

## The Problem: LLM Limitations

Before talking about RAG, it's important to understand why we need it.

1.  **Static knowledge**: An LLM only knows what it saw during training. If you ask about an event that occurred after its cutoff, it cannot answer.
2.  **Hallucinations**: When an LLM doesn't know the answer, it tends to fabricate one, generating plausible but completely false information.
3.  **No access to private data**: A generic LLM has no access to your company's internal documentation, tickets, or codebase.

RAG addresses all three of these problems by providing the model with **relevant context** retrieved from external sources at query time.

## What is RAG?

Retrieval-Augmented Generation is an architecture that enriches the prompt sent to an LLM with information retrieved from an external knowledge base. Instead of relying solely on the model's parametric knowledge, RAG **searches** for relevant information first and then **injects** it into the prompt, enabling the model to generate accurate, grounded responses.

```mermaid
graph LR
    User["User"] -- "Question" --> Retriever
    Retriever -- "Search relevant\ndocuments" --> VectorStore["Vector Store"]
    VectorStore -- "Relevant\ndocuments" --> Retriever
    Retriever -- "Context + Question" --> LLM
    LLM -- "Grounded\nresponse" --> User
```

## How RAG Works in Detail

The RAG architecture consists of two main phases: **Indexing** (offline) and **Retrieval + Generation** (online).

### Phase 1: Indexing (Document Ingestion)

The indexing phase prepares your documents for semantic search. It consists of four steps.

```mermaid
graph TD
    A["Documents\n(PDF, HTML, MD, DB)"] --> B["Document Loader"]
    B --> C["Text Splitter"]
    C --> D["Text Chunks"]
    D --> E["Embedding Model"]
    E --> F["Numerical Vectors"]
    F --> G["Vector Store\n(ChromaDB, Pinecone, FAISS)"]
```

#### 1. Document Loading

Documents can come from any source: PDF files, web pages, databases, Markdown files, APIs. The **Document Loader** reads these documents and converts them into structured text.

#### 2. Text Splitting (Chunking)

LLMs have a limited context window, and documents can be very long. The **Text Splitter** divides documents into smaller fragments called *chunks*. The quality of chunking is critical: chunks that are too small lose context, while chunks that are too large dilute relevance.

The most common strategies are:
-   **Recursive Character Splitting**: Recursively splits text using separators like `\n\n`, `\n`, `. `, respecting the document structure.
-   **Semantic Splitting**: Uses embeddings to find natural breakpoints in the text.
-   **Chunk Overlap**: Includes overlap between consecutive chunks to preserve context at boundaries.
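
To make the overlap idea concrete, here is a minimal sketch of a fixed-size character splitter with overlap. It is deliberately simpler than a recursive splitter (it ignores separators entirely), and the helper name `split_with_overlap` and the sample text are assumptions made up for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.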

#### 3. Embedding

Each chunk is transformed into a **numerical vector** (embedding) via an embedding model (like OpenAI's `text-embedding-3-small`). These vectors capture the semantic meaning of the text: sentences with similar meanings will have vectors that are close in multidimensional space.
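
What "close in multidimensional space" means can be illustrated with cosine similarity. The sketch below uses hand-made three-dimensional vectors as stand-ins for real embeddings (which typically have hundreds or thousands of dimensions); the vector values and variable names are assumptions chosen only for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: the first two texts are about the same topic
password_reset = [0.9, 0.1, 0.0]
forgot_login = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.1, 0.9]

related = cosine_similarity(password_reset, forgot_login)
unrelated = cosine_similarity(password_reset, pizza_recipe)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

The two related vectors score close to 1, while the unrelated pair scores close to 0, which is the property similarity search relies on.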

#### 4. Vector Store

The vectors are saved in a **Vector Store** (or vector database), such as ChromaDB, Pinecone, Weaviate, or FAISS. This database is optimized for **similarity search**: given a query, it finds the most similar vectors (and therefore the most relevant text chunks).

### Phase 2: Retrieval + Generation

When the user asks a question:

1.  The question is transformed into an embedding using the same embedding model.
2.  The Vector Store finds the most similar chunks via **similarity search** (typically cosine similarity or Euclidean distance).
3.  The retrieved chunks are inserted into the prompt as context.
4.  The LLM generates a response based on the provided context.
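
The first three steps above can be sketched end to end without any external service. In this toy version a bag-of-words count vector stands in for a real embedding model and a plain Python list stands in for the vector store; the sample chunks and the helper names (`embed`, `retrieve`, `build_prompt`) are assumptions for illustration only. The final LLM call is left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


chunks = [
    "Users authenticate with an API token sent in the Authorization header.",
    "Invoices are generated on day one of each month.",
    "API tokens expire after 30 days and users must rotate them.",
]
# Indexing phase: store each chunk next to its vector
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question: str, k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the question, then return the k most similar chunks."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, k: int = 2) -> str:
    """Step 3: insert the retrieved chunks into the prompt as context."""
    context = "\n\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


print(build_prompt("How do users authenticate with the API?"))
```

The resulting string is what would be sent to the LLM in step 4. Only the two chunks about API tokens make it into the context; the invoice chunk is left out because it shares no words with the question.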
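
## Building a RAG Pipeline with LangChain

**LangChain** is one of the most widely used Python (and JavaScript) frameworks for building LLM-powered applications. It provides high-level abstractions for every component of the RAG pipeline.

### Installation

Besides the core packages, the examples below need `pypdf` (for `PyPDFLoader`), `beautifulsoup4` (for `WebBaseLoader`), and `rank_bm25` (for `BM25Retriever`). The OpenAI classes read your API key from the `OPENAI_API_KEY` environment variable.

```bash
pip install langchain langchain-openai langchain-community chromadb pypdf beautifulsoup4 rank_bm25
```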

### Step 1: Load Documents

LangChain provides dozens of Document Loaders for different data sources.

```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    DirectoryLoader,
    TextLoader,
)

# Load a PDF
pdf_loader = PyPDFLoader("docs/manual.pdf")
pdf_docs = pdf_loader.load()

# Load a web page
web_loader = WebBaseLoader("https://docs.example.com/guide")
web_docs = web_loader.load()

# Load all .md files from a directory
dir_loader = DirectoryLoader("./knowledge_base", glob="**/*.md", loader_cls=TextLoader)
md_docs = dir_loader.load()

all_docs = pdf_docs + web_docs + md_docs
```

### Step 2: Split Documents into Chunks

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = text_splitter.split_documents(all_docs)
print(f"Original documents: {len(all_docs)}, Chunks: {len(chunks)}")
```

The `chunk_overlap` parameter is crucial: it creates overlap between consecutive chunks so that context is not lost at boundaries. Both `chunk_size` and `chunk_overlap` are counted in characters here, not tokens, because the splitter uses Python's `len` as its default length function.

### Step 3: Create Embeddings and Vector Store

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed every chunk and store the vectors in a local Chroma database
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)
```

### Step 4: Create the Retriever

The retriever is the component that, given a query, fetches the most relevant chunks from the vector store.

```python
# Return the 4 most similar chunks for each query
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

# Quick sanity check: print the start of each retrieved chunk
relevant_docs = retriever.invoke("How does authentication work?")
for doc in relevant_docs:
    print(doc.page_content[:200])
    print("---")
```

### Step 5: Build the RAG Chain

Now let's put everything together with an LLM and a prompt template.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the context does not contain enough information, say you don't know.

Context:
{context}

Question: {question}

Answer:
""")

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# The dict feeds the input question to both branches: the retriever output is
# formatted into {context}, and RunnablePassthrough forwards the question as-is.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke("How does authentication work in the system?")
print(response)
```

## Advanced RAG Techniques

The basic pipeline works well, but there are several techniques to significantly improve response quality.

### Multi-Query Retrieval

Sometimes the user's query is ambiguous or not aligned with the language used in the documents. The **Multi-Query Retriever** automatically generates variants of the original question to capture multiple perspectives.

```python
from langchain.retrievers import MultiQueryRetriever

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)

docs = multi_retriever.invoke("What are the security best practices?")
```
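
Once each query variant has been run against the retriever, the per-variant results have to be merged into one list. A simple way to picture that step is a de-duplicated union, sketched below in plain Python. This is an illustration of the idea rather than LangChain's implementation; the document IDs and the `unique_union` helper are assumptions made up for the example.

```python
def unique_union(result_lists: list[list[str]]) -> list[str]:
    """Merge several result lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical results for three LLM-generated variants of one question
results_per_variant = [
    ["doc-auth", "doc-tls"],
    ["doc-tls", "doc-secrets"],
    ["doc-auth", "doc-audit"],
]
merged = unique_union(results_per_variant)
print(merged)
```

The merged list is longer than any single result list, which is the point: variants phrased differently surface documents the original wording would have missed.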

### Contextual Compression

Not all content in a chunk is relevant to the query. The **Contextual Compression Retriever** uses an LLM to extract only the pertinent parts from each retrieved chunk.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
```
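
To see what "extracting only the pertinent parts" amounts to, the sketch below replaces the LLM with a crude keyword rule: it keeps only the sentences that share a meaningful word with the query. A real compressor is far more capable; the `compress` helper, the stopword list, and the sample chunk are all assumptions for illustration.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "how", "do", "does", "i", "to", "of", "in"}


def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress(chunk: str, query: str) -> str:
    """Keep only sentences that share at least one non-stopword with the query."""
    query_terms = words(query) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if query_terms & words(s)]
    return " ".join(kept)


chunk = (
    "The service was first released in 2019. "
    "Passwords are hashed with bcrypt before storage. "
    "The office is closed on public holidays."
)
compressed = compress(chunk, "How are passwords stored?")
print(compressed)
```

Only the sentence about password hashing survives, so the prompt carries less irrelevant text into the generation step.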

### Hybrid Search

Purely semantic search is not always optimal. **Hybrid Search** combines semantic search (embeddings) with lexical search (BM25, keyword matching) to achieve better results.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25Retriever requires the rank_bm25 package
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Weights follow the order of `retrievers`: 0.4 for BM25, 0.6 for semantic
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],
)
```
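
One common way to merge a lexical ranking and a semantic ranking into a single list is weighted Reciprocal Rank Fusion (RRF): each document earns `weight / (c + rank)` from every list it appears in, and the totals decide the final order. The sketch below is a plain-Python illustration, not LangChain's code; the document IDs, the `weighted_rrf` helper, and the constant `c=60` are assumptions chosen for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes weight / (c + rank) per document."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from a keyword retriever and an embedding retriever
bm25_ranking = ["doc-a", "doc-b", "doc-c"]
semantic_ranking = ["doc-b", "doc-d", "doc-a"]

fused = weighted_rrf([bm25_ranking, semantic_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents that appear in both lists (`doc-a`, `doc-b`) rise to the top, and the higher semantic weight puts `doc-d` ahead of `doc-c` even though each appears in only one list.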

### Conversational RAG (with Memory)

To build a RAG chatbot that remembers the conversation context, you need a step that rewrites the user's latest question into a standalone question, taking the conversation history into account, before it is sent to the retriever.

```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Given the chat history and the user's latest question, "
               "reformulate the question so it is understandable without the history."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)
```

## Best Practices

1.  **Choose the right chunk size**: Experiment with different sizes (for example, 500-1500 tokens). Smaller chunks favor precise answers; larger ones preserve broader context. Keep in mind that `RecursiveCharacterTextSplitter` measures `chunk_size` in characters by default, not tokens.
2.  **Use document metadata**: Add source, date, and category as metadata to chunks. This allows filtering results during retrieval.
3.  **Evaluate quality**: Use frameworks like [RAGAS](https://docs.ragas.io/) to measure metrics such as *faithfulness*, *relevancy*, and *context precision*.
4.  **Handle document updates**: Implement a re-ingestion pipeline to keep the vector store synchronized with your data sources.
5.  **Add a re-ranker**: After initial retrieval, use a re-ranking model (like Cohere Rerank) to reorder results based on actual relevance.
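
The metadata practice is easy to picture with a small example. The sketch below attaches a metadata dictionary to each chunk and filters on it before any similarity search would run; the `Chunk` class, the `filter_by_metadata` helper, and the sample data are assumptions for illustration, not a specific vector store's API.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep chunks whose metadata matches every required key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]


corpus = [
    Chunk("Rotate API keys every 90 days.", {"source": "security.md", "category": "security"}),
    Chunk("Invoices are sent monthly.", {"source": "billing.md", "category": "billing"}),
    Chunk("Enable two-factor authentication.", {"source": "security.md", "category": "security"}),
]

security_only = filter_by_metadata(corpus, category="security")
for chunk in security_only:
    print(chunk.metadata["source"], "->", chunk.text)
```

Narrowing the candidate set this way keeps off-topic chunks out of the context even when their wording happens to resemble the query.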

## Conclusion

RAG has become the standard architecture for building AI applications that need access to specific, up-to-date knowledge. LangChain greatly simplifies the implementation, providing abstractions for every component of the pipeline.

**Next steps:**
- **Experiment locally**: Start with ChromaDB and a few documents to get familiar with the pipeline.
- **Explore LangSmith**: Use [LangSmith](https://smith.langchain.com/) to monitor and debug your chains in production.
- **Try different embedding models**: Compare models like `text-embedding-3-small`, `text-embedding-3-large`, and open-source models from Sentence Transformers.
- **Check the documentation**: The [LangChain documentation](https://python.langchain.com/docs/) is an excellent and constantly updated resource.
