Semantic Chunking
This code implements semantic chunking for processing and retrieving information from PDF documents. The approach was first proposed by Greg Kamradt and subsequently implemented in LangChain. Unlike traditional methods that split text at fixed character or word counts, semantic chunking aims to produce more meaningful, context-aware text segments.
Traditional text splitting methods often break documents at arbitrary points, potentially disrupting the flow of information and context. Semantic chunking addresses this issue by attempting to split text at more natural breakpoints, preserving semantic coherence within each chunk.
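To make the idea of "natural breakpoints" concrete, here is a minimal sketch of one common way to find them: embed each sentence, measure the distance between neighbouring sentences, and start a new chunk wherever that distance is unusually large. The code below is an illustration under stated assumptions, not the notebook's or LangChain's implementation. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `semantic_chunks`, `cosine_distance`, and the nearest-rank percentile threshold are hypothetical names and simplifications chosen for this example.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector. ASSUMPTION: stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(text, percentile=90):
    """Split text into chunks at the largest jumps between adjacent sentences."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences

    vectors = [embed(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]

    # Nearest-rank percentile of the distances (a simplification for this sketch).
    ranked = sorted(distances)
    threshold = ranked[min(len(ranked) - 1, int(len(ranked) * percentile / 100))]

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance >= threshold:
            # Large semantic jump: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sample_text = (
    "Cats are small pets. Cats like to sleep. Pets like cats need food. "
    "Stocks rose on the market today. The market stocks fell later."
)
chunks = semantic_chunks(sample_text)
for chunk in chunks:
    print(repr(chunk))
```

In this toy input, the only pair of neighbouring sentences with no shared vocabulary is the switch from pets to stocks, so that is where the split lands. A fixed-size splitter would cut wherever the character budget ran out, regardless of topic.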
What you'll learn
- PDF processing and text extraction
- Semantic chunking using LangChain's SemanticChunker
- Vector store creation using FAISS and OpenAI embeddings
- Retriever setup for querying the processed documents
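The last two steps in the list above, building a vector store and querying it through a retriever, can be sketched without any external services. The notebook itself uses FAISS and OpenAI embeddings. The example below swaps both for toy substitutes so it runs with the standard library alone: `embed` is a bag-of-words stand-in for an embedding model, and `ToyVectorStore` with its `retrieve` method is a hypothetical class written for this illustration, not part of FAISS or LangChain.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. ASSUMPTION: stands in for OpenAI embeddings."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store. ASSUMPTION: stands in for a FAISS index."""

    def __init__(self, chunks):
        # Embed every chunk once at indexing time.
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, k=2):
        """Return the k chunks most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


document_chunks = [
    "Cats are small pets that like to sleep.",
    "Stocks rose on the market today.",
    "Rain is expected over the weekend.",
]
store = ToyVectorStore(document_chunks)
results = store.retrieve("What happened to stocks on the market?", k=1)
print(results)
```

The design point carries over to the real pipeline: chunks are embedded once when the store is built, and each query is embedded and compared against those stored vectors. Because each chunk gets a single vector, chunks that stay on one topic are easier to match to a query than chunks cut at arbitrary character counts.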
About this tutorial
This hands-on Jupyter notebook is part of RAG Techniques, a free open-source repository by Nir Diamant covering RAG techniques with runnable code examples and detailed explanations.
