    Multi-modal RAG with Captioning

    This notebook implements one of several approaches to multi-modal RAG. It extracts text and images from PDFs, then uses a multi-modal Retrieval-Augmented Generation (RAG) system to summarize that content and retrieve it for question answering.

    The goal is to summarize complex documents efficiently, so that multi-modal content is easy to retrieve and responses stay concise.

    What you'll learn

    1. PyMuPDF: For extracting text and images from PDFs.
    2. Gemini 1.5-flash model: To summarize images and tables.
    3. Cohere Embeddings: For embedding document splits.
    4. Chroma Vectorstore: To store and retrieve document embeddings.
    5. LangChain: To orchestrate the retrieval and generation pipeline.
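
The way these components fit together can be sketched without any of the real services. The block below is a minimal, standard-library-only illustration of the captioning idea: each image is turned into a text summary, the summary is indexed next to the ordinary text chunks, and a question is matched against both. Everything in it is an assumption made for illustration, not the notebook's code: `caption_image` is a hypothetical stub standing in for the Gemini call, `embed` is a toy bag-of-words vector standing in for Cohere embeddings, and `CaptionIndex` is an in-memory list standing in for Chroma.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """One retrievable unit: a text split or an image summary."""
    content: str  # the text that is embedded and searched
    kind: str     # "text" or "image"
    page: int     # page the chunk came from


def caption_image(image_bytes: bytes, hint: str) -> str:
    """Hypothetical stub for a vision-model call.

    A real captioner would describe the pixels in image_bytes; this stub
    ignores them and echoes the hint so the example runs offline.
    """
    return f"Image summary: {hint}"


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CaptionIndex:
    """In-memory stand-in for a vector store."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, Chunk]] = []

    def add(self, chunk: Chunk) -> None:
        self._items.append((embed(chunk.content), chunk))

    def search(self, query: str, k: int = 2) -> list[Chunk]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_index(text_pages, images) -> CaptionIndex:
    """Index page text and image summaries in one shared store.

    text_pages: list of page strings (page numbers start at 1).
    images: list of (page, image_bytes, hint) tuples; hint feeds the stub.
    """
    index = CaptionIndex()
    for page_no, text in enumerate(text_pages, start=1):
        index.add(Chunk(content=text, kind="text", page=page_no))
    for page_no, image_bytes, hint in images:
        summary = caption_image(image_bytes, hint)
        index.add(Chunk(content=summary, kind="image", page=page_no))
    return index


if __name__ == "__main__":
    pages = [
        "The transformer architecture relies on self-attention layers.",
        "Training used the Adam optimizer with a warmup schedule.",
    ]
    figures = [(1, b"", "bar chart comparing BLEU scores across models")]
    for hit in build_index(pages, figures).search("Which chart compares BLEU scores?"):
        print(hit.kind, hit.page, hit.content)
```

Because images enter the index only as text summaries, a single text embedding model and a single store serve both modalities. The trade-off is that retrieval quality for images depends entirely on how well the summary describes them.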

    About this tutorial

    This hands-on Jupyter notebook is part of RAG Techniques, a free, open-source repository by Nir Diamant that covers RAG techniques with runnable code examples and detailed explanations.

    Free and open-source · Runnable Jupyter notebook · Active community support