
Build a Vision RAG App with Gemini 2.5 Flash

A fully functional multimodal RAG app with vision, built with step-by-step instructions (100% open source)

Charts, diagrams, and visual data in PDFs remain a massive blind spot for most RAG systems. While text-based RAG has become relatively straightforward to implement, extracting meaningful insights from visual elements requires specialized approaches that many developers struggle to implement efficiently. The standard workaround of OCR followed by text embedding loses crucial context and fails completely with complex visual elements.

In this tutorial, we'll build a cutting-edge Vision RAG system that uses Cohere's Embed-4 model to create unified vector representations that capture both visual and textual elements. Then, we'll use Google's Gemini 2.5 Flash to analyze these retrievals and generate comprehensive answers by fully understanding the visual context.

What makes Cohere's Embed-4 truly game-changing is its ability to generate high-quality embeddings of complex mixed-modality documents within a unified vector space. This allows for precise retrieval across image and PDF content while preserving the semantic connections between visual elements and text. Gemini complements this perfectly by leveraging its advanced multimodal capabilities to interpret these visuals in context—whether they're financial graphs, technical diagrams, or data-heavy tables.

Don’t forget to share this tutorial on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to support us!

What We’re Building

This Streamlit application implements a vision-aware RAG system that can analyze charts, infographics, and PDF documents.

Features:

  • Multimodal Search: Uses Cohere Embed-4 to find the most semantically relevant image for a given text question

  • Visual Question Answering: Employs Google Gemini 2.5 Flash to analyze retrieved images and generate accurate answers

  • No OCR Required: Directly processes complex images and visual elements within PDF pages without needing separate text extraction steps

  • Multiple Content Sources: Handles sample financial charts, custom uploaded images, and PDF documents

  • Interactive UI: Clean Streamlit interface for uploading content and asking questions

  • Session Management: Remembers loaded/uploaded content (images and processed PDF pages) within a session.

How The App Works

The Vision RAG system operates through a two-stage process that leverages each model's strengths:

  1. Retrieval Stage with Cohere Embed-4:

    • When you load or upload images/PDFs, Cohere's Embed-4 model processes each visual item

    • The model generates unified vector embeddings that capture both the visual elements and any textual components

    • Unlike traditional approaches, these embeddings preserve the semantic relationship between visuals and text in a single vector space

    • When you ask a question, Embed-4 processes your query in the same vector space

    • The system identifies the most contextually relevant image through vector similarity, whether it's a chart, infographic, or PDF page.

  2. Generation Stage with Gemini 2.5 Flash:

    • Your question and the retrieved image are sent to Gemini 2.5 Flash

    • Gemini's multimodal understanding allows it to analyze complex visual elements—charts, tables, graphs—directly

    • The model can interpret numerical data, identify trends, and understand contextual relationships within the image

    • It generates comprehensive answers that combine information from visual elements with relevant context

    • All this happens without any OCR preprocessing or separate vision-to-text conversion steps.

Prerequisites

Before we begin, make sure you have the following:

  1. Python installed on your machine (version 3.10 or higher is recommended)

  2. Your Cohere and Gemini API keys

  3. A code editor of your choice (we recommend VS Code or PyCharm for their excellent Python support)

  4. Basic familiarity with Python programming

Code Walkthrough

Setting Up the Environment

First, let's get our development environment ready:

  1. Clone the GitHub repository:

git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
  2. Go to the vision_rag folder and install the dependencies:

cd rag_tutorials/vision_rag
pip install -r requirements.txt
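
The requirements.txt in the repo pins everything you need. Based on the imports used below, the key packages are roughly the following (this is an approximation; install from the repo's file for the exact list and versions):

streamlit
cohere
google-genai
pillow
pymupdf
numpy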

Creating the Streamlit App

Let’s create our app. Create a new file vision_rag.py and add the following code:

  1. Import necessary libraries, set up the APIs and configuration:

# Standard-library helpers used by the utilities below
import base64
import io
import os

import streamlit as st
import cohere
from google import genai
import PIL
from PIL import Image
import numpy as np
import fitz  # PyMuPDF

st.set_page_config(layout="wide", page_title="Vision RAG with Cohere Embed-4")
st.title("Vision RAG with Cohere Embed-4 🖼️")

# API Key input section
with st.sidebar:
    st.header("🔑 API Keys")
    cohere_api_key = st.text_input("Cohere API Key", type="password", key="cohere_key")
    google_api_key = st.text_input("Google API Key (Gemini)", type="password", key="google_key")
    
# Initialize API clients
co = None
genai_client = None
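
# Optional tweak (not in the original app): fall back to environment variables
# COHERE_API_KEY / GOOGLE_API_KEY if the sidebar fields are left empty.
cohere_api_key = cohere_api_key or os.getenv("COHERE_API_KEY", "")
google_api_key = google_api_key or os.getenv("GOOGLE_API_KEY", "")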

if cohere_api_key and google_api_key:
    try:
        co = cohere.ClientV2(api_key=cohere_api_key)
        genai_client = genai.Client(api_key=google_api_key)
    except Exception as e:
        st.sidebar.error(f"Initialization Failed: {e}")
  2. Image Processing Utilities:

# Resize large images to fit model constraints
def resize_image(pil_image: PIL.Image.Image) -> None:
    org_width, org_height = pil_image.size
    max_pixels = 1568*1568  # Max resolution 
    
    if org_width * org_height > max_pixels:
        scale_factor = (max_pixels / (org_width * org_height)) ** 0.5
        new_width = int(org_width * scale_factor)
        new_height = int(org_height * scale_factor)
        pil_image.thumbnail((new_width, new_height))

# Convert images to base64 for API compatibility        
def base64_from_image(img_path: str) -> str:
    pil_image = PIL.Image.open(img_path)
    img_format = pil_image.format if pil_image.format else "PNG"
    resize_image(pil_image)
    
    with io.BytesIO() as img_buffer:
        pil_image.save(img_buffer, format=img_format)
        img_buffer.seek(0)
        img_data = f"data:image/{img_format.lower()};base64,"+base64.b64encode(img_buffer.read()).decode("utf-8")
    
    return img_data
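
The PDF-processing step further below calls a pil_to_base64 helper for images that are already in memory, which isn't shown in the snippets above. Here is a minimal version you can add next to base64_from_image; it's an assumption about how the repo implements it, reusing the same resize-and-encode logic:

# Convert an in-memory PIL image to a base64 data URI (used by the PDF step below)
def pil_to_base64(pil_image: PIL.Image.Image) -> str:
    img_format = pil_image.format if pil_image.format else "PNG"
    resize_image(pil_image)
    
    with io.BytesIO() as img_buffer:
        pil_image.save(img_buffer, format=img_format)
        img_buffer.seek(0)
        return f"data:image/{img_format.lower()};base64," + base64.b64encode(img_buffer.read()).decode("utf-8")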
  3. Embedding Generation with Cohere:

@st.cache_data(ttl=3600, show_spinner=False)
def compute_image_embedding(base64_img: str, _cohere_client) -> np.ndarray | None:
    try:
        api_response = _cohere_client.embed(
            model="embed-v4.0",
            input_type="search_document",
            embedding_types=["float"],
            images=[base64_img],
        )
        
        if api_response.embeddings and api_response.embeddings.float:
            return np.asarray(api_response.embeddings.float[0])
        else:
            st.warning("Could not get embedding. API response might be empty.")
            return None
    except Exception as e:
        st.error(f"Error computing embedding: {e}")
        return None
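
Before wiring up the full flow, a quick sanity check is to embed a single image and inspect the resulting vector. The file path here is a placeholder, not part of the repo:

# Quick sanity check with a placeholder path: embed one image and inspect the vector
if co:
    sample_emb = compute_image_embedding(base64_from_image("samples/sample_chart.png"), _cohere_client=co)
    if sample_emb is not None:
        st.sidebar.write(f"Sample embedding shape: {sample_emb.shape}")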
  4. PDF Processing:

def process_pdf_file(pdf_file, cohere_client, base_output_folder="pdf_pages") -> tuple[list[str], list[np.ndarray] | None]:
    page_image_paths = []
    page_embeddings = []
    pdf_filename = pdf_file.name
    output_folder = os.path.join(base_output_folder, os.path.splitext(pdf_filename)[0])
    os.makedirs(output_folder, exist_ok=True)
    
    try:
        # Open PDF from stream
        doc = fitz.open(stream=pdf_file.read(), filetype="pdf")
        pdf_progress = st.progress(0.0)
        
        for i, page in enumerate(doc.pages()):
            page_num = i + 1
            page_img_path = os.path.join(output_folder, f"page_{page_num}.png")
            page_image_paths.append(page_img_path)
            
            # Render page to image
            pix = page.get_pixmap(dpi=150)
            pil_image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            pil_image.save(page_img_path, "PNG")
            
            # Generate embedding for page image
            base64_img = pil_to_base64(pil_image)
            emb = compute_image_embedding(base64_img, _cohere_client=cohere_client)
            
            page_embeddings.append(emb)  # may be None if embedding failed
            
            # Update progress
            pdf_progress.progress((i + 1) / len(doc))
            
        # Keep only pages whose embedding succeeded, preserving path/embedding alignment
        valid_paths = [path for path, emb in zip(page_image_paths, page_embeddings) if emb is not None]
        valid_embeddings = [emb for emb in page_embeddings if emb is not None]
        
        return valid_paths, valid_embeddings
        
    except Exception as e:
        st.error(f"Error processing PDF {pdf_filename}: {e}")
        return [], None
  5. Search Function:

def search(question: str, co_client: cohere.ClientV2, embeddings: np.ndarray, image_paths: list[str]) -> str | None:
    try:
        # Compute embedding for the query
        api_response = co_client.embed(
            model="embed-v4.0",
            input_type="search_query",
            embedding_types=["float"],
            texts=[question],
        )
        
        query_emb = np.asarray(api_response.embeddings.float[0])
        
        # Similarity scores via dot product (equivalent to cosine similarity for unit-normalized embeddings)
        cos_sim_scores = np.dot(query_emb, embeddings.T)
        
        # Get the most relevant image
        top_idx = np.argmax(cos_sim_scores)
        hit_img_path = image_paths[top_idx]
        
        return hit_img_path
        
    except Exception as e:
        st.error(f"Error during search: {e}")
        return None
  6. Answer Generation with Gemini:

def answer(question: str, img_path: str, gemini_client) -> str:
    try:
        img = PIL.Image.open(img_path)
        
        prompt = [f"""Answer the question based on the following image. Be as elaborate as possible giving extra relevant information.
        Don't use markdown formatting in the response.
        Please provide enough context for your answer.
        Question: {question}""", img]
        
        response = gemini_client.models.generate_content(
            model="gemini-2.5-flash-preview-04-17",
            contents=prompt
        )
        
        llm_answer = response.text
        return llm_answer
        
    except Exception as e:
        st.error(f"Error during answer generation: {e}")
        return f"Failed to generate answer: {e}"
  7. Main RAG Flow:

# When the user asks a question:
if run_button:
    if co and genai_client and st.session_state.doc_embeddings is not None:
        with st.spinner("Finding relevant image..."):
            # Find the most relevant image
            top_image_path = search(question, co, st.session_state.doc_embeddings, st.session_state.image_paths)
            
            if top_image_path:
                # Display the retrieved image
                retrieved_image_placeholder.image(top_image_path, caption=caption, use_container_width=True)
                
                # Generate answer from the image
                with st.spinner("Generating answer..."):
                    final_answer = answer(question, top_image_path, genai_client)
                    answer_placeholder.markdown(f"**Answer:**\n{final_answer}")
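
The snippet above references names defined elsewhere in the full app: the question input, run_button, the display placeholders, and the session-state embeddings. A minimal sketch of that wiring is shown below; widget labels, the "uploaded" folder, and the caption are assumptions rather than the repo's exact code, and in the real script these definitions come before the question-handling block:

# Minimal sketch (assumed names): UI widgets used by the flow above
question = st.text_input("Ask a question about your images")
run_button = st.button("Run Vision RAG")
retrieved_image_placeholder = st.empty()
answer_placeholder = st.empty()
caption = "Most relevant image"

# Embed uploaded images and keep paths + embeddings in session state
uploaded_files = st.file_uploader("Upload images", type=["png", "jpg", "jpeg"], accept_multiple_files=True)
if uploaded_files and co:
    st.session_state.setdefault("image_paths", [])
    new_embeddings = []
    os.makedirs("uploaded", exist_ok=True)
    for f in uploaded_files:
        img_path = os.path.join("uploaded", f.name)
        with open(img_path, "wb") as out:
            out.write(f.getbuffer())
        emb = compute_image_embedding(base64_from_image(img_path), _cohere_client=co)
        if emb is not None:
            st.session_state.image_paths.append(img_path)
            new_embeddings.append(emb)
    if new_embeddings:
        st.session_state.doc_embeddings = np.vstack(new_embeddings)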

Running the App

With our code in place, it's time to launch the app.

  • In your terminal, navigate to the project folder and run the following command:

streamlit run vision_rag.py
  • Streamlit will provide a local URL (typically http://localhost:8501). Open it in your web browser, enter your API keys, upload images or PDFs rich in visual content, and you're ready to go!

Working Application Demo

Conclusion

You've just built a Vision RAG system that actually works with visual content. This opens up possibilities for analyzing financial reports, technical documentation, research papers, and basically anything where important information is shown visually rather than written out.

To enhance this application further, consider:

  1. Adding confidence scoring to ensure only highly relevant visuals are analyzed (see the sketch after this list)

  2. Creating a gallery view that shows multiple potentially relevant images sorted by similarity score rather than just the top result

  3. Implementing a region-of-interest detector that can identify and zoom in on specific chart elements (axes, legends, data points) relevant to the question

  4. Including a feedback loop system that uses user ratings on answers to fine-tune retrieval thresholds and ranking.
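
As an example of the first idea, a confidence threshold can be layered onto the existing search function so that low-similarity retrievals are rejected instead of being sent to Gemini. This is a sketch; the threshold value is an arbitrary placeholder you would tune on your own data:

# Sketch: only accept the top image if its similarity clears a tunable threshold
def search_with_threshold(question, co_client, embeddings, image_paths, min_score=0.2):
    response = co_client.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
    )
    query_emb = np.asarray(response.embeddings.float[0])
    scores = embeddings @ query_emb
    top_idx = int(np.argmax(scores))
    if scores[top_idx] < min_score:
        return None  # nothing relevant enough; let the UI say so
    return image_paths[top_idx]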

Keep experimenting with different configurations and features to build more sophisticated AI applications.

We share hands-on tutorials like this 2-3 times a week to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills, subscribe now and be the first to access our latest tutorials.

Don’t forget to share this tutorial on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to support us!
