Build a Vision RAG App with Gemini 2.5 Flash
Fully functional multimodal vision RAG app with step-by-step instructions (100% open source)
Charts, diagrams, and visual data in PDFs remain a massive blind spot for most RAG systems. While text-based RAG has become relatively straightforward to implement, extracting meaningful insights from visual elements requires specialized approaches that many developers struggle to implement efficiently. The standard workaround of OCR followed by text embedding loses crucial context and fails completely with complex visual elements.
In this tutorial, we'll build a cutting-edge Vision RAG system that uses Cohere's Embed-4 model to create unified vector representations that capture both visual and textual elements. Then, we'll use Google's Gemini 2.5 Flash to analyze these retrievals and generate comprehensive answers by fully understanding the visual context.
What makes Cohere's Embed-4 truly game-changing is its ability to generate high-quality embeddings of complex mixed-modality documents within a unified vector space. This allows for precise retrieval across image and PDF content while preserving the semantic connections between visual elements and text. Gemini complements this perfectly by leveraging its advanced multimodal capabilities to interpret these visuals in context—whether they're financial graphs, technical diagrams, or data-heavy tables.
What We’re Building
This Streamlit application implements a vision-aware RAG system that can analyze charts, infographics, and PDF documents.
Features:
Multimodal Search: Uses Cohere Embed-4 to find the most semantically relevant image for a given text question
Visual Question Answering: Employs Google Gemini 2.5 Flash to analyze retrieved images and generate accurate answers
No OCR Required: Directly processes complex images and visual elements within PDF pages without needing separate text extraction steps
Multiple Content Sources: Handles sample financial charts, custom uploaded images, and PDF documents
Interactive UI: Clean Streamlit interface for uploading content and asking questions
Session Management: Remembers loaded/uploaded content (images and processed PDF pages) within a session.
How The App Works
The Vision RAG system operates through a two-stage process that leverages each model's strengths:
Retrieval Stage with Cohere Embed-4:
When you load or upload images/PDFs, Cohere's Embed-4 model processes each visual item
The model generates unified vector embeddings that capture both the visual elements and any textual components
Unlike traditional approaches, these embeddings preserve the semantic relationship between visuals and text in a single vector space
When you ask a question, Embed-4 processes your query in the same vector space
The system identifies the most contextually relevant image through vector similarity, whether it's a chart, infographic, or PDF page.
Generation Stage with Gemini 2.5 Flash:
Your question and the retrieved image are sent to Gemini 2.5 Flash
Gemini's multimodal understanding allows it to analyze complex visual elements—charts, tables, graphs—directly
The model can interpret numerical data, identify trends, and understand contextual relationships within the image
It generates comprehensive answers that combine information from visual elements with relevant context
All this happens without any OCR preprocessing or separate vision-to-text conversion steps.
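In code terms, the whole pipeline condenses to two API calls. Here is a simplified sketch using placeholder variables (cohere_client, gemini_client, base64_images, image_paths, and question); the full Streamlit implementation follows in the walkthrough below:
import numpy as np
import PIL.Image

# Simplified sketch of the two-stage pipeline; the variables above are placeholders
# 1. Retrieval: embed documents and the query in the same space, pick the closest image
doc_embs = np.asarray([
    cohere_client.embed(model="embed-v4.0", input_type="search_document",
                        embedding_types=["float"], images=[img]).embeddings.float[0]
    for img in base64_images
])
query_emb = np.asarray(cohere_client.embed(model="embed-v4.0", input_type="search_query",
                                           embedding_types=["float"], texts=[question]).embeddings.float[0])
best = int(np.argmax(doc_embs @ query_emb))

# 2. Generation: pass the question and the retrieved image to Gemini
response = gemini_client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=[question, PIL.Image.open(image_paths[best])],
)
print(response.text)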
Prerequisites
Before we begin, make sure you have the following:
Python 3.10 or later installed
A Cohere API key (for the Embed-4 model)
A Google API key for Gemini (from Google AI Studio)
Code Walkthrough
Setting Up the Environment
First, let's get our development environment ready:
Clone the GitHub repository:
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
Go to the vision_rag folder:
cd awesome-llm-apps/rag_tutorials/vision_rag
Install the required dependencies:
pip install -r requirements.txt
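The exact pins live in the repo's requirements.txt; judging by the imports used in this tutorial, it should cover roughly these packages:
streamlit
cohere
google-genai
pillow
PyMuPDF
numpy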
API Keys: Get your Gemini API key from Google AI Studio and your Cohere API key from the Cohere dashboard.
Creating the Streamlit App
Let’s create our app. Create a new file vision_rag.py and add the following code:
Import necessary libraries, set up the APIs and configuration:
import os
import io
import base64

import streamlit as st
import cohere
from google import genai
import PIL
import PIL.Image
import numpy as np
import fitz  # PyMuPDF
st.set_page_config(layout="wide", page_title="Vision RAG with Cohere Embed-4")
st.title("Vision RAG with Cohere Embed-4 🖼️")
# API Key input section
with st.sidebar:
    st.header("🔑 API Keys")
    cohere_api_key = st.text_input("Cohere API Key", type="password", key="cohere_key")
    google_api_key = st.text_input("Google API Key (Gemini)", type="password", key="google_key")

# Initialize API clients
co = None
genai_client = None
if cohere_api_key and google_api_key:
    try:
        co = cohere.ClientV2(api_key=cohere_api_key)
        genai_client = genai.Client(api_key=google_api_key)
    except Exception as e:
        st.sidebar.error(f"Initialization Failed: {e}")
Image Processing Utilities:
# Resize large images in place to fit model constraints
def resize_image(pil_image: PIL.Image.Image) -> None:
    org_width, org_height = pil_image.size
    max_pixels = 1568 * 1568  # Max resolution
    if org_width * org_height > max_pixels:
        scale_factor = (max_pixels / (org_width * org_height)) ** 0.5
        new_width = int(org_width * scale_factor)
        new_height = int(org_height * scale_factor)
        pil_image.thumbnail((new_width, new_height))
# Convert an image file to a base64 data URL for API compatibility
def base64_from_image(img_path: str) -> str:
    pil_image = PIL.Image.open(img_path)
    img_format = pil_image.format if pil_image.format else "PNG"
    resize_image(pil_image)
    with io.BytesIO() as img_buffer:
        pil_image.save(img_buffer, format=img_format)
        img_buffer.seek(0)
        img_data = f"data:image/{img_format.lower()};base64," + base64.b64encode(img_buffer.read()).decode("utf-8")
    return img_data
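The PDF step further down calls a small pil_to_base64 helper that does the same conversion but starts from an in-memory PIL image rather than a file path. It isn't shown in the original snippet, so here is a minimal version consistent with base64_from_image:
# Helper used by the PDF processing step: same idea as base64_from_image,
# but starts from an in-memory PIL image instead of a file on disk
def pil_to_base64(pil_image: PIL.Image.Image) -> str:
    img_format = pil_image.format if pil_image.format else "PNG"
    resize_image(pil_image)
    with io.BytesIO() as img_buffer:
        pil_image.save(img_buffer, format=img_format)
        img_buffer.seek(0)
        return f"data:image/{img_format.lower()};base64," + base64.b64encode(img_buffer.read()).decode("utf-8")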
Embedding Generation with Cohere:
@st.cache_data(ttl=3600, show_spinner=False)
def compute_image_embedding(base64_img: str, _cohere_client) -> np.ndarray | None:
    try:
        api_response = _cohere_client.embed(
            model="embed-v4.0",
            input_type="search_document",
            embedding_types=["float"],
            images=[base64_img],
        )
        if api_response.embeddings and api_response.embeddings.float:
            return np.asarray(api_response.embeddings.float[0])
        else:
            st.warning("Could not get embedding. API response might be empty.")
            return None
    except Exception as e:
        st.error(f"Error computing embedding: {e}")
        return None
PDF Processing:
def process_pdf_file(pdf_file, cohere_client, base_output_folder="pdf_pages") -> tuple[list[str], list[np.ndarray] | None]:
    page_image_paths = []
    page_embeddings = []
    pdf_filename = pdf_file.name
    output_folder = os.path.join(base_output_folder, os.path.splitext(pdf_filename)[0])
    os.makedirs(output_folder, exist_ok=True)
    try:
        # Open PDF from stream
        doc = fitz.open(stream=pdf_file.read(), filetype="pdf")
        pdf_progress = st.progress(0.0)
        for i, page in enumerate(doc.pages()):
            page_num = i + 1
            page_img_path = os.path.join(output_folder, f"page_{page_num}.png")
            page_image_paths.append(page_img_path)
            # Render page to image
            pix = page.get_pixmap(dpi=150)
            pil_image = PIL.Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            pil_image.save(page_img_path, "PNG")
            # Generate embedding for page image
            base64_img = pil_to_base64(pil_image)
            emb = compute_image_embedding(base64_img, _cohere_client=cohere_client)
            page_embeddings.append(emb)  # may be None if embedding failed
            # Update progress
            pdf_progress.progress((i + 1) / len(doc))
        # Keep only pages whose embedding succeeded, so paths and embeddings stay aligned
        valid_paths = [path for path, emb in zip(page_image_paths, page_embeddings) if emb is not None]
        valid_embeddings = [emb for emb in page_embeddings if emb is not None]
        return valid_paths, valid_embeddings
    except Exception as e:
        st.error(f"Error processing PDF {pdf_filename}: {e}")
        return [], None
Search Function:
def search(question: str, co_client: cohere.ClientV2, embeddings: np.ndarray, image_paths: list[str]) -> str | None:
    try:
        # Compute embedding for the query
        api_response = co_client.embed(
            model="embed-v4.0",
            input_type="search_query",
            embedding_types=["float"],
            texts=[question],
        )
        query_emb = np.asarray(api_response.embeddings.float[0])
        # Compute cosine similarities
        cos_sim_scores = np.dot(query_emb, embeddings.T)
        # Get the most relevant image
        top_idx = np.argmax(cos_sim_scores)
        hit_img_path = image_paths[top_idx]
        return hit_img_path
    except Exception as e:
        st.error(f"Error during search: {e}")
        return None
Answer Generation with Gemini:
def answer(question: str, img_path: str, gemini_client) -> str:
    try:
        img = PIL.Image.open(img_path)
        prompt = [f"""Answer the question based on the following image. Be as elaborate as possible giving extra relevant information.
Don't use markdown formatting in the response.
Please provide enough context for your answer.
Question: {question}""", img]
        response = gemini_client.models.generate_content(
            model="gemini-2.5-flash-preview-04-17",
            contents=prompt
        )
        llm_answer = response.text
        return llm_answer
    except Exception as e:
        st.error(f"Error during answer generation: {e}")
        return f"Failed to generate answer: {e}"
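The main flow below references st.session_state.doc_embeddings, st.session_state.image_paths, and a few UI elements (question, run_button, caption, and the two placeholders) that get set up when content is loaded. That wiring isn't shown in full here; the sketch below is a minimal, assumed version of it (a single uploader, no sample images or deduplication, unlike the actual app in the repo) so the names used next are defined:
# Minimal sketch (not the repo's exact UI code) of how content gets loaded and embedded
if "image_paths" not in st.session_state:
    st.session_state.image_paths = []
    st.session_state.doc_embeddings = None

uploaded_files = st.file_uploader("Upload images or PDFs", type=["png", "jpg", "jpeg", "pdf"],
                                  accept_multiple_files=True)
if uploaded_files and co:
    new_paths, new_embs = [], []
    for f in uploaded_files:
        if f.name.lower().endswith(".pdf"):
            # Render each PDF page and embed it
            paths, embs = process_pdf_file(f, co)
            if embs:
                new_paths.extend(paths)
                new_embs.extend(embs)
        else:
            # Save the uploaded image to disk, then embed it
            os.makedirs("uploaded_images", exist_ok=True)
            img_path = os.path.join("uploaded_images", f.name)
            with open(img_path, "wb") as out:
                out.write(f.getbuffer())
            emb = compute_image_embedding(base64_from_image(img_path), _cohere_client=co)
            if emb is not None:
                new_paths.append(img_path)
                new_embs.append(emb)
    if new_embs:
        # Stack embeddings into one matrix so search() can score them all at once
        st.session_state.image_paths.extend(new_paths)
        stacked = np.vstack(new_embs)
        existing = st.session_state.doc_embeddings
        st.session_state.doc_embeddings = stacked if existing is None else np.vstack([existing, stacked])

# UI elements referenced in the main flow below
question = st.text_input("Ask a question about the loaded content")
run_button = st.button("Run Vision RAG")
retrieved_image_placeholder = st.empty()
answer_placeholder = st.empty()
caption = "Most relevant image for your question"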
Main RAG Flow:
# When the user asks a question:
if run_button:
    if co and genai_client and st.session_state.doc_embeddings is not None:
        with st.spinner("Finding relevant image..."):
            # Find the most relevant image
            top_image_path = search(question, co, st.session_state.doc_embeddings, st.session_state.image_paths)
        if top_image_path:
            # Display the retrieved image
            retrieved_image_placeholder.image(top_image_path, caption=caption, use_container_width=True)
            # Generate answer from the image
            with st.spinner("Generating answer..."):
                final_answer = answer(question, top_image_path, genai_client)
            answer_placeholder.markdown(f"**Answer:**\n{final_answer}")
Running the App
With our code in place, it's time to launch the app.
In your terminal, navigate to the project folder, and run the following command
streamlit run vision_rag.py
Streamlit will provide a local URL (typically http://localhost:8501). Open it in your web browser, enter your API keys, upload images or PDFs rich in visual content, and it’s ready!
Working Application Demo
Conclusion
You've just built a Vision RAG system that actually works with visual content. This opens up possibilities for analyzing financial reports, technical documentation, research papers - basically anything where important information is shown visually rather than just written out.
To enhance this application further, consider:
Adding confidence scoring to ensure only highly relevant visuals are analyzed (a rough sketch combining this with top-k retrieval follows this list)
Creating a gallery view that shows multiple potentially relevant images sorted by similarity score rather than just the top result
Implementing a region-of-interest detector that can identify and zoom in on specific chart elements (axes, legends, data points) relevant to the question
Including a feedback loop system that uses user ratings on answers to fine-tune retrieval thresholds and ranking.
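To give a sense of the first two ideas, here is a hypothetical extension of the search() function above that returns the top-k hits above a similarity threshold instead of only the single best match; the function name, k, and min_score values are illustrative, not part of the original app:
# Hypothetical extension of search(): top-k retrieval with a confidence threshold
def search_top_k(question: str, co_client: cohere.ClientV2, embeddings: np.ndarray,
                 image_paths: list[str], k: int = 3, min_score: float = 0.2) -> list[tuple[str, float]]:
    api_response = co_client.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
    )
    query_emb = np.asarray(api_response.embeddings.float[0])
    scores = np.dot(query_emb, embeddings.T)
    ranked = np.argsort(scores)[::-1][:k]
    # Keep only hits that clear the confidence threshold
    return [(image_paths[i], float(scores[i])) for i in ranked if scores[i] >= min_score]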
Keep experimenting with different configurations and features to build more sophisticated AI applications.
We share hands-on tutorials like this 2-3 times a week to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills, subscribe now and be the first to access our latest tutorials.