Build a Multimodal AI Agent with Gemini 2.0
Fully functional multimodal AI agent using Gemini 2.0 Flash and Phidata (step-by-step instructions)
While analyzing videos or searching the web individually is powerful, combining these capabilities opens up entirely new possibilities for AI applications.
In this tutorial, we'll build a Multimodal AI Agent using Google's Gemini 2.0 Flash model that can simultaneously analyze videos and conduct web searches. This powerful combination allows the agent to provide comprehensive responses by understanding both visual content and related web information.
Gemini 2.0 Flash, Google's latest model, brings impressive capabilities to the table. It outperforms even the Pro model while running twice as fast, and it features native image generation, speech synthesis, and built-in tool integration. The best part? The API is free with a generous rate limit while the model is in its experimental phase!
We'll be using the Phidata framework to streamline our agent development and Streamlit for the web interface.
What We’re Building
This Streamlit application combines video analysis and web search using Google's Gemini 2.0 Flash model. The agent can analyze uploaded videos and answer questions by pairing visual understanding with web research.
Features
Video analysis using Gemini 2.0 Flash
Web research integration via DuckDuckGo
Support for multiple video formats (MP4, MOV, AVI)
Real-time video processing
Combined visual and textual analysis
Prerequisites
Before we begin, make sure you have the following:
Python installed on your machine (version 3.10 or higher is recommended)
Your Gemini API Key
A code editor of your choice (we recommend VS Code or PyCharm for their excellent Python support)
Basic familiarity with Python programming
Step-by-Step Instructions
Setting Up the Environment
First, let's get our development environment ready:
Clone the GitHub repository:
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
Go to the multimodal_ai_agent folder:
cd ai_agent_tutorials/multimodal_ai_agent
Install the required dependencies:
pip install -r requirements.txt
Get your API key: Sign up on Google AI Studio and obtain your Gemini API key.
Set your Gemini API key as an environment variable:
GOOGLE_API_KEY=your_api_key_here
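Depending on your operating system, you can set this variable from the terminal before launching the app, for example:
export GOOGLE_API_KEY=your_api_key_here    # macOS / Linux
set GOOGLE_API_KEY=your_api_key_here       # Windows (Command Prompt)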
Creating the Streamlit App
Let's create our app. Create a new file multimodal_agent.py and add the following code:
Let's set up our imports and page configuration:
• Streamlit for the interface
• Phidata for AI agents
• Google Gemini as the LLM
import streamlit as st
from phi.agent import Agent
from phi.model.google import Gemini
from phi.tools.duckduckgo import DuckDuckGo
from google.generativeai import upload_file, get_file
import time
from pathlib import Path
import tempfile
st.set_page_config(
    page_title="Multimodal AI Agent",
    page_icon="🧬",
    layout="wide"
)

st.title("Multimodal AI Agent 🧬")
Create the Multimodal Agent:
• Uses Gemini's experimental model
• Includes web search capability
• Caches agent for efficiency
@st.cache_resource
def initialize_agent():
    return Agent(
        name="Multimodal Analyst",
        model=Gemini(id="gemini-2.0-flash-exp"),
        tools=[DuckDuckGo()],
        markdown=True,
    )
agent = initialize_agent()
Set up video file upload:
• Handles multiple video formats
• Creates temporary file
• Displays video preview
uploaded_file = st.file_uploader(
    "Upload a video file",
    type=['mp4', 'mov', 'avi']
)

if uploaded_file:
    with tempfile.NamedTemporaryFile(delete=False, suffix='.mp4') as tmp_file:
        tmp_file.write(uploaded_file.read())
        video_path = tmp_file.name

    st.video(video_path)
Add user input for questions:
    user_prompt = st.text_area(
        "What would you like to know?",
        placeholder="Ask any question related to the video...",
        help="You can ask questions about the video content"
    )
Process the video and generate analysis:
    if st.button("Analyze & Research"):
        if not user_prompt:
            st.warning("Please enter your question.")
        else:
            try:
                with st.spinner("Processing video and researching..."):
                    video_file = upload_file(video_path)
Handle video processing state:
• Checks processing status
• Waits for completion
• Updates file status
                    while video_file.state.name == "PROCESSING":
                        time.sleep(2)
                        video_file = get_file(video_file.name)
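Optionally (this is not part of the original script), you could guard against uploads that get stuck or fail by capping the wait time and checking for a failed state. The cap and the error messages below are illustrative choices:
                    # Optional variation of the polling loop with a timeout guard
                    MAX_WAIT_SECONDS = 120  # illustrative cap, adjust to taste
                    start_time = time.time()
                    while video_file.state.name == "PROCESSING":
                        if time.time() - start_time > MAX_WAIT_SECONDS:
                            st.error("Video processing timed out. Try a shorter clip.")
                            st.stop()
                        time.sleep(2)
                        video_file = get_file(video_file.name)
                    if video_file.state.name == "FAILED":
                        st.error("Video processing failed. Please try another file.")
                        st.stop()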
Create the analysis prompt:
prompt = f"""
First analyze this video and then answer the following question using both
the video analysis and web research: {user_prompt}
Provide a comprehensive response focusing on practical, actionable information.
"""
result = agent.run(prompt, videos=[video_file])
Display results and handle cleanup:
                st.subheader("Result")
                st.markdown(result.content)
            except Exception as error:
                st.error(f"An error occurred during analysis: {error}")
            finally:
                # Clean up the temporary video file
                Path(video_path).unlink(missing_ok=True)
Add custom styling:
st.markdown("""
<style>
.stTextArea textarea {
height: 100px;
}
</style>
""", unsafe_allow_html=True)
Running the App
With our code in place, it's time to launch the app.
In your terminal, navigate to the project folder and run the following command:
streamlit run multimodal_agent.py
Streamlit will provide a local URL (typically http://localhost:8501).
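If that port is already in use, Streamlit's standard --server.port flag lets you pick another one, for example:
streamlit run multimodal_agent.py --server.port 8502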
Working Application Demo
Conclusion
You've just built a powerful Multimodal AI Agent that combines video analysis with web search capabilities. This application showcases the potential of Gemini 2.0 Flash and how easily we can create sophisticated AI apps using modern frameworks.
Consider these enhancements to take your agent further:
Add export functionality for analysis results (a minimal sketch follows this list)
Enable follow-up questions that maintain context from previous queries about the video
Implement a feature that automatically verifies claims made in the video against web sources
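For the export idea, a minimal sketch, assuming you place it right after st.markdown(result.content) in the script above, could use Streamlit's built-in download button:
                st.subheader("Result")
                st.markdown(result.content)
                # Hypothetical export button: offer the analysis as a Markdown download
                st.download_button(
                    label="Download analysis",
                    data=result.content,
                    file_name="video_analysis.md",
                    mime="text/markdown",
                )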
Keep experimenting and refining to build smarter AI solutions!
We share hands-on tutorials like this 2-3 times a week to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills, subscribe now and be the first to access our latest tutorials.