Build a Multimodal AI Coding Agent Team with o3-mini and Gemini 2.0
Fully functional AI agent app with step-by-step instructions (100% open source)
Coding problems reach developers in many formats - as text descriptions, screenshots from documentation, or photos of whiteboards. A tool that understands all of these formats and helps generate optimal solutions can significantly speed up development.
In this tutorial, we'll build a powerful multimodal coding assistant that combines three specialized AI agents working together: a Vision Agent for processing images of coding problems, a Coding Agent for generating optimal solutions, and an Execution Agent for running and analyzing code in a sandbox environment.
We're using the Agno (prev. Phidata) framework to manage our multi-agent system, leveraging its ability to coordinate multiple AI models effectively. This setup allows each agent to focus on its specialty while working together seamlessly to solve coding problems.
What We’re Building
This Streamlit application brings together three specialized AI agents working as a team to solve coding problems:
Vision Agent (using Gemini 2.0 Pro): Handles image processing, extracting coding problems and requirements from uploaded screenshots or pictures
Coding Agent (using o3-mini): Generates optimized code solutions with proper documentation and type hints
Execution Agent (using o3-mini + E2B): Runs the generated code in a secure sandbox environment and provides execution results and error analysis
Users can submit problems either as text descriptions or images, and the appropriate agent takes charge based on the input type.
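That routing logic can be sketched in a few lines. This is an illustrative sketch only (the function and the agent labels are ours, not names from the app's code): given which inputs are present, it returns the pipeline of agents that will handle the request.

```python
# Illustrative sketch of the input routing described above.
# Agent labels ("vision", "coding", "execution") are hypothetical names,
# not identifiers from the app itself.

def route_input(has_image: bool, has_text: bool) -> list:
    """Return the agent pipeline for a given combination of inputs."""
    if has_image and not has_text:
        # The Vision Agent extracts the problem first, then the
        # Coding Agent solves it and the Execution Agent runs it.
        return ["vision", "coding", "execution"]
    if has_text and not has_image:
        # Text goes straight to the Coding Agent.
        return ["coding", "execution"]
    # Both or neither: nothing to route.
    return []
```

The app itself enforces the same either/or choice in its button handler.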
Features:
Multi-modal input support (images and text)
Intelligent code generation with optimal time/space complexity
Secure sandboxed code execution
Real-time execution results and error handling
Interactive problem processing with multiple specialized agents
Tech Stack
The core of our application relies on OpenAI's new o3-mini model, which excels at STEM reasoning and coding tasks while delivering faster responses than its predecessors. o3-mini supports crucial developer features like function calling and structured outputs, making it ideal for generating optimized code solutions with proper documentation and type hints.
For image understanding, we're using Google's Gemini 2.0 Pro experimental model, which offers powerful multimodal capabilities and can effectively extract coding problems from images. The model comes with a large 2-million token context window and demonstrates superior performance in understanding complex technical content and code-related tasks.
Our application's architecture is built using Agno (prev. Phidata), a lightweight framework designed for creating multi-agent systems with superior performance, managing agent state and coordination with minimal overhead.
We're also integrating E2B, which provides secure sandboxed environments for executing AI-generated code, offering isolation and safety features crucial for running untrusted code with a quick startup time of ~150ms.
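To make the E2B piece concrete, here is a minimal round-trip sketch. The `demo` function and the `summarize_logs` helper are ours (not part of the app), it assumes an `E2B_API_KEY` is set in the environment, and it assumes `execution.logs` exposes `stdout`/`stderr` lists as in the current SDK - treat it as a sketch, not a definitive usage guide.

```python
# Sketch of a minimal E2B round-trip (assumptions noted above).

def summarize_logs(stdout_lines, stderr_lines):
    """Pure helper: collapse sandbox log streams into one printable report."""
    report = []
    if stdout_lines:
        report.append("stdout:\n" + "\n".join(stdout_lines))
    if stderr_lines:
        report.append("stderr:\n" + "\n".join(stderr_lines))
    return "\n\n".join(report) if report else "(no output)"

def demo() -> None:
    # Imported here so the helper above works even without the package installed.
    from e2b_code_interpreter import Sandbox

    sandbox = Sandbox(timeout=60)  # isolated cloud sandbox, ~150 ms startup
    try:
        execution = sandbox.run_code("print(2 + 2)")
        print(summarize_logs(execution.logs.stdout, execution.logs.stderr))
    finally:
        sandbox.close()
```

The app wraps this same create/run/close cycle in Streamlit session state so one sandbox can be reused across reruns.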
Prerequisites
Before we begin, make sure you have the following:
Step-by-Step Instructions
Setting Up the Environment
First, let's get our development environment ready:
Clone the GitHub repository:
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
Go to the ai_coding_agent_o3-mini folder:
cd ai_agent_tutorials/ai_coding_agent_o3-mini
Install the required dependencies:
pip install -r requirements.txt
Obtain your API keys:
Get an OpenAI API key from https://platform.openai.com/
Get a Google (Gemini) API key from https://makersuite.google.com/app/apikey
Get an E2B API key from https://e2b.dev/docs/getting-started/api-key
Creating the Streamlit App
Now let's build the app. Create a new file called ai_coding_agent_o3.py and add the following code:
Let's set up our imports:
```python
import os
from io import BytesIO
from typing import Optional, Dict, Any

import streamlit as st
from PIL import Image

from agno.agent import Agent, RunResponse
from agno.models.openai import OpenAIChat
from agno.models.google import Gemini
from e2b_code_interpreter import Sandbox
```
Initialize session state and setup configuration:
```python
def initialize_session_state() -> None:
    if 'openai_key' not in st.session_state:
        st.session_state.openai_key = ''
    if 'gemini_key' not in st.session_state:
        st.session_state.gemini_key = ''
    if 'e2b_key' not in st.session_state:
        st.session_state.e2b_key = ''
    if 'sandbox' not in st.session_state:
        st.session_state.sandbox = None

def setup_sidebar() -> None:
    with st.sidebar:
        st.title("API Configuration")
        st.session_state.openai_key = st.text_input(
            "OpenAI API Key", value=st.session_state.openai_key, type="password"
        )
        st.session_state.gemini_key = st.text_input(
            "Gemini API Key", value=st.session_state.gemini_key, type="password"
        )
        st.session_state.e2b_key = st.text_input(
            "E2B API Key", value=st.session_state.e2b_key, type="password"
        )
```
Create specialized AI agents:
```python
def create_agents() -> tuple[Agent, Agent, Agent]:
    vision_agent = Agent(
        model=Gemini(id="gemini-2.0-pro-exp-02-05", api_key=st.session_state.gemini_key),
        markdown=True,
    )

    coding_agent = Agent(
        model=OpenAIChat(
            id="o3-mini",
            api_key=st.session_state.openai_key,
            system_prompt="""You are an expert Python programmer..."""
        ),
        markdown=True
    )

    execution_agent = Agent(
        model=OpenAIChat(
            id="o3-mini",
            api_key=st.session_state.openai_key,
            system_prompt="""You are an expert at executing Python code..."""
        ),
        markdown=True
    )

    return vision_agent, coding_agent, execution_agent
```
Initialize and manage the sandbox environment:
```python
def initialize_sandbox() -> None:
    try:
        # Close any existing sandbox before creating a fresh one
        if st.session_state.sandbox:
            try:
                st.session_state.sandbox.close()
            except Exception:
                pass
        os.environ['E2B_API_KEY'] = st.session_state.e2b_key
        st.session_state.sandbox = Sandbox(timeout=60)
    except Exception as e:
        st.error(f"Failed to initialize sandbox: {str(e)}")
        st.session_state.sandbox = None

def run_code_in_sandbox(code: str) -> Dict[str, Any]:
    if not st.session_state.sandbox:
        initialize_sandbox()
    execution = st.session_state.sandbox.run_code(code)
    return {
        "logs": execution.logs,
        "files": st.session_state.sandbox.files.list("/")
    }
```
Process images with Gemini 2.0 Pro:
```python
def process_image_with_gemini(vision_agent: Agent, image: Image.Image) -> str:
    prompt = """Analyze this image and extract any coding problem..."""
    temp_path = "temp_image.png"
    try:
        # Gemini expects RGB; convert if needed and save to a temporary file
        if image.mode != 'RGB':
            image = image.convert('RGB')
        image.save(temp_path, format="PNG")
        response = vision_agent.run(
            prompt,
            images=[{"filepath": temp_path}]
        )
        return response.content
    except Exception:
        return "Failed to process the image..."
    finally:
        # Always clean up the temporary file
        if os.path.exists(temp_path):
            os.remove(temp_path)
```
Execute code with error handling:
```python
def execute_code_with_agent(execution_agent: Agent, code: str, sandbox: Sandbox) -> str:
    try:
        sandbox.set_timeout(30)
        execution = sandbox.run_code(code)

        if execution.error:
            if "TimeoutException" in str(execution.error):
                return "⚠️ Execution Timeout..."
            error_prompt = f"""The code execution resulted in an error..."""
            response = execution_agent.run(error_prompt)
            return f"⚠️ Execution Error:\n{response.content}"

        try:
            files = sandbox.files.list("/")
        except Exception:
            files = []

        prompt = f"""Here is the code execution result..."""
        response = execution_agent.run(prompt)
        return response.content
    except Exception as e:
        return f"⚠️ Sandbox Error: {str(e)}"
```
Main application setup:
```python
def main() -> None:
    st.title("O3-Mini Coding Agent")
    initialize_session_state()
    setup_sidebar()

    with st.sidebar:
        st.info("⏱️ Code execution timeout: 30 seconds")

    if not (st.session_state.openai_key and
            st.session_state.gemini_key and
            st.session_state.e2b_key):
        st.warning("Please enter all required API keys in the sidebar.")
        return

    vision_agent, coding_agent, execution_agent = create_agents()
```
Handle user input:
```python
    # Still inside main(): collect the user's input
    uploaded_image = st.file_uploader(
        "Upload an image of your coding problem (optional)",
        type=['png', 'jpg', 'jpeg']
    )
    if uploaded_image:
        st.image(uploaded_image, caption="Uploaded Image", use_container_width=True)

    user_query = st.text_area(
        "Or type your coding problem here:",
        placeholder="Example: Write a function to find the sum of two numbers...",
        height=100
    )
```
Process user input and generate solutions:
```python
    # Still inside main(): route the input to the right agent
    if st.button("Generate & Execute Solution", type="primary"):
        if uploaded_image and not user_query:
            with st.spinner("Processing image..."):
                image = Image.open(uploaded_image)
                extracted_query = process_image_with_gemini(vision_agent, image)
                response = coding_agent.run(extracted_query)
        elif user_query and not uploaded_image:
            with st.spinner("Generating solution..."):
                response = coding_agent.run(user_query)
        else:
            st.warning("Please provide either an image or a text description, not both.")
```
Execute and display results:
````python
    # Still inside main(): display the solution and run it in the sandbox
    if 'response' in locals():
        st.divider()
        st.subheader("💻 Solution")
        code_blocks = response.content.split("```python")
        if len(code_blocks) > 1:
            code = code_blocks[1].split("```")[0].strip()
            st.code(code, language="python")
            with st.spinner("Executing code..."):
                initialize_sandbox()
                execution_results = execute_code_with_agent(
                    execution_agent,
                    code,
                    st.session_state.sandbox
                )
                st.markdown(execution_results)

if __name__ == "__main__":
    main()
````
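The code-extraction step above (splitting the model's markdown reply on its fenced ```python block) can be factored into a small pure helper. The function name here is ours, not the tutorial's; it's a sketch of the same split logic:

````python
from typing import Optional

def extract_python_code(markdown: str) -> Optional[str]:
    """Return the first ```python fenced block in `markdown`, or None."""
    parts = markdown.split("```python")
    if len(parts) < 2:
        return None  # the reply contained no Python code block
    # Everything up to the next closing fence is the code body
    return parts[1].split("```")[0].strip()
````

Isolating this logic makes it trivial to unit-test and to swap in a more robust markdown parser later.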
Running the App
With our code in place, it's time to launch the app.
In your terminal, navigate to the project folder and run the following command:
streamlit run ai_coding_agent_o3.py
Streamlit will provide a local URL (typically http://localhost:8501).
Working Application Demo
Conclusion
You've successfully built a powerful multimodal coding assistant that combines three specialized AI agents to help solve coding problems. Using o3-mini's coding expertise, Gemini 2.0's vision capabilities, and E2B's secure execution environment, your tool can take coding problems in any format and turn them into working solutions.
For further enhancements, consider:
Extending the Coding Agent to handle different programming languages beyond Python
Adding an agent specifically for generating comprehensive test cases for the solutions
Implementing a storage system to save and categorize previously solved problems
Allowing users to provide specific requirements or preferences for code style and documentation
Keep experimenting and have fun watching these AI agents work together!
We share hands-on tutorials like this 2-3 times a week to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills, subscribe now and be the first to access our latest tutorials.