
Build a Self-Guided AI Audio Tour Agent

A fully functional AI voice agent app with step-by-step instructions (100% open source)

When OpenAI released their Agents SDK with the new voice pipeline integration, we immediately saw an opportunity to solve a practical development challenge: how to create dynamic audio content that adapts to user input without constantly re-recording voice talent.

The SDK's lightweight architecture, combined with the expressive capabilities of GPT-4o-mini TTS, provides the perfect toolkit for building voice-based applications that can generate content on demand.

In this tutorial, we'll build a Self-Guided AI Audio Tour Agent - a conversational voice system that generates personalized audio tours based on a user's location, interests, and preferred tour duration.

Our multi-agent architecture leverages the OpenAI Agents SDK to create specialized agents that handle different aspects of tour content, from historical information to architectural details, culinary recommendations, and cultural insights.

OpenAI's new speech model, GPT-4o-mini TTS, enhances the overall user experience with a natural, emotion-rich voice. One of its most powerful features is how easily you can steer voice characteristics with simple natural-language instructions: you can adjust tone, pacing, emotion, and personality traits without complex parameter tuning.

Don’t forget to share this tutorial on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to support us!

What We’re Building

This Streamlit application creates a personalized audio tour guide system using the OpenAI Agents SDK and GPT-4o-mini TTS. The application generates location-specific content tailored to user interests and converts it to natural-sounding speech.

Features

  1. Multi-agent architecture with specialized content generators

    • Orchestrator Agent
      Coordinates the overall tour flow, manages transitions, and assembles content from all expert agents.

    • History Agent
      Delivers insightful historical narratives with an authoritative voice.

    • Architecture Agent
      Highlights architectural details, styles, and design elements using a descriptive and technical tone.

    • Culture Agent
      Explores local customs, traditions, and artistic heritage with an enthusiastic voice.

    • Culinary Agent
      Describes iconic dishes and food culture in a passionate and engaging tone.

  2. Location-aware, dynamic content generation based on the user's input location

  3. Real-time web search integration to fetch relevant, up-to-date details

  4. Personalized content delivery filtered by user interest categories

  5. Customizable tour duration (from 5 to 60 minutes)

  6. High-quality audio output using GPT-4o-mini TTS

  7. User-friendly interface built with Streamlit

How The App Works

Here's how the user flow and the multi-agent workflow fit together:

  1. User Input: The user enters three key pieces of information:

    • A location they want to explore (city, landmark, neighborhood)

    • Topics they're interested in (History, Architecture, Culture, Culinary)

    • Desired tour length (5-60 minutes)

  2. Planning Phase: The Planner Agent calculates time allocations for each content section based on user interests and tour duration.

  3. Content Generation: Specialized agents (Architecture, History, Culture, Culinary) generate content for their domains. Each agent:

    • Uses web search to get up-to-date information about the location

    • Formats content to be conversational and suitable for audio

    • Stays within word limits calculated for the tour duration

  4. Tour Assembly: The Orchestrator Agent combines all sections with smooth transitions.

  5. Text-to-Speech Conversion: The finalized tour text is converted to natural-sounding audio using GPT-4o-mini TTS with the "nova" voice.
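The arithmetic behind the planning phase is simple to sketch. The snippet below is a minimal illustration (the 150 words-per-minute pacing figure comes from the TourManager code shown later in this tutorial; the function name is ours, not part of the app):

```python
# Sketch of the word-budget calculation performed during the planning phase.
# Assumes ~150 spoken words per minute, matching the TourManager code.

def plan_word_budget(interests: list[str], duration_minutes: int) -> dict[str, int]:
    """Split the total word budget evenly across the selected interests."""
    words_per_minute = 150
    total_words = duration_minutes * words_per_minute
    words_per_section = total_words // len(interests)
    return {interest: words_per_section for interest in interests}

# Example: a 10-minute tour covering two topics gives each section ~750 words.
budget = plan_word_budget(["History", "Architecture"], 10)
print(budget)  # {'History': 750, 'Architecture': 750}
```

Each expert agent is then told to stay within its section's word budget so the assembled tour lands close to the requested duration.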

Prerequisites

Before we begin, make sure you have the following:

  1. Python installed on your machine (version 3.10 or higher is recommended)

  2. An OpenAI API key (or an API key for another LLM provider of your choice)

  3. A code editor of your choice (we recommend VS Code or PyCharm for their excellent Python support)

  4. Basic familiarity with Python programming

Step-by-Step Instructions

Setting Up the Environment

First, let's get our development environment ready:

  1. Clone the GitHub repository:

git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git

  2. Go to the ai_audio_tour_agent folder and install the dependencies:

cd ai_agent_tutorials/ai_audio_tour_agent
pip install -r requirements.txt

  3. Grab your OpenAI API key.

Creating the Streamlit App

Let's explore how our application works by examining the key components one by one.

Create a new file agent.py to define all the specialized agents for our audio tour.

  1. First, let's look at the Architecture Agent:

from pydantic import BaseModel

from agents import Agent, ModelSettings, WebSearchTool

ARCHITECTURE_AGENT_INSTRUCTIONS = """
You are the Architecture agent for a self-guided audio tour system.
Given a location and the user's areas of interest, your role is to:

1. Describe architectural styles, notable buildings, urban planning, and design elements
2. Provide technical insights balanced with accessible explanations
3. Highlight the most visually striking or historically significant structures
4. Adopt a detailed, descriptive voice style when delivering architectural content
...
"""

class Architecture(BaseModel):
    output: str

architecture_agent = Agent(
    name="ArchitectureAgent",
    instructions=ARCHITECTURE_AGENT_INSTRUCTIONS,
    model="gpt-4o-mini",
    tools=[WebSearchTool()],
    model_settings=ModelSettings(tool_choice="required"),
    output_type=Architecture,
)
  2. Similar agents are defined for History, Culture, and Culinary content, each with specific instructions and output types.

  3. The Planner Agent manages time allocation:

planner_agent = Agent(
    name="PlannerAgent",
    instructions=PLANNER_INSTRUCTIONS,
    model="gpt-4o",
    output_type=Planner,
)
  4. And the Orchestrator Agent assembles everything:

orchestrator_agent = Agent(
    name="OrchestratorAgent",
    instructions=ORCHESTRATOR_INSTRUCTIONS,
    model="gpt-4o-mini",
    output_type=FinalTour,
)

Create a new file manager.py to define the TourManager.

  1. The TourManager class orchestrates the execution flow between agents:

class TourManager:
    """Orchestrates the full flow"""
    
    def __init__(self) -> None:
        self.console = Console()
        self.printer = Printer(self.console)
    
    async def run(self, query: str, interests: list, duration: str) -> str:
        # Calculate word limits based on duration
        words_per_minute = 150
        total_words = int(duration) * words_per_minute
        words_per_section = total_words // len(interests)
        
        # Only research selected interests
        research_results = {}
        if "Architecture" in interests:
            research_results["architecture"] = await self._get_architecture(query, interests, words_per_section)
        if "History" in interests:
            research_results["history"] = await self._get_history(query, interests, words_per_section)
        # ...and so on for the other interests
        
        # Get final tour with only selected interests
        final_tour = await self._get_final_tour(query, interests, duration, research_results)
        
        # Build final tour content
        sections = []
        for interest in interests:
            if interest.lower() in research_results:
                sections.append(research_results[interest.lower()].output)
                
        # Format final tour with natural transitions
        final = ""
        for i, content in enumerate(sections):
            if i > 0:
                final += "\n\n"  # Add spacing between sections
            final += content
            
        return final

The manager calls specialized methods for each content type and calculates word counts based on the requested tour duration.

Create a new file ai_audio_tour_agent.py for our main application.

  1. Here's the function that converts the generated tour text to speech:

def tts(text):
    from pathlib import Path
    from openai import OpenAI
    
    client = OpenAI()
    speech_file_path = Path(__file__).parent / "speech_tour.mp3"
    
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="nova",
        input=text,
        instructions="""You are a friendly and engaging tour guide. Speak
naturally and conversationally, as if you're walking alongside the visitor.
Use a warm, inviting tone throughout. Avoid robotic or formal language.
Make the tour feel like a casual conversation with a knowledgeable friend.
Use natural transitions between topics and maintain an enthusiastic but
relaxed pace."""
    )
    
    response.stream_to_file(speech_file_path)
    return speech_file_path
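One practical caveat: OpenAI's speech endpoint caps input length (4,096 characters at the time of writing), so a very long tour may need to be split before conversion. Below is a hypothetical chunking helper; the limit value and the sentence-splitting heuristic are our assumptions, not part of the original app:

```python
# Hypothetical helper: split long tour text into TTS-sized chunks on
# sentence boundaries, so each chunk stays under the endpoint's input cap.

def chunk_for_tts(text: str, limit: int = 4096) -> list[str]:
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        if not sentence.strip():
            continue
        piece = sentence if sentence.endswith(".") else sentence + "."
        # Start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + len(piece) + 1 > limit:
            chunks.append(current.strip())
            current = ""
        current += piece + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Each chunk could then be passed through tts() in turn and the resulting MP3 segments concatenated into one file.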
  2. The Streamlit UI provides a clean interface for users to input preferences:

# Create a clean layout with cards
col1, col2 = st.columns([2, 1])

with col1:
    st.markdown("### 📍 Where would you like to explore?")
    location = st.text_input("", placeholder="Enter a city, landmark, or location...")
    
    st.markdown("### 🎯 What interests you?")
    interests = st.multiselect(
        "",
        options=["History", "Architecture", "Culinary", "Culture"],
        default=["History", "Architecture"],
        help="Select the topics you'd like to learn about"
    )

with col2:
    st.markdown("### ⏱️ Tour Settings")
    duration = st.slider(
        "Tour Duration (minutes)",
        min_value=5,
        max_value=60,
        value=10,
        step=5,
        help="Choose how long you'd like your tour to be"
    )
  3. The tour generation process is triggered when the user clicks the "Generate Tour" button:

if st.button("🎧 Generate Tour", type="primary"):
    # ... validation checks
    
    with st.spinner(f"Creating your personalized tour of {location}..."):
        mgr = TourManager()
        final_tour = run_async(mgr.run, location, interests, duration)
    
    # Display the tour content
    with st.expander("📝 Tour Content", expanded=True):
        st.markdown(final_tour)
    
    # Generate and display audio
    with st.spinner("🎙️ Generating audio tour..."):
        progress_bar = st.progress(0)
        tour_audio = tts(final_tour)
        progress_bar.progress(100)
    
    st.markdown("### 🎧 Listen to Your Tour")
    st.audio(tour_audio, format="audio/mp3")
    
    # Add download button
    with open(tour_audio, "rb") as file:
        st.download_button(
            label="📥 Download Audio Tour",
            data=file,
            file_name=f"{location.lower().replace(' ', '_')}_tour.mp3",
            mime="audio/mp3"
        )
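The snippet above calls run_async, a small helper that lets Streamlit's synchronous script invoke the async TourManager.run. Its definition isn't shown above; here is a minimal sketch of what such a helper might look like (the exact implementation in the repository may differ):

```python
import asyncio

def run_async(coro_fn, *args, **kwargs):
    """Run an async function to completion from synchronous Streamlit code."""
    # Use a fresh event loop: Streamlit scripts run in a plain worker thread
    # that has no loop of its own.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro_fn(*args, **kwargs))
    finally:
        loop.close()
```

With this in place, run_async(mgr.run, location, interests, duration) blocks until the full tour text is assembled and returns it to the Streamlit script.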

Running the App

With our code in place, it's time to launch the app.

  • In your terminal, navigate to the project folder and run the following command:

streamlit run ai_audio_tour_agent.py
  • Streamlit will provide a local URL (typically http://localhost:8501). Open this in your web browser, put in your API key, and you're ready to create your first tour.

Working Application Demo

Conclusion

You've built a powerful audio tour generation system using the OpenAI Agents SDK and OpenAI's latest TTS capabilities. This application demonstrates several important concepts:

  • How to coordinate multiple specialized agents to tackle complex tasks

  • How to use web search tools to provide up-to-date information

  • How to generate content optimized for audio delivery

  • How to convert text to natural-sounding speech using GPT-4o-mini TTS

To enhance this project further, consider:

  • Adding location detection: Integrate with mapping APIs to suggest nearby points of interest

  • Implementing voice input: Allow users to speak their preferences rather than typing

  • Creating step-by-step tours: Generate directions between landmarks for walking tours

  • Supporting multiple languages: Translate tours for international travelers

This tour guide is just one example of what's possible with these technologies. Keep experimenting with different agent configurations and features to build more sophisticated AI applications.

We share hands-on tutorials like this 2-3 times a week to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills, subscribe now and be the first to access our latest tutorials.

