Build a Self-Guided AI Audio Tour Agent
Fully functional AI agent voice app with step-by-step instructions (100% open source)
When OpenAI released their Agents SDK with the new voice pipeline integration, we immediately saw an opportunity to solve a practical development challenge: how to create dynamic audio content that adapts to user input without constantly re-recording voice talent.
The SDK's lightweight architecture, combined with the expressive capabilities of GPT-4o-mini TTS, provides the perfect toolkit for building voice-based applications that can generate content on demand.
In this tutorial, we'll build a Self-Guided AI Audio Tour Agent - a conversational voice system that generates personalized audio tours based on a user's location, interests, and preferred tour duration.
Our multi-agent architecture leverages the OpenAI Agents SDK to create specialized agents that handle different aspects of tour content, from historical information to architectural details, culinary recommendations, and cultural insights.
OpenAI’s new speech model GPT-4o-mini TTS enhances the overall user experience with natural emotion-rich voice. One of the most powerful features is how easily you can steer the voice characteristics with simple natural language instructions - you can adjust tone, pacing, emotion, and personality traits without complex parameter tuning.
What We’re Building
This Streamlit application creates a personalized audio tour guide system using the OpenAI Agents SDK and GPT-4o-mini TTS. The application generates location-specific content tailored to user interests and converts it to natural-sounding speech.
Features
- Multi-agent architecture with specialized content generators:
  - Orchestrator Agent: coordinates the overall tour flow, manages transitions, and assembles content from all expert agents
  - History Agent: delivers insightful historical narratives with an authoritative voice
  - Architecture Agent: highlights architectural details, styles, and design elements using a descriptive and technical tone
  - Culture Agent: explores local customs, traditions, and artistic heritage with an enthusiastic voice
  - Culinary Agent: describes iconic dishes and food culture in a passionate and engaging tone
- Location-aware content generation with web search integration
- Customizable tour duration (from 5 to 60 minutes)
- High-quality audio output using GPT-4o-mini TTS
- User-friendly interface built with Streamlit
- Dynamic content generation based on user-input location
- Real-time web search integration to fetch relevant, up-to-date details
- Personalized content delivery filtered by user interest categories
How The App Works
Here's how the user flow and the multi-agent workflow fit together:
User Input: The user enters three key pieces of information:
A location they want to explore (city, landmark, neighborhood)
Topics they're interested in (History, Architecture, Culture, Culinary)
Desired tour length (5-60 minutes)
Planning Phase: The Planner Agent calculates time allocations for each content section based on user interests and tour duration.
Content Generation: Specialized agents (Architecture, History, Culture, Culinary) generate content for their domains. Each agent:
Uses web search to get up-to-date information about the location
Formats content to be conversational and suitable for audio
Stays within word limits calculated for the tour duration
Tour Assembly: The Orchestrator Agent combines all sections with smooth transitions.
Text-to-Speech Conversion: The finalized tour text is converted to natural-sounding audio using GPT-4o-mini TTS with the "nova" voice.
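The planning arithmetic behind step 2 is simple: at a typical speaking rate of roughly 150 words per minute, the tour duration fixes a total word budget, which is split evenly across the selected interests. A minimal sketch of that calculation (the helper name `word_budget` is ours, for illustration; the app does the same math inline):

```python
# Rough word budget for a spoken tour: ~150 words per minute,
# split evenly across the user's selected interest sections.
def word_budget(duration_minutes: int, interests: list[str]) -> dict[str, int]:
    words_per_minute = 150
    total_words = duration_minutes * words_per_minute
    per_section = total_words // len(interests)
    return {interest: per_section for interest in interests}
```

For a 10-minute tour covering History and Architecture, each section gets a budget of 750 words.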
Prerequisites
Before we begin, make sure you have the following:
Python installed on your machine (version 3.10 or higher is recommended)
An OpenAI API key (or a key for another LLM provider of your choice)
A code editor of your choice (we recommend VS Code or PyCharm for their excellent Python support)
Basic familiarity with Python programming
Step-by-Step Instructions
Setting Up the Environment
First, let's get our development environment ready:
Clone the GitHub repository:
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
Go to the ai_audio_tour_agent folder:
cd ai_agent_tutorials/ai_audio_tour_agent
Install the required dependencies:
pip install -r requirements.txt
Grab your OpenAI API key.
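The OpenAI client reads the key from the `OPENAI_API_KEY` environment variable, so the simplest setup is to export it in your shell before launching the app (replace the placeholder with your real key):

```shell
# Make the key available to the app; the OpenAI SDK picks it up automatically
export OPENAI_API_KEY="your-api-key-here"
```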
Creating the Streamlit App
Let's explore how our application works by examining the key components one by one.
Create a new file agent.py to define all the specialized agents for our audio tour.
First, let's look at the Architecture Agent:
from pydantic import BaseModel
from agents import Agent, ModelSettings, WebSearchTool

ARCHITECTURE_AGENT_INSTRUCTIONS = """
You are the Architecture agent for a self-guided audio tour system.
Given a location and the user's areas of interest, your role is to:
1. Describe architectural styles, notable buildings, urban planning, and design elements
2. Provide technical insights balanced with accessible explanations
3. Highlight the most visually striking or historically significant structures
4. Adopt a detailed, descriptive voice style when delivering architectural content
...
"""

class Architecture(BaseModel):
    output: str

architecture_agent = Agent(
    name="ArchitectureAgent",
    instructions=ARCHITECTURE_AGENT_INSTRUCTIONS,
    model="gpt-4o-mini",
    tools=[WebSearchTool()],
    model_settings=ModelSettings(tool_choice="required"),
    output_type=Architecture,
)
Similar agents are defined for History, Culture, and Culinary content, each with specific instructions and output types.
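As an illustration of the pattern, a History agent differs from the Architecture agent only in its instructions and output model. The instruction text below is abridged and paraphrased for illustration, not copied from the repo:

```python
from pydantic import BaseModel

# Illustrative, abridged instructions -- the repo's actual text differs
HISTORY_AGENT_INSTRUCTIONS = """
You are the History agent for a self-guided audio tour system.
Given a location and the user's areas of interest, your role is to:
1. Narrate the key historical events, figures, and eras tied to the location
2. Keep explanations vivid and accessible for a listening audience
3. Adopt an authoritative, storytelling voice style
"""

class History(BaseModel):
    output: str

# history_agent is then constructed exactly like architecture_agent,
# swapping in HISTORY_AGENT_INSTRUCTIONS and output_type=History.
```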
The Planner Agent manages time allocation:
planner_agent = Agent(
    name="PlannerAgent",
    instructions=PLANNER_INSTRUCTIONS,
    model="gpt-4o",
    output_type=Planner,
)
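The Planner's `output_type` model isn't shown above. One plausible shape (hypothetical field names, for illustration only) is a list of sections with per-topic word allocations:

```python
from pydantic import BaseModel

# Hypothetical shape for the planner's structured output --
# the repo's actual Planner model may use different field names.
class Section(BaseModel):
    topic: str        # e.g. "History"
    word_count: int   # words allocated to this section

class Planner(BaseModel):
    sections: list[Section]
```

Because the SDK validates the model's response against this Pydantic model, downstream code can rely on the fields being present and typed.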
And the Orchestrator Agent assembles everything:
orchestrator_agent = Agent(
    name="OrchestratorAgent",
    instructions=ORCHESTRATOR_INSTRUCTIONS,
    model="gpt-4o-mini",
    output_type=FinalTour,
)
Create a new file manager.py to define the TourManager, the class that orchestrates the execution flow between the agents:
class TourManager:
    """Orchestrates the full flow between the specialized agents."""

    def __init__(self) -> None:
        self.console = Console()
        self.printer = Printer(self.console)

    async def run(self, query: str, interests: list, duration: str) -> str:
        # Calculate word limits based on duration (~150 spoken words per minute)
        words_per_minute = 150
        total_words = int(duration) * words_per_minute
        words_per_section = total_words // len(interests)

        # Only research the interests the user selected
        research_results = {}
        if "Architecture" in interests:
            research_results["architecture"] = await self._get_architecture(query, interests, words_per_section)
        if "History" in interests:
            research_results["history"] = await self._get_history(query, interests, words_per_section)
        # ...and so on for the other interests

        # Get the final tour with only the selected interests
        final_tour = await self._get_final_tour(query, interests, duration, research_results)

        # Build the final tour content in the user's interest order
        sections = []
        for interest in interests:
            if interest.lower() in research_results:
                sections.append(research_results[interest.lower()].output)

        # Format the final tour with natural transitions
        final = ""
        for i, content in enumerate(sections):
            if i > 0:
                final += "\n\n"  # Add spacing between sections
            final += content
        return final
The manager calls specialized methods for each content type and calculates word counts based on the requested tour duration.
Create a new file ai_audio_tour_agent.py for our main application.
Here's the function that converts the generated tour text to speech:
def tts(text):
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()
    speech_file_path = Path(__file__).parent / "speech_tour.mp3"
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="nova",
        input=text,
        instructions="""You are a friendly and engaging tour guide. Speak
naturally and conversationally, as if you're walking alongside the visitor.
Use a warm, inviting tone throughout. Avoid robotic or formal language.
Make the tour feel like a casual conversation with a knowledgeable friend.
Use natural transitions between topics and maintain an enthusiastic but
relaxed pace.""",
    )
    response.stream_to_file(speech_file_path)
    return speech_file_path
The Streamlit UI provides a clean interface for users to input preferences:
# Create a clean layout with cards
col1, col2 = st.columns([2, 1])

with col1:
    st.markdown("### 📍 Where would you like to explore?")
    location = st.text_input("", placeholder="Enter a city, landmark, or location...")

    st.markdown("### 🎯 What interests you?")
    interests = st.multiselect(
        "",
        options=["History", "Architecture", "Culinary", "Culture"],
        default=["History", "Architecture"],
        help="Select the topics you'd like to learn about"
    )

with col2:
    st.markdown("### ⏱️ Tour Settings")
    duration = st.slider(
        "Tour Duration (minutes)",
        min_value=5,
        max_value=60,
        value=10,
        step=5,
        help="Choose how long you'd like your tour to be"
    )
The tour generation process is triggered when the user clicks the "Generate Tour" button:
if st.button("🎧 Generate Tour", type="primary"):
    # ... validation checks

    with st.spinner(f"Creating your personalized tour of {location}..."):
        mgr = TourManager()
        final_tour = run_async(mgr.run, location, interests, duration)

        # Display the tour content
        with st.expander("📝 Tour Content", expanded=True):
            st.markdown(final_tour)

        # Generate and display audio
        with st.spinner("🎙️ Generating audio tour..."):
            progress_bar = st.progress(0)
            tour_audio = tts(final_tour)
            progress_bar.progress(100)

        st.markdown("### 🎧 Listen to Your Tour")
        st.audio(tour_audio, format="audio/mp3")

        # Add download button
        with open(tour_audio, "rb") as file:
            st.download_button(
                label="📥 Download Audio Tour",
                data=file,
                file_name=f"{location.lower().replace(' ', '_')}_tour.mp3",
                mime="audio/mp3"
            )
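Note that the UI calls run_async to bridge Streamlit's synchronous script model with TourManager's async run method. The repo's helper isn't shown in this excerpt; a minimal version could simply wrap asyncio.run:

```python
import asyncio

# Minimal sketch of a sync-to-async bridge for Streamlit;
# the repo's actual run_async helper may differ.
def run_async(coro_fn, *args):
    """Run an async function to completion from a synchronous context."""
    return asyncio.run(coro_fn(*args))
```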
Running the App
With our code in place, it's time to launch the app.
In your terminal, navigate to the project folder and run the following command:
streamlit run ai_audio_tour_agent.py
Streamlit will provide a local URL (typically http://localhost:8501). Open this in your web browser, put in your API key, and you're ready to create your first tour.
Working Application Demo
Conclusion
You've built a powerful audio tour generation system that uses the OpenAI Agents SDK and OpenAI's latest TTS capabilities. This application demonstrates several important concepts:
How to coordinate multiple specialized agents to tackle complex tasks
How to use web search tools to provide up-to-date information
How to generate content optimized for audio delivery
How to convert text to natural-sounding speech using GPT-4o-mini TTS
To enhance this project further, consider:
Adding location detection: Integrate with mapping APIs to suggest nearby points of interest
Implementing voice input: Allow users to speak their preferences rather than typing
Creating step-by-step tours: Generate directions between landmarks for walking tours
Supporting multiple languages: Translate tours for international travelers
This tour guide is just one example of what's possible with these technologies. Keep experimenting with different agent configurations and features to build more sophisticated AI applications.
We share hands-on tutorials like this 2-3 times a week, to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.