Build a Voice RAG Agent
Fully functional agentic RAG voice app with step-by-step instructions (100% open source)
We've been stuck in text-based AI interfaces for too long. Sure, they work, but they're not the most natural way humans communicate. Now, with OpenAI's new Agents SDK and their recent text-to-speech models, we can build voice applications without drowning in complexity or code.
In this tutorial, we'll build a Multi-agent Voice RAG system that speaks its answers aloud. We'll create a multi-agent workflow where specialized AI agents handle different parts of the process - one agent focuses on processing documentation content, another optimizes responses for natural speech, and finally OpenAI's text-to-speech model delivers the answer in a human-like voice.
Our RAG app uses the OpenAI Agents SDK to create and orchestrate the agents that handle different stages of the workflow. OpenAI's new speech model, gpt-4o-mini-tts, enhances the overall experience with a natural, emotion-rich voice. You can steer its voice characteristics, such as tone, pacing, emotion, and personality, with simple natural-language instructions.
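To see what that steering looks like in isolation, here is a minimal sketch of a standalone TTS call. The voice and instruction text are just examples; the same streaming pattern appears inside the app later:
import asyncio
from openai import AsyncOpenAI
from openai.helpers import LocalAudioPlayer

async def demo() -> None:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    async with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input="Retrieval-augmented generation grounds answers in your own documents.",
        # Plain-English steering of tone, pacing, and personality:
        instructions="Speak warmly and at a relaxed pace, like a friendly teacher.",
        response_format="pcm",
    ) as response:
        await LocalAudioPlayer().play(response)

asyncio.run(demo())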
What We’re Building
We'll build a voice-enabled RAG application using OpenAI's Agents SDK and Streamlit. The app lets users upload PDF documents, ask questions about them, and receive both text and spoken responses via OpenAI's text-to-speech model.
Features
Multi-agent RAG system with:
Documentation Processor Agent that analyzes documents and generates clear, informative responses to user queries
TTS Optimization Agent that refines responses for natural speech patterns with proper pacing and emphasis
PDF document processing and chunking
Qdrant vector database for similarity search
Real-time text-to-speech with multiple voice options
Downloadable audio responses
Support for multiple document uploads
How The App Works
The application follows the user's journey through several key steps:
Initial Setup:
User enters their API keys (OpenAI and Qdrant) in the sidebar
The system initializes connections to both services
Document Processing:
User uploads PDF documents through the Streamlit interface
Documents are split into chunks with metadata
Each chunk is embedded and stored in the Qdrant vector database
The interface tracks processed documents in the sidebar
Query Handling:
User types a question about the uploaded documents
The system converts the query to an embedding and searches for relevant document chunks
Agentic Workflow:
The Document Processor Agent receives the query and relevant document chunks
It analyzes the content and generates a clear, informative response
The TTS Optimization Agent then refines this response specifically for speech, adding natural pauses and emphasis
The OpenAI Agents SDK orchestrates this handoff between the specialized agents
Speech Synthesis:
The optimized text is sent to GPT-4o-mini-TTS with the user-selected voice
Audio is generated and played in real-time through the browser
A downloadable MP3 version is also created
Prerequisites
Before we begin, make sure you have the following:
Python installed on your machine (version 3.10 or higher is recommended)
Your OpenAI API key, plus a Qdrant Cloud API key and cluster URL
A code editor of your choice (we recommend VS Code or PyCharm for their excellent Python support)
Basic familiarity with Python programming
Code Walkthrough
Setting Up the Environment
First, let's get our development environment ready:
Clone the GitHub repository:
git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
Go to the voice_rag_openaisdk folder:
cd rag_tutorials/voice_rag_openaisdk
Install the required dependencies:
pip install -r requirements.txt
API keys: Get your OpenAI API key. Set up a Qdrant Cloud account and get your API key and cluster URL. Create a .env file with your credentials:
OPENAI_API_KEY='your-openai-api-key'
QDRANT_URL='your-qdrant-url'
QDRANT_API_KEY='your-qdrant-api-key'
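The imports below include python-dotenv. As a minimal sketch (how you wire these into the sidebar defaults is up to you), loading the values at startup looks like this:
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

openai_key = os.getenv("OPENAI_API_KEY", "")
qdrant_url = os.getenv("QDRANT_URL", "")
qdrant_api_key = os.getenv("QDRANT_API_KEY", "")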
Creating the Streamlit App
Let's create our app. Create a new file rag_voice.py and add the following code:
First, import the necessary libraries:
from typing import List, Dict, Optional, Tuple
import os
import tempfile
from datetime import datetime
import uuid
import asyncio
import streamlit as st
from dotenv import load_dotenv
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import Distance, VectorParams
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from fastembed import TextEmbedding
from openai import AsyncOpenAI
from openai.helpers import LocalAudioPlayer
from agents import Agent, Runner
Initialize the Streamlit session state:
def init_session_state() -> None:
    defaults = {
        "initialized": False,
        "qdrant_url": "",
        "qdrant_api_key": "",
        "openai_api_key": "",
        "setup_complete": False,
        "client": None,
        "embedding_model": None,
        "processor_agent": None,
        "tts_agent": None,
        "selected_voice": "coral",
        "processed_documents": []
    }
    for key, value in defaults.items():
        if key not in st.session_state:
            st.session_state[key] = value
Configure the sidebar with API settings and voice options:
def setup_sidebar() -> None:
    with st.sidebar:
        st.title("🔑 Configuration")
        st.session_state.qdrant_url = st.text_input("Qdrant URL", type="password")
        st.session_state.qdrant_api_key = st.text_input("Qdrant API Key", type="password")
        st.session_state.openai_api_key = st.text_input("OpenAI API Key", type="password")
        voices = ["alloy", "ash", "ballad", "coral", "echo", "fable", "onyx", "nova", "sage", "shimmer", "verse"]
        st.session_state.selected_voice = st.selectbox("Select Voice", options=voices)
Set up the Qdrant vector database:
def setup_qdrant() -> Tuple[QdrantClient, TextEmbedding]:
    client = QdrantClient(
        url=st.session_state.qdrant_url,
        api_key=st.session_state.qdrant_api_key
    )
    embedding_model = TextEmbedding()
    test_embedding = list(embedding_model.embed(["test"]))[0]
    embedding_dim = len(test_embedding)
    client.create_collection(
        collection_name="voice-rag-agent",
        vectors_config=VectorParams(
            size=embedding_dim,
            distance=Distance.COSINE
        )
    )
    return client, embedding_model
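Note that create_collection raises an error if the collection already exists, which will happen whenever you rerun the app against the same Qdrant cluster. A minimal guard (assuming a recent qdrant-client that provides collection_exists) could look like this:
    # Only create the collection the first time; reuse it on later runs.
    if not client.collection_exists("voice-rag-agent"):
        client.create_collection(
            collection_name="voice-rag-agent",
            vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE)
        )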
Process PDF documents and split into chunks:
def process_pdf(file) -> List:
    with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
        tmp_file.write(file.getvalue())
    loader = PyPDFLoader(tmp_file.name)
    documents = loader.load()
    for doc in documents:
        doc.metadata.update({
            "source_type": "pdf",
            "file_name": file.name,
            "timestamp": datetime.now().isoformat()
        })
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    return text_splitter.split_documents(documents)
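One small housekeeping note: NamedTemporaryFile(delete=False) leaves the copied PDF on disk after processing. If that bothers you, delete it once the pages are loaded (our addition, not part of the original script):
    documents = loader.load()
    os.unlink(tmp_file.name)  # remove the temporary PDF copy once it has been parsed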
Store document embeddings in Qdrant:
def store_embeddings(client, embedding_model, documents, collection_name):
    for doc in documents:
        embedding = list(embedding_model.embed([doc.page_content]))[0]
        client.upsert(
            collection_name=collection_name,
            points=[
                models.PointStruct(
                    id=str(uuid.uuid4()),
                    vector=embedding.tolist(),
                    payload={
                        "content": doc.page_content,
                        **doc.metadata
                    }
                )
            ]
        )
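Embedding and upserting one chunk at a time keeps the code easy to read, but it gets slow for large PDFs. As a rough sketch (same APIs, just passing lists), you could batch the work instead:
def store_embeddings_batched(client, embedding_model, documents, collection_name):
    # fastembed accepts a list of texts and yields one vector per text
    texts = [doc.page_content for doc in documents]
    vectors = list(embedding_model.embed(texts))
    points = [
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector=vector.tolist(),
            payload={"content": doc.page_content, **doc.metadata},
        )
        for doc, vector in zip(documents, vectors)
    ]
    client.upsert(collection_name=collection_name, points=points)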
Set up specialized AI agents:
def setup_agents(openai_api_key: str) -> Tuple[Agent, Agent]:
    """Initialize the processor and TTS agents."""
    processor_agent = Agent(
        name="Documentation Processor",
        instructions="""You are a helpful documentation assistant. Your task is to:
        1. Analyze the provided documentation content
        2. Answer the user's question clearly and concisely
        3. Include relevant examples when available
        4. Cite the source files when referencing specific content
        5. Keep responses natural and conversational
        6. Format your response in a way that's easy to speak out loud""",
        model="gpt-4o"
    )
    tts_agent = Agent(
        name="Text-to-Speech Agent",
        instructions="""You are a text-to-speech agent. Your task is to:
        1. Convert the processed documentation response into natural speech
        2. Maintain proper pacing and emphasis
        3. Handle technical terms clearly
        4. Keep the tone professional but friendly
        5. Use appropriate pauses for better comprehension
        6. Ensure the speech is clear and well-articulated""",
        model="gpt-4o"
    )
    return processor_agent, tts_agent
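One thing to flag: as written, the openai_api_key argument is never used inside the function, and the Agents SDK falls back to the OPENAI_API_KEY environment variable. If you want the key typed into the sidebar to take effect, a minimal sketch is to export it before creating the agents (the instruction strings are elided here; reuse the ones above):
def setup_agents(openai_api_key: str) -> Tuple[Agent, Agent]:
    # Make the sidebar-provided key visible to the Agents SDK and the OpenAI client.
    os.environ["OPENAI_API_KEY"] = openai_api_key
    processor_agent = Agent(name="Documentation Processor", instructions="...", model="gpt-4o")
    tts_agent = Agent(name="Text-to-Speech Agent", instructions="...", model="gpt-4o")
    return processor_agent, tts_agent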
Generate query embeddings and search for relevant documents:
async def process_query(query, client, embedding_model, collection_name, openai_api_key, voice):
    query_embedding = list(embedding_model.embed([query]))[0]
    search_response = client.query_points(
        collection_name=collection_name,
        query=query_embedding.tolist(),
        limit=3,
        with_payload=True
    )
    search_results = search_response.points
Prepare the context for the LLM (still inside process_query):
context = "Based on the following documentation:\n\n"
for result in search_results:
payload = result.payload
content = payload.get('content', '')
source = payload.get('file_name', 'Unknown Source')
context += f"From {source}:\n{content}\n\n"
context += f"\nUser Question: {query}\n\n"
context += "Please provide a clear, concise answer that can be easily spoken out loud."
Generate the text and voice responses. The processor and TTS agents aren't among the function's parameters, so here we assume they were stored in session state after setup_agents() ran:
    # The agents created by setup_agents() are assumed to live in session state
    processor_agent = st.session_state.processor_agent
    tts_agent = st.session_state.tts_agent
    processor_result = await Runner.run(processor_agent, context)
    text_response = processor_result.final_output
    tts_result = await Runner.run(tts_agent, text_response)
    voice_instructions = tts_result.final_output
    async_openai = AsyncOpenAI(api_key=openai_api_key)
    async with async_openai.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice=voice,
        input=text_response,
        instructions=voice_instructions,
        response_format="pcm",
    ) as stream_response:
        await LocalAudioPlayer().play(stream_response)
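The streamed PCM audio plays immediately, but the feature list also promises a downloadable MP3. A rough sketch of that extra step, appended to the end of process_query (the return value and file naming are our assumptions, not the original code):
    # Optional: also render an MP3 the user can download from the Streamlit UI.
    audio_path = os.path.join(tempfile.gettempdir(), f"response_{uuid.uuid4()}.mp3")
    async with async_openai.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice=voice,
        input=text_response,
        instructions=voice_instructions,
        response_format="mp3",
    ) as mp3_response:
        await mp3_response.stream_to_file(audio_path)
    return text_response, audio_path
In main(), you could then display text_response and pass audio_path to st.download_button.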
Set up the main application interface:
def main() -> None:
    st.set_page_config(
        page_title="Voice RAG Agent",
        page_icon="🎙️",
        layout="wide"
    )
    init_session_state()
    setup_sidebar()
    st.title("🎙️ Voice RAG Agent")
    uploaded_file = st.file_uploader("Upload PDF", type=["pdf"])
    query = st.text_input(
        "What would you like to know about the documentation?",
        placeholder="e.g., How do I authenticate API requests?",
        disabled=not st.session_state.setup_complete
    )
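The snippet above only sets up the UI shell; you still need to wire the pieces together: initialize Qdrant and the agents once the keys are entered, index uploaded PDFs, and route the query through process_query. Here is one rough sketch of that glue (the flow and conditions are our own, adapt as needed), followed by the standard entry point:
    # Inside main(), after the widgets above:
    if st.session_state.openai_api_key and st.session_state.qdrant_url and not st.session_state.setup_complete:
        st.session_state.client, st.session_state.embedding_model = setup_qdrant()
        st.session_state.processor_agent, st.session_state.tts_agent = setup_agents(
            st.session_state.openai_api_key
        )
        st.session_state.setup_complete = True

    if uploaded_file and st.session_state.setup_complete and uploaded_file.name not in st.session_state.processed_documents:
        chunks = process_pdf(uploaded_file)
        store_embeddings(st.session_state.client, st.session_state.embedding_model, chunks, "voice-rag-agent")
        st.session_state.processed_documents.append(uploaded_file.name)
        st.sidebar.success(f"Indexed {uploaded_file.name}")

    if query and st.session_state.setup_complete:
        asyncio.run(process_query(
            query,
            st.session_state.client,
            st.session_state.embedding_model,
            "voice-rag-agent",
            st.session_state.openai_api_key,
            st.session_state.selected_voice,
        ))

if __name__ == "__main__":
    main()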
Running the App
With our code in place, it's time to launch the app.
In your terminal, navigate to the project folder and run the following command:
streamlit run rag_voice.py
Streamlit will provide a local URL (typically http://localhost:8501). Open it in your web browser, enter your API keys in the sidebar, and you're ready to upload your documents for voice RAG.
Working Application Demo
Conclusion
You've built a powerful agentic RAG system that doesn't just respond to queries – it speaks them in a natural human voice. The multi-agent approach delivers better results than a single model could, with specialized agents handling different parts of the process.
Want to take it further? Here are some ideas:
Add a voice input agent: Let users speak their questions instead of typing them.
Create a conversation memory agent: Add an agent that tracks conversation history for more contextual responses.
Add speech emotion detection: Analyze user voice input for emotional context and adjust responses accordingly.
Implement multi-language support: Create agents specialized in different languages for a multilingual experience.
Keep experimenting with different agent configurations and features to build more sophisticated AI applications.
We share hands-on tutorials like this 2-3 times a week to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills, subscribe now and be the first to access our latest tutorials.