• unwind ai
  • Posts
  • Build Voice AI Agents with No-Code

Build Voice AI Agents with No-Code

PLUS: Amazon competes with OpenAI and Google Realtime models, MCP security vulnerabilities

In partnership with

Today’s top AI Highlights:

  1. Build voice AI agents with a drag-and-drop no-code builder

  2. Zapier and WhatsApp MCP servers are not as secure as you thought

  3. Amazon brought speech-to-text, LLM, and text-to-speech in one model

  4. AI models can now generate 1-minute-long videos in a single shot

  5. Opensource alternative to OpenAI Operator that can use computers

& so much more!

Read time: 3 mins

AI Blogs

Meta released its new Llama 4 Scout model with a massive 10 million token context window, and the tech bros didn’t take a second to declare that "RAG is dead," but this couldn't be further from the truth. This blog cuts through the hype to explain why RAG isn't just about extending context windows – it's fundamentally about knowledge organization, information retrieval, and knowledge updates that remain essential regardless of context size.

We broke down the hidden limitations of these super-sized models, including the gaps between claimed capabilities and actual performance, and the substantial computational resources they require.

The future isn't about choosing between massive context windows or RAG – it's about intelligent hybrid approaches that combine the precision and freshness of retrieval with the synthesis power of large context models

Don’t forget to share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to support us!

Latest Developments

MCP is quickly becoming the go-to standard for connecting AI agents to external tools, streamlining how these systems interact with the real world. Major players like OpenAI and Google are throwing their weight behind it. MCP's increasing adoption brings convenience, but also introduces new security challenges that developers should be aware of.

A fundamental problem with MCP's security model is that it assumes tool descriptions are trustworthy and benign, which recent analyses have shown to be false. An analysis by Invariant Labs exposed critical vulnerabilities that allow attackers to inject malicious instructions into seemingly innocent tools, compromising sensitive data while flying completely under users' radar.

Key Highlights:

  1. Tool Poisoning Attacks - Attackers can hide malicious code in tool descriptions that are visible to AI models but not to users. These hidden instructions can direct AI to access sensitive files (SSH keys, config files) and exfiltrate data while maintaining a facade of legitimate operation.

  2. Rug Pulls & Sleeper Attacks - MCP servers can change tool descriptions after initial approval without notifying users. A server might appear harmless at installation but activate malicious instructions later, bypassing security checks entirely.

  3. Cross-Server Contamination - When an agent connects to multiple MCP servers, malicious servers can inject instructions that modify how the agent interacts with trusted servers. In one demonstration, the team exfiltrated WhatsApp chat histories by manipulating how an agent interacted with a legitimate WhatsApp MCP server instance.

  4. Security Recommendations - Implement full transparency of tool descriptions in your UIs, pin server versions to prevent unauthorized changes, enforce strict isolation between different MCP servers, and deploy comprehensive agent guardrails that validate all tool interactions.

The #1 AI Meeting Assistant

Still taking manual meeting notes in 2025? Let AI handle the tedious work so you can focus on the important stuff.

Fellow is the AI meeting assistant that:

✔️ Auto-joins your Zoom, Google Meet, and Teams calls to take notes for you.
✔️ Tracks action items and decisions so nothing falls through the cracks.
✔️ Answers questions about meetings and searches through your transcripts, like ChatGPT.

Try Fellow today and get unlimited AI meeting notes for 30 days.

If you’ve waited to build voice AI agents until OpenAI released a new Voice Pipeline in its Agents SDK, then you’ve surely missed out on great tools. One of them is Vogent, a platform for building and serving voice AI agents that provides higher-level building blocks to get a voice agent working quickly and easily.

It supports the typical design process of a voice agent with model selection, voice customization, and hosting options via phone numbers or API access, along with other great features like a drag-and-drop no-code flow builder, tooling, adding RAG capabilities, counterfactual analysis, and call transfers.

Key Highlights:

  1. Flow Builder Interface - A drag-and-drop builder that creates structured conversations while maintaining flexibility. Each node focuses on a specific goal (like asking a question) while seamlessly transitioning when that goal is achieved, making it easy to build multi-step conversations that feel natural.

  2. Spelling-Optimized Voices - Choose from multiple voice providers including Cartesia, OpenAI GPT-4o speech models, Sesame speech models, or bring your own custom models. Vogent also has custom-trained voices that don't sound artificial when spelling words or numbers — a critical issue that "killed almost every engagement."

  3. IVR Detection Model - Intelligent system that analyzes audio streams to distinguish between automated phone systems and humans, enabling agents to switch between different LLMs optimized for each scenario for better performance.

  4. Testing and Versioning Tools - Built-in support for model versioning, counterfactual testing against past call recordings, and detailed call analytics to continuously improve agent performance in real-world scenarios.

Quick Bites

AI models can now generate 1-minute-long videos from a text prompt in one shot. Researchers from NVIDIA, Stanford, UCSD, UC Berkeley, and UT Austin have introduced Test-Time Training (TTT) that adds neural network-based hidden states to pre-trained Diffusion Transformers to handle long video contexts. This method generated more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. The code is available publicly for research. Do check out their Tom and Jerry examples, they are incredible!

Jina AI has released jina-reranker-m0, a multilingual multimodal reranker that can rank visually rich documents across 29+ languages. This new reranker excels at handling documents containing text, figures, tables, and various layouts while achieving top performance on both visual retrieval benchmarks and text-only tasks, including code search. Available via Jina's API with 1 million free tokens for new users.

Amazon has released Nova Sonic, a foundation model that integrates speech understanding and generation into a single system, capturing not just words but also tone, inflection, and pacing for more natural voice interactions. This eliminates the need to stitch multiple models (speech recognition, LLMs, text-to-speech) for a voice pipeline. The model competes strongly with OpenAI and Google’s Realtime models (GPT-4o and Gemini 2.0) while being ~80% cheaper than OpenAI. It is now available on Amazon Bedrock.

Remember that disc-like "opensource brain interface" Omi that was launched in the recent flood of AI wearables? The team has released SDKs for developers to create custom applications that can capture, transcribe, and analyze conversations in real-time. The toolkits connect to Omi devices via Bluetooth, process Opus-encoded audio, and use Deepgram model for transcription. These apps can also be monetized through their marketplace with 50+ applications already available.

Tools of the Trade

  1. Browser MCP: Allows MCP clients like Claude, Windsurf, Cursor, and VS Code to interact directly with your browser. It runs locally on your machine, uses your existing browser profile so you are logged into all the services, and avoids basic bot detection and captchas.

  2. Spongecake: Opensource tool to build their own computer-using AI agents similar to OpenAI's Operator, with a Next.js/React frontend and Flask backend to automate desktop application interactions. It provides a virtual desktop environment where you can create automation workflows for applications with poor APIs or for enterprises with restrictive environments

  3. Web2llm: Scrapes web documentation into markdown files to keep LLMs and agents updated with the latest docs. Browse 100s of pre-scraped documentation or add your own docs to scrape, and easily integrate the markdown content into prompts to vibe code.

  4. Awesome LLM Apps: Build awesome LLM apps with RAG, AI agents, and more to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos, and automate complex work.

Hot Takes

  1. Agent management is the same as management of humans
    If you have to go all the way down into the ground truth and double check the work especially for the first time you are working with a given prompt, LLM or human report! ~
    Garry Tan


  2. the future will split homo sapiens into the pleasure-seekers and the truth-seekers
    the former will submerge into ghiblified simulations, achieving ever deeper art & absurdity
    the latter will fight to the death for resources to build ever larger brains for understanding reality ~
    James Campbell

That’s all for today! See you tomorrow with more such AI-filled content.

Don’t forget to share this newsletter on your social channels and tag Unwind AI to support us!

Unwind AI - X | LinkedIn | Threads | Facebook

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉 

Reply

or to participate.