
OpenAI Advanced Voice Mode API

PLUS: Microsoft Copilot can see, hear, speak and think deeper

Today’s top AI Highlights:

  1. OpenAI releases Advanced Voice Mode API, prompt caching, and a lot more

  2. Create interactive, stateful chat interfaces in your AI apps with assistant-ui and LangGraph

  3. Microsoft Copilot can now see, hear, speak, and think deeply (all together)

  4. Anthropic hires another OpenAI co-founder

  5. Scale AI apps effortlessly with Lepton’s 600+ tokens/sec and distributed inference

& so much more!

Read time: 3 mins

AI Tutorials

Meta’s new Llama 3.2 models are here, offering incredible advancements in speed and accuracy for their size. Do you want to fine-tune the models but are worried about the complexity and cost? Look no further!

In this blog post, we’ll walk you through fine-tuning the Llama 3.2 models (1B and 3B) with Unsloth AI and Low-Rank Adaptation (LoRA) in just 30 lines of Python code, using your own dataset. A condensed sketch of the flow follows below.

With Unsloth, training runs about 2x faster than standard fine-tuning. And the best part? You can fine-tune Llama 3.2 for free on Google Colab.
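For a quick preview before opening the full post, here’s a condensed sketch of that flow: load Llama 3.2 1B in 4-bit, attach LoRA adapters, and train with TRL’s SFTTrainer. The checkpoint name, dataset, and hyperparameters below are illustrative placeholders, not the tutorial’s exact values.

```python
# Condensed sketch of LoRA fine-tuning with Unsloth; checkpoint name,
# dataset, and hyperparameters are illustrative, not the tutorial's exact values.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load Llama 3.2 1B in 4-bit so it fits a free Colab GPU
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: only these small low-rank matrices get trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Swap in your own dataset; it just needs a "text" column after formatting
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")
dataset = dataset.map(lambda ex: {
    "text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"
})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```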

We share hands-on tutorials like this 2-3 times a week, designed to help you stay ahead in the world of AI. If you're serious about levelling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.

🎁 Bonus worth $50 💵

Share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to get an AI resource pack worth $50 for FREE. Valid for a limited time only!

Latest Developments

All the Scoop from OpenAI Dev Day 🧑‍💻

OpenAI’s Dev Day 2024 wrapped up yesterday, but unfortunately it wasn’t livestreamed this time, so many of us felt the FOMO and pieced together updates from those lucky enough to attend. The official announcements have now dropped, and they offer plenty of new tools and features for developers to explore. The most exciting reveal is the new Realtime API, which brings speech-to-speech capabilities like those of the Advanced Voice Mode, alongside other notable updates like prompt caching and vision fine-tuning.

Key Highlights:

  1. Realtime API (Advanced Voice Mode) - You can now build more natural, speech-to-speech experiences with low latency. The new API supports continuous audio streaming, real-time function calling, and improved voice interactions using six preset voices. A minimal connection sketch in Python follows this list.

  2. Realtime API pricing - Something that might put off a lot of developers (humans at call centers charge less). The API bills text tokens and audio tokens separately, and is rate-limited to approximately 100 simultaneous sessions for Tier 5 developers, with lower limits for Tiers 1-4.

| API Price | Input tokens | Output tokens |
| --- | --- | --- |
| Text | $5 per 1M | $20 per 1M |
| Audio | $100 per 1M | $200 per 1M |
| Total (approx.) | $0.06 per minute of audio | $0.24 per minute of audio |

  3. Chat Completions API - Audio input and output have been added to the Chat Completions API. This allows both text and audio in and out of a single call, simplifying multimodal app development (see the sketch below).

  4. Playground with Autoprompting - The Playground now offers autoprompting that generates few-shot examples and function-calling schemas for more structured prompts.

  5. Vision Fine-tuning - Now available on the latest GPT-4o model for all developers on paid tiers. You can fine-tune the model with as few as 100 images, and OpenAI is offering 1M training tokens per day for free through October 31, 2024.

  6. Prompt Caching - Now automatically applied on the latest versions of GPT-4o, GPT-4o mini, o1-preview, and o1-mini, as well as fine-tuned versions of those models. It cuts the cost of cached input tokens by 50% and reduces latency by reusing tokens from previously processed prompts (see the usage check below).

  7. Model Distillation - The new distillation workflow lets you fine-tune smaller, cost-efficient models using outputs from advanced models, making it easier to match performance on specific tasks while lowering overall API costs.
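To make the Realtime API concrete, here’s a minimal Python sketch that opens a session over WebSocket and requests one spoken reply. The endpoint, headers, and event types follow OpenAI’s beta launch docs, but treat it as a sketch: beta details may shift, and the `websockets` package’s header argument was renamed in newer releases.

```python
# Hedged sketch: open a Realtime API session and request one response.
# Endpoint, headers, and event types follow OpenAI's beta launch docs.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # On websockets >= 14, pass additional_headers= instead of extra_headers=
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Client event asking the server to produce a spoken + text response
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the user in one short sentence.",
            },
        }))
        # The server streams events (audio deltas, transcripts) until done
        async for message in ws:
            event = json.loads(message)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())
```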
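The Chat Completions audio update is simpler to try: one call can now return both a text transcript and a base64-encoded spoken reply. The `gpt-4o-audio-preview` model name comes from the announcement; the rest is a hedged example, not a full recipe.

```python
# Sketch of audio output in Chat Completions; model name per the announcement.
import base64

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)

# The spoken reply arrives base64-encoded, alongside a text transcript
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("hello.wav", "wb") as f:
    f.write(wav_bytes)
print(completion.choices[0].message.audio.transcript)
```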
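And prompt caching needs no code changes at all, but you can verify it’s kicking in: the response’s usage object reports how many prompt tokens came from cache. A small sanity check, assuming the `prompt_tokens_details.cached_tokens` field from OpenAI’s caching docs (only prompt prefixes past roughly 1,024 tokens are cached):

```python
# Sanity check that prompt caching applies on repeated identical prefixes;
# assumes the usage.prompt_tokens_details.cached_tokens field from the docs.
from openai import OpenAI

client = OpenAI()

# A long, identical system prefix (>1,024 tokens) is what gets cached
long_prefix = "You are a meticulous customer-support agent. " * 150

for question in ["How do I reset my password?", "How do I close my account?"]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": long_prefix},
            {"role": "user", "content": question},
        ],
    )
    # Expect a non-zero count on the second call if the prefix was cached
    print(resp.usage.prompt_tokens_details.cached_tokens)
```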

A Polished Chat Frontend for LangGraph Agents 💬

Developers building AI-driven applications with LangGraph now have a new tool to streamline the frontend: assistant-ui. This React-based chat interface is optimized for embedding AI interactions in web applications and integrates smoothly with LangGraph Cloud. The combination lets developers deploy stateful, scalable AI agents behind user-friendly interfaces. Key features like streaming, human-in-the-loop interactions, and flexible data rendering make it a strong option for building advanced AI-driven chat experiences.

Key Highlights:

  1. Direct Streaming of Responses - Assistant-ui handles streaming from LangChain models in real-time. This means developers can instantly display token-by-token AI responses, enhancing responsiveness and reducing user waiting time.

  2. Human-in-the-Loop - Assistant-ui allows users to review and approve agent actions, providing an extra layer of control. This is particularly useful in applications involving critical decisions, like financial transactions, where oversight is crucial.

  3. Generative UI for Data Visualization - It enables the integration of structured outputs, like tool call results, into custom UI components. This allows developers to create visually rich displays for complex data, such as financial analysis or stock prices.

  4. Multimodal Interaction - The interface supports images and documents to interact with agents, making conversations more dynamic.

  5. Getting started - It’s pretty easy to get LangGraph Cloud and an assistant-ui frontend working together. For detailed instructions on generative UI, human-in-the-loop interactions, tool-call approvals, and integration into existing apps, refer to the documentation. For a feel of the backend side, see the sketch below.
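On the backend side, a LangGraph agent is just a compiled graph over a message state. Here’s a minimal Python sketch of the kind of stateful chat graph assistant-ui’s LangGraph integration streams from; the single-node design and the gpt-4o-mini model are illustrative assumptions, not details from the assistant-ui docs.

```python
# A minimal LangGraph chat backend of the sort assistant-ui can front;
# the single-node graph and model choice are illustrative assumptions.
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START, END

llm = ChatOpenAI(model="gpt-4o-mini")

def chatbot(state: MessagesState) -> dict:
    # Append the model's reply to the running conversation state
    return {"messages": [llm.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("chatbot", chatbot)
builder.add_edge(START, "chatbot")
builder.add_edge("chatbot", END)
graph = builder.compile()

# Each invocation carries the full message history as state
result = graph.invoke({"messages": [("user", "What can you help me with?")]})
print(result["messages"][-1].content)
```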

Quick Bites

Microsoft Copilot has been upgraded with a sleek new UI and powerful AI features to enhance how you interact with it. These additions include natural voice control, real-time visual understanding, deeper reasoning for complex questions, and a fresh approach to search that delivers smarter, more detailed results.

  1. Copilot Voice – Talk to your Copilot for quick answers, brainstorming, or even daily summaries. Currently available in English across the U.S., U.K., Canada, Australia, and New Zealand.

  2. Copilot Vision – A breakthrough feature that reads and understands text and images on your screen, helping you navigate tasks effortlessly. Available soon in the U.S. via Copilot Labs.

  3. Think Deeper – Delivers more thoughtful and detailed responses to complex queries by taking additional time to reason. This experimental feature is now live in Copilot Labs for select regions.

  4. Generative Search – Bing’s AI-powered search now generates deeper insights from your queries, much like Google's AI Overviews. Available in beta for U.S. users.

SoftBank is set to invest $500 million in OpenAI's latest $6.5 billion funding round. Apple has reportedly dropped out of plans to participate in the large funding round, which currently values the AI startup at $150 billion before the SoftBank investment (even after $500 million, SoftBank gets peanuts 🥜).

Anthropic has hired another (lesser-known) OpenAI co-founder, Durk Kingma. At OpenAI, Kingma focused on foundational research, leading the algorithms team that developed generative AI systems like DALL·E and ChatGPT. His hiring is yet another talent coup for Anthropic, which has already recruited OpenAI’s former safety lead, Jan Leike, and another OpenAI co-founder, John Schulman.

Tools of the Trade

  1. Lepton AI: A cloud platform to simplify AI inference and training with high-performance GPU infrastructure. It lets you build, test, and deploy models quickly while supporting large-scale workloads and providing tools like dynamic batching and fast LLM engines.

  2. Gaia: Build and train custom neural machine translation models without coding. It provides an easy interface for uploading training data, configuring model parameters, and deploying the translator with API access and real-time tracking.

  3. Crawl4AI: An open-source web crawler and scraper optimized for LLMs. It enables fast, multi-format data extraction with support for dynamic multimedia content, proxy use, and asynchronous crawling.

  4. Awesome LLM Apps: Build awesome LLM apps using RAG to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos through simple text. These apps will let you retrieve information, engage in chat, and extract insights directly from content on these platforms.

Hot Takes

  1. Sam Altman says o1-preview is "deeply flawed," but o1 will be better!!
    A few months ago, he said GPT-4 sucks and that the next model will take everyone's breath away. 😕
    I guess since everything sucks or is deeply flawed, we shouldn't be using any of his models, right?🤣 ~
    Bindu Reddy

  2. The secret to true AGI will not be a single omnipotent model...
    It'll be a series of well-defined ontologies for specific problem spaces with many AI models + code to strike the right balance of intuition + logical guardrails for agentic reasoning, planning and decision-making ~
    Ted Werbel

Meme of the Day

That’s all for today! See you tomorrow with more such AI-filled content.

🎁 Bonus worth $50 💵 

Share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to get an AI resource pack worth $50 for FREE. Valid for a limited time only!

Unwind AI - Twitter | LinkedIn | Threads | Facebook

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉 
