ChatGPT becomes Multimodal 🌐
PLUS: Spotify's Multi-lingual Voice Cloning, Robots Learn from Internet Videos
Today’s top AI Highlights:
ChatGPT can now See, Hear, and Speak
Robotic Learning from Internet Human Videos
Spotify Collaborates with OpenAI for Voice Cloning in Multiple Languages
Getty Images Releases Text-to-Image Model
& so much more!
Read time: 3 mins
Latest Developments 🌍
ChatGPT can See, Hear, and Speak 🧒
OpenAI has introduced voice and image capabilities in ChatGPT. The new features will soon roll out to Plus and Enterprise users and will also be available on iOS and Android.
Key Highlights:
Voice Capability:
Lets users engage in natural, spoken conversations with ChatGPT. The feature offers five different voices, created in collaboration with professional voice actors, and is powered by a new text-to-speech model and Whisper.
Collaboration with partners like Spotify demonstrates the versatility of this voice technology for podcast translation, expanding storytelling reach.
The model is proficient at transcribing English but performs poorly with some other languages, especially those written in non-Roman scripts.
Image Capability:
Lets users show one or more images to ChatGPT for a wide range of tasks, like troubleshooting, data analysis, and visual reasoning (see the API sketch after this list).
The mobile app includes a drawing tool to focus on specific image details.
The multimodal capabilities leverage GPT-3.5 and GPT-4, applying language reasoning skills to both text and images.
GPT-4V (Vision): OpenAI has released the system card for GPT-4V, the model behind the image capabilities in ChatGPT.
OpenAI collaborated with Be My Eyes, which used GPT-4V to assist people with visual impairments.
GPT-4V's training process incorporates text and image data from the internet and licensed sources.
OpenAI conducted rigorous safety evaluations and applied mitigations, including RLHF and red-teaming, to ensure responsible deployment and to address challenges like hallucinations and unreliable interpretations in high-stakes domains.
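If image input also comes to the API, multimodal requests would plausibly use the same chat format as text, with images attached as extra content parts. Below is a minimal sketch using OpenAI's Python SDK; the model name, API availability, and image URL are assumptions, not confirmed details from the announcement:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed model name and placeholder image URL -- illustrative only.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Why won't my bike seat adjust? See the photo."},
            {"type": "image_url", "image_url": {"url": "https://example.com/bike-seat.jpg"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```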
Robots Learn from Videos 📺
Researchers at Google DeepMind and UC Berkeley have introduced Video Pre-Training for Robots (V-PTR), which leverages internet-scale human video data to enhance robotic reinforcement learning (RL) and teach robots valuable skills by having them watch human videos online.
Key Highlights:
Internet videos are rich in real-world experiences, but they lack the specific information needed for robots to understand and replicate actions. V-PTR bridges this gap and enables robots to generalize and perform tasks more effectively.
V-PTR takes a step-by-step approach, teaching robots the big picture from videos, actions that lead to outcomes, and how to apply this knowledge to specific tasks.
This research highlights the effectiveness of TD-learning, allowing robots to learn and improve by watching human actions in videos, a significant leap forward in robotic learning.
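For readers new to TD-learning: temporal-difference methods update a value estimate toward a bootstrapped target after each observed transition, which is what lets learning happen from passively watched experience. Here is a minimal, self-contained TD(0) sketch in Python; it is illustrative only, not the V-PTR implementation, and all names are made up:

```python
# Minimal TD(0) value update -- illustrative only, not the V-PTR implementation.
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Nudge the value estimate V[state] toward the bootstrapped one-step target."""
    td_target = reward + gamma * V[next_state]  # one-step estimate of return from `state`
    td_error = td_target - V[state]             # how far off the current estimate is
    V[state] += alpha * td_error                # move the estimate toward the target
    return td_error

# Toy usage: two states, one observed transition with reward 1.0
V = {"s0": 0.0, "s1": 0.0}
td_update(V, "s0", reward=1.0, next_state="s1")
print(V["s0"])  # 0.1 -- the estimate moved toward the observed outcome
```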
Speak from English to Español in a Jiffy 🎙️
Spotify is introducing an AI-powered voice translation feature in partnership with OpenAI, allowing podcasters to replicate their voices in other languages.
Key Highlights:
Initially, the tool will translate English-language podcast episodes into Spanish, with plans to add French and German translations in the near future.
The core technology behind this feature is OpenAI's speech-to-text model Whisper, which can transcribe English speech and translate speech in other languages into English (a minimal usage sketch follows this list).
Notable podcasters like Lex Fridman, Dax Shepard, and Steven Bartlett are among the first to try out the feature.
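For reference, here is a minimal sketch of Whisper's open-source Python API doing exactly those two jobs; the audio file names are placeholders:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # small, fast checkpoint; larger ones are more accurate

# Transcribe English speech as-is
english = model.transcribe("episode_en.mp3")
print(english["text"])

# Translate non-English speech into English text
translated = model.transcribe("episode_es.mp3", task="translate")
print(translated["text"])
```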
AI Art + Copyright Protection 🔒
Getty Images has released its text-to-image model, Generative AI by Getty Images, which is built on an AI model provided by Nvidia and was trained on a portion of Getty's extensive library of approximately 477 million stock assets.
Getty's tool not only competes with DALL·E 3 and Midjourney but also offers protection against copyright lawsuits and perpetual, worldwide rights to use the generated images.
Tools of the Trade ⚒️
ChatDev: Virtual software company with multiple intelligent AI agents that form a multi-agent organizational structure. It is a highly customizable and extendable framework based on LLMs for studying collective intelligence.
Recall: AI knowledge base that summarizes, categorizes, and reviews online content with features like automatic categorization and spaced repetition.
Vespio AI: Boost sales with AI-powered conversation analysis, sentiment prediction, and smart suggestions for higher win rates and revenue.
Labelbox: A data-centric platform for building smart apps, offering unified LLM creation, vision tools, AI model integration and data visualization.
Edgar: Your 24/7 AI assistant designed for streamlining tasks, automating workflows, managing outreach, and enhancing productivity through intuitive chat interactions.
😍 Enjoying so far, TWEET NOW to share with your friends!
Hot Takes 🔥
Mfers will equate having llama-3 on your local machine to holding a tactical nuke ~ anton
short timelines and slow takeoff will be a pretty good call i think, but the way people define the start of the takeoff may make it seem otherwise ~ Sam Altman
Your app’s name doesn’t matter. The most important product of this century is called ChatGPT. ~ Nikita Bier
Meme of the Day 🤡
That’s all for today!
See you tomorrow with more such AI-filled content. Don’t forget to subscribe and give your feedback below 👇
Real-time AI Updates 🚨
⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!!
PS: I curate this AI newsletter every day for FREE, your support is what keeps me going. If you find value in what you read, share it with your friends by clicking the share button below!