
Google's AI Agent that can See, Hear & Speak

PLUS: Major AI updates from Google I/O 2024, Ilya leaves OpenAI

Today’s top AI Highlights:

  1. Google extends Gemini 1.5 Pro’s context window to 2 Million tokens

  2. Google’s next text-to-video model competing with OpenAI’s Sora

  3. Project Astra real-time AI assistance in video calls

  4. Gemma 2 will be released soon, outperforming models double its size

  5. OpenAI’s Chief Scientist Ilya Sutskever leaves the company

  6. GitHub Copilot Chat is now available on GitHub mobile app

& so much more!

Read time: 3 mins

Latest Developments 🌍

Google wrapped up its I/O 2024 event yesterday and, as expected, AI was the central theme: the word “AI” was spoken precisely 120 times throughout the event. The team made a heap of AI releases and announcements. This time, however, Google also showcased its new approach - agentive AI that is capable of reasoning, planning, memorizing, and taking action to give you better assistance with more context about you. Here’s everything that was announced at the event:

  1. Gemini Advanced in Chat with Gemini

    1. Gemini Advanced now uses Gemini 1.5 Pro. Its humongous context length of 1 Million tokens can now be used in Chat with Gemini, in 35 languages.

    2. Earlier it could process only text and images. Now you can give it multiple documents and files, and it can analyze them and create charts.

  2. Gemini 1.5 Flash

    1. This is a lighter version of Gemini 1.5 Pro that closely matches 1.5 Pro’s performance across all benchmarks, at much lower latency and cost.

    2. It is also multimodal and has a 1-million-token context window, just like 1.5 Pro.

    3. It is available in Google AI Studio, Vertex AI, and via API for $0.35 per million tokens.

  3. Gemini 1.5 Pro Context: The context window has been increased from 1 Million to 2 Million tokens. It is available in private preview, and you can join the waitlist today.

  4. Gemini 1.5 Pro API:

    1. Two new API features have been introduced - video frame extraction and parallel function calling, which lets you return more than one function call at a time.

    2. In June, context caching will also be added so you only have to send parts of your prompt, including large files, to the model once.

    3. The API cost has been slashed from $7 to $3.50 per million tokens for prompts up to 128K tokens.
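To make the parallel function calling feature concrete, here is a minimal, SDK-free sketch of what it means on the client side: the model returns more than one function call in a single turn, and your code executes them all before replying. The tool names, arguments, and mock response below are hypothetical stand-ins, not part of Google’s API.

```python
# Hypothetical local "tools" the model could call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

def get_time(city: str) -> str:
    return f"12:00 in {city}"

TOOLS = {"get_weather": get_weather, "get_time": get_time}

# Mock of a parallel-function-calling response: several (name, args)
# pairs returned in ONE model turn instead of one call per turn.
mock_calls = [
    ("get_weather", {"city": "Paris"}),
    ("get_time", {"city": "Paris"}),
]

def dispatch(calls):
    # Execute every call from the single response and collect the
    # results to send back to the model in one follow-up message.
    return [TOOLS[name](**args) for name, args in calls]

results = dispatch(mock_calls)
print(results)  # ['Sunny in Paris', '12:00 in Paris']
```

The win is fewer round trips: independent tool calls that previously needed one model turn each can now be resolved in a single turn.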

  1. The new PaliGemma model is Google’s first open-source vision-language model, optimized for image captioning, visual Q&A, and other image labeling tasks.

  2. Gemma 2, the next generation of the Gemma family, will be released in June. It will have 27B parameters, outperform models 2x its size, and run efficiently on GPUs or a single TPU.

Showcasing its progress in AI agents that can reason and memorize, Google is building Project Astra, through which you’ll be able to take AI assistance on video calls. This has an edge over OpenAI’s Voice Assistant, as Google’s Gemini model natively processes video and has a large context window.

  1. Text-to-video Model Veo: Google’s new text-to-video model Veo generates high-quality 1080p videos in a wide range of cinematic and visual styles, and they can be extended beyond a minute. It can generate videos from both text and images. The VideoFX tool, powered by Veo, is available in private preview in AI Labs.

  2. Text-to-image Model Imagen 3: The latest Imagen 3 model can generate photorealistic, lifelike images, with far fewer distracting visual artifacts. It is also good at generating text in images.

  3. Music Production: Google is developing a suite of music AI tools called Music AI Sandbox with YouTube, to create new instrumental sections from scratch.

  1. Gemini in the side panel of Gmail, Docs, Drive, Slides and Sheets will now use Gemini 1.5 Pro for longer context window and better reasoning abilities.

  2. Gmail - New updates to the Gmail app will be rolled out soon with a summarize button at the top of your email thread to get the highlights. Gemini will also give suggestions for email replies which are contextually relevant.

  3. Google Workspace apps are already stitched together. Now Gemini can automate your workflow across these apps. For example, it can automatically make a spreadsheet of your receipts from Gmail, and it lets you query Sheets with Data Q&A.

  1. You can now ask multiple questions at once in Google Search. Powered by a new Gemini model customized for Search, it can do multi-step reasoning and deliver an answer to each question in a single results page.

  2. With advanced planning capabilities, Search can also plan trips and events especially customized for you. You can ask it to make adjustments as you like.

  3. The AI Overview section in Search will soon have “Break it down” and “Simpler” options to help you simplify the language and understand the answers better.

  4. Search with video with Gemini Live - When you have questions about objects in motion, you can record a video along with your question, and it’ll give you the answer.

Powered by Gemini 1.5 Pro, this feature lets you search photos and videos using questions in natural language. It can understand the context of photos to find specific details and answer questions about past events. No need to scroll through thousands of photos to find that one!

Gems are customized versions of Gemini. Simply describe what you want your Gem to do and how you want it to respond — like “you’re my running coach, give me a daily running plan and be positive, upbeat and motivating.” Gemini will take those instructions and create a Gem that responds the way you want.

Google has announced its 6th gen TPU, called Trillium, which delivers a 4.7x improvement in compute performance compared to TPU v5e. It is also over 67% more energy-efficient than TPU v5e. It will be available to cloud customers by the end of 2024.

Google has introduced LearnLM, a new family of models fine-tuned for learning, based on Gemini. It’ll soon be integrated with products like Search, YouTube, and Chat with Gemini to help you deepen understanding, rather than just giving an answer.

For instance, simplified answers in AI Overview will be powered by LearnLM. In YouTube, you’ll soon be able to “raise your hand” and chat with AI to clarify your doubts. This is especially good for long academic videos.

OpenAI’s Chief Scientist Leaves 🧳

Ilya Sutskever, OpenAI’s chief scientist and one of its co-founders, has left the company. OpenAI CEO Sam Altman announced the news on X, sharing that Sutskever is moving on to a project that holds personal significance for him. OpenAI’s Director of Research Jakub Pachocki, who has been with OpenAI since 2017, will take over Sutskever’s role.

Sutskever has kept a low profile since the drama last November, and some had actually anticipated his departure. It is significant, as OpenAI released its most prominent models under his leadership, and the shift could impact the direction of OpenAI’s future research.

😍 Enjoying so far, share it with your friends!

Tools of the Trade ⚒️

  1. Openlayer: Automates AI model evaluation by connecting to your GitHub repository and running defined tests on every commit. You can also use Openlayer to monitor live AI systems by setting up tests to run on your real-time data and receive alerts if performance degrades.

  2. GitHub Copilot Chat on mobile: GitHub Copilot Chat is now integrated with the GitHub mobile app. Get help from your AI coding assistant on the go, ask coding questions and get answers, and explore repositories, all using natural language.

  3. VoiceCheap: Dub and translate videos into 30 languages, generate subtitles, and remove background noise. VoiceCheap offers features like SmartSync for natural timing, lip-sync for multi-speaker videos, and text-to-speech services.

  4. Awesome LLM Apps: Build awesome LLM apps using RAG for interacting with data sources like GitHub, Gmail, PDFs, and YouTube videos through simple text prompts. These apps let you retrieve information, engage in chat, and extract insights directly from content on these platforms.
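The RAG pattern these apps rely on can be sketched in a few lines: retrieve the chunks most relevant to a query, then hand them to an LLM as context. The toy documents and word-overlap scoring below are illustrative stand-ins; real apps use embeddings and a vector store.

```python
# Toy corpus standing in for chunks pulled from GitHub, Gmail, PDFs, etc.
documents = [
    "The PR adds retry logic to the GitHub client.",
    "This video explains transformer attention.",
    "The PDF covers quarterly revenue figures.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Score each doc by word overlap with the query (toy retriever;
    # production systems rank by embedding similarity instead).
    q = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

# Retrieval step: pick the best-matching chunk for the question.
context = retrieve("what does the PR change in the GitHub client?", documents)

# Generation step: the retrieved chunk becomes the LLM's grounding context.
prompt = f"Answer using this context:\n{context[0]}"
```

The same retrieve-then-prompt loop underlies chatting with a repo, an inbox, or a video transcript; only the chunk source changes.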

Hot Takes 🔥

  1. After watching Google I/O, it's safe to say what OAI showed yesterday was mind-blowing!! 🤯🤯Astra is a prototype voice assistant and seemed like a 2-year-old baby to OAI's Scarlett Johansson!! ~Bindu Reddy

  2. Thanks to OpenAI for forcing Google to up its game. ~Pedro Domingos

Meme of the Day 🤡

That’s all for today! See you tomorrow with more such AI-filled content.

Real-time AI Updates 🚨

⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!

PS: I curate this AI newsletter every day for FREE, your support is what keeps me going. If you find value in what you read, share it with your friends by clicking the share button below!
