Claude Sonnet 3.5 is the New Autonomous AI Agent
PLUS: Opensource SOTA video generation model, Evaluate code generation with LLMs
Today’s top AI Highlights:
Anthropic teaches Claude to autonomously operate a computer
Skip test cases and let small LMs evaluate your code better
Perplexity lets you do multi-step complex research with a simple prompt
Opensource text-to-video model that outperforms Sora, Pika, and Gen-3
Opensource, local-first Figma for React apps
& so much more!
Read time: 3 mins
AI Tutorials
Here is a smart AI agent that not only retrieves answers from PDFs but also searches the web in real time—all with minimal code.
In this tutorial, we’ll walk through how to create a Retrieval-Augmented Generation (RAG) agent that uses GPT-4o for intelligent querying. Your agent will tap into a PDF-based knowledge base and perform web searches using DuckDuckGo, providing rich insights through a sleek playground interface.
Using Phidata, a framework designed for building agent-based systems, we’ll streamline the entire setup. You’ll combine tools like LanceDB for vector-based searches, PDF knowledge embedding, and interactive browsing. The result? A powerful AI assistant ready to handle complex queries with ease.
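The retrieve-then-search flow described above can be sketched in plain Python. This is a toy illustration, not the tutorial's Phidata code: bag-of-words cosine similarity stands in for LanceDB vector search, and the names `retrieve`, `answer_context`, `web_search`, and the `min_score` threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, min_score=0.2):
    """Return the best-matching PDF chunk, or None if nothing is close enough.

    min_score is an arbitrary illustrative threshold, not a tuned value.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    best_score, best_chunk = max(scored, key=lambda p: p[0], default=(0.0, None))
    return best_chunk if best_score >= min_score else None


def answer_context(query, chunks, web_search):
    """Prefer the PDF knowledge base; fall back to web search otherwise.

    web_search is any callable taking a query string (a stand-in for a
    DuckDuckGo tool call in a real agent).
    """
    hit = retrieve(query, chunks)
    if hit is not None:
        return ("pdf", hit)
    return ("web", web_search(query))
```

In a real agent the language model usually decides which tool to call; the fixed threshold here only makes the routing idea concrete.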
We share hands-on tutorials like this 2-3 times a week, designed to help you stay ahead in the world of AI. If you're serious about levelling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.
🎁 Bonus worth $50 💵
Latest Developments
Anthropic has rolled out two new model upgrades, Claude 3.5 Sonnet and Claude 3.5 Haiku, along with a standout new feature: computer use. This feature lets Sonnet interact with software the way people do, by navigating screens, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta, and you can use it to automate a host of tasks, from repetitive form-filling to application testing.
Alongside this, both models bring improvements particularly in coding and problem-solving, outperforming advanced models like OpenAI o1-preview and specialized systems designed for agentic coding.
Key Highlights:
Computer Use Beta - Claude 3.5 Sonnet takes screenshots of your computer screen, interprets the GUI, and generates appropriate tool calls to perform requested tasks. It can plan its actions, navigate websites, and follow multi-step processes, just like a human would operate a computer. Refer to the documentation to use the computer-use API with Sonnet.
Model Performance - Both Claude 3.5 Sonnet and Haiku have improved across the board in reasoning, coding, math, and problem-solving. Claude 3.5 Sonnet shines in coding benchmarks, outperforming OpenAI’s o1-preview on HumanEval and SWE-bench Verified.
Availability and Updates - Claude 3.5 Sonnet is available now on claude.ai and via the API, and Claude 3.5 Haiku is set to be released later this month. Both models can also be accessed on Amazon Bedrock and Vertex AI.
Knowledge Cutoff - Claude 3.5 Sonnet has an April 2024 knowledge cutoff, while Haiku’s is July 2024. You can start experimenting with these models immediately. Here’s a quick demo of the computer-use API that you can try now.
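The screenshot, interpret, act cycle described in the highlights can be sketched as a small loop. This is a minimal illustration under stated assumptions, not Anthropic's reference implementation: `next_action` is a hypothetical stand-in for the model call, the handlers are stubs, and the tool definition and action names (`screenshot`, `left_click`, `type`) are written as recalled from the public-beta documentation, so verify them there before relying on them.

```python
# Tool definition in the shape the public-beta docs describe (assumption:
# check the type string and fields against the current documentation).
COMPUTER_TOOL = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
}


def run_agent_loop(next_action, handlers, max_steps=10):
    """Run the act/observe loop until the model stops or max_steps is hit.

    next_action(history) is a hypothetical callable standing in for the
    model: it returns a dict such as {"action": "left_click",
    "coordinate": [x, y]}, or None when the task is finished.
    handlers maps an action name to a function that performs it and
    returns a result (e.g. a screenshot or a status string).
    """
    history = []
    for _ in range(max_steps):
        action = next_action(history)
        if action is None:
            break
        handler = handlers.get(action["action"])
        if handler is None:
            # Report unknown actions back instead of crashing the loop.
            result = f"unsupported action: {action['action']}"
        else:
            result = handler(action)
        history.append({"action": action, "result": result})
    return history


def scripted_model(script):
    """Replay a fixed list of actions; useful for testing without an API key."""
    steps = iter(script)

    def next_action(history):
        return next(steps, None)

    return next_action
```

The `max_steps` cap is a deliberate design choice: an agent that controls a real desktop should have a hard stop so a confused model cannot act indefinitely.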
As LLMs take on more complex coding tasks, robust evaluation methods become essential. But let's face it: test cases are cumbersome, token-based metrics miss key details, and relying on huge models like GPT-4o gets expensive fast.
CodeJudge is a new way to evaluate LLM-generated code without the headache of traditional test cases. This LLM-powered framework analyzes the code's semantic correctness and provides two types of assessments: a binary correct/incorrect judgment and a nuanced evaluation of how well the code aligns with what you intended. Early results are impressive, showing CodeJudge beating existing methods, even with smaller LLMs like Llama-3-8B-Instruct.
Key Highlights:
Outperforms existing methods - CodeJudge consistently outperforms current evaluation methods across various LLMs and programming languages (including Python, JavaScript, Java, C++, and Go). This makes it a reliable choice no matter your preferred LLM or project.
Provides deeper insights - CodeJudge goes beyond simple pass/fail by offering both a binary correctness judgment and a graded evaluation based on error severity. This helps you understand why code isn't working, not just that it isn't working.
Easy to integrate - CodeJudge is available on GitHub and designed for easy integration into existing LLM-driven code generation systems. No complex setup required—get started quickly and easily.
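To make the two assessment types concrete, here is a rough sketch of how a binary verdict and a severity-graded score could be derived from the issues an LLM judge reports. It is not CodeJudge's actual implementation: the severity labels, the penalty weights, and the names `Inconsistency`, `graded_score`, and `binary_verdict` are hypothetical, and the LLM call that would produce the issue list is left out.

```python
from dataclasses import dataclass

# Hypothetical penalty per severity level, in points out of 100.
SEVERITY_PENALTY = {"negligible": 0, "small": 5, "major": 50, "fatal": 100}


@dataclass
class Inconsistency:
    """One mismatch between the task description and the generated code."""
    description: str
    severity: str  # one of the SEVERITY_PENALTY keys


def graded_score(issues):
    """Map reported issues to a score in [0, 1]; 1.0 means no penalties."""
    penalty = sum(SEVERITY_PENALTY[i.severity] for i in issues)
    return max(0.0, 1.0 - penalty / 100)


def binary_verdict(issues):
    """Treat code as correct only if every reported issue is negligible."""
    return all(i.severity == "negligible" for i in issues)
```

In practice the list of `Inconsistency` items would come from prompting a model to compare the code against the task description, which is where a smaller LLM such as Llama-3-8B-Instruct would be used.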
Quick Bites
AI startup Genmo has released a new opensource text-to-video model, Mochi 1, that generates high-fidelity videos with excellent motion quality and prompt adherence. Licensed under Apache 2.0, Mochi 1 offers smooth 30fps videos with precise control. You can try Mochi 1 for free via the Genmo playground or download the weights from Hugging Face.
Stability AI has opensourced Stable Diffusion 3.5 text-to-image models, including a Medium, Large, and Large Turbo model. These models are highly customizable for their size, run on consumer hardware, and are free for both commercial and non-commercial use. You can download Stable Diffusion 3.5 Large and Large Turbo models from Hugging Face and the inference code on GitHub.
Cohere has released Embed 3, a multimodal AI search model that generates embeddings from both text and images for seamless search across diverse data types. Now available on Cohere's platform and Amazon SageMaker, Embed 3 can help businesses efficiently locate multimodal assets like reports, product catalogs, and design files.
Perplexity AI's Pro Search now features "Reasoning Mode," a beta feature that tackles complex, multi-step queries. This upgrade enables you to ask layered questions requiring extensive research and analysis, automatically compiling information into tables. It will automatically turn on when it detects hard prompts. Try prompting “pull me all the IMO medal winners from China in the last 5 years and give it to me a table.”
Tools of the Trade
Onlook: Opensource visual editor that allows you to edit your React app's UI in real-time and directly writes those changes back to the code. It integrates seamlessly with React and TailwindCSS projects, allowing you to modify layouts, styles, and components without disrupting your existing build process.
AgentStack: A command-line tool that creates boilerplate code for AI agent projects using popular frameworks like CrewAI and AutoGen. It simplifies project setup and provides utilities for code generation, testing, and deployment.
Trieve: API-driven platform for building search, recommendation, and RAG apps. It offers features like semantic and full-text search, hybrid search with re-ranking, customizable relevance tuning, and self-hosting options.
Awesome LLM Apps: Build awesome LLM apps using RAG to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos through simple text. These apps will let you retrieve information, engage in chat, and extract insights directly from content on these platforms.
Hot Takes
I am not convinced ai "computer use" is good.
Its a cool af tech demo, but is it actually useful?
1. Theres a reason UI's exist. Chatbot interface is not good for high visual flows. Even if the computer can use photoshop for me, I won't want it to because I want to review every step. Similar for shopping, making any media artifact, or just about anything else, Im gonna want to double check things
2. Maybe theres some cool use cases of stringing things together? BUT, the bottleneck is human review of steps to make sure the ai did it correctly. Not can the computer does this all for me and spit out a final step.
~ Nick Dobos
Orion or GPT-5 is definitely not being released now because they are waiting to see Anthropic next move first.
If Opus/Haiku 3.5 brings major improvements, they'll likely to drop GPT-4.5. If not, expect GPT-5 to come around Dec or Q1 2025...
~ Haider.
Meme of the Day
That’s all for today! See you tomorrow with more such AI-filled content.
🎁 Bonus worth $50 💵
Share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to get an AI resource pack worth $50 for FREE. Valid for a limited time only!
PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉