Open-Source Computer Use Agent Outperforms OpenAI & Anthropic
PLUS: AgentKit by BCG X, open-source SOTA multimodal Gemma 3
Today’s top AI Highlights:
BCG X releases an open-source full-stack starter kit to build agentic apps
Open-source computer and smartphone use agent outperforms OpenAI and Anthropic
Google releases open-source SOTA multimodal Gemma 3 models
Universal Agent Interface to interact with various MCP servers
AI visual IDE to build React apps 10x faster
& so much more!
Read time: 3 mins
AI Tutorials
OpenAI just released its Agents SDK, a rebranded, production-ready, and more advanced version of the OpenAI Swarm framework for building multi-agent applications. We couldn't wait to get our hands on it and build something useful. Keep reading for more details👇
In this tutorial, we'll walk you through building a multi-agent research assistant using OpenAI's Agents SDK. You'll create a system where multiple specialized agents work together to research any topic, collect facts, and generate comprehensive reports, all within an application that's easy to use and extend.
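Before the full walkthrough, here's a taste of the SDK: a minimal two-agent handoff, based on the openai-agents package's basic usage (the agent names and instructions are just our placeholders):

```python
# pip install openai-agents   (requires OPENAI_API_KEY in the environment)
from agents import Agent, Runner

# A specialist agent the triage agent can hand off to
research_agent = Agent(
    name="Researcher",
    instructions="Collect key facts about the user's topic and list them.",
)

# The entry-point agent decides whether to answer itself or hand off
triage_agent = Agent(
    name="Triage",
    instructions="If the user asks for research, hand off to the Researcher.",
    handoffs=[research_agent],
)

result = Runner.run_sync(triage_agent, "Research the history of transformers in AI")
print(result.final_output)
```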
We share hands-on tutorials like this 2-3 times a week, designed to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills, subscribe now and be the first to access our latest tutorials.
Latest Developments

BCG X has released AgentKit, an open-source, full-stack framework that uses LangChain, Next.js 14, and FastAPI to help you build AI agents faster. This starter kit provides a pre-configured architecture for creating "constrained agents," which means you get built-in control over the agent's decision-making process.
Expect a reactive UI with streaming, code rendering, and action status updates right out of the box, alongside integrated features like authentication and task queuing to get you closer to a deployable MVP.
Key Highlights:
Ready-to-use component stack - AgentKit provides a modular, easy-to-configure tech stack with FastAPI, Next.js 14, and LangChain integration. The starter kit includes authentication, queue management, caching, and monitoring capabilities to help you build production-ready MVPs.
Agent-specific UI components - The framework includes a React-based chat interface with support for streaming responses, rendering tables, visualizations, code blocks, and action status indicators. These components are specifically designed for agent interactions and can be easily configured.
Constrained Routing - AgentKit addresses the common reliability issues of ReAct-style agents by using pre-configured "Action Plans" that constrain the possible execution paths. This leverages human domain expertise to guide the agent through predictable routes (see the sketch after this list).
Built-in LangSmith integration - It comes with native LangSmith support for comprehensive tracing, debugging, and evaluation of your agent applications. You can assess the meta agent's routing decisions, evaluate individual tool performance, and measure final output quality.
Transparent execution - It streams intermediate outputs to users, showing the agent's reasoning process and actions in real-time. This creates a more transparent experience where database queries, PDF retrievals, and other agent actions can be seen as they happen.
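To make "constrained routing" concrete, here's a minimal, hypothetical sketch of the pattern in Python. The ActionPlan class, plan names, and tools are our own illustrative inventions, not AgentKit's actual API; the point is that a router picks from a fixed menu of human-authored step sequences instead of improvising a free-form ReAct loop.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical illustration of constrained routing; not AgentKit's real API.

@dataclass
class ActionPlan:
    """A fixed, human-authored sequence of tool steps."""
    name: str
    steps: list[Callable[[str], str]]

    def run(self, query: str) -> str:
        result = query
        for step in self.steps:
            result = step(result)  # each step's output feeds the next
        return result

# Illustrative tools (stand-ins for real retrieval/SQL/summarization tools)
def search_pdfs(q: str) -> str: return f"pdf chunks for: {q}"
def query_database(q: str) -> str: return f"sql rows for: {q}"
def summarize(text: str) -> str: return f"summary of: {text}"

PLANS = {
    "document_question": ActionPlan("document_question", [search_pdfs, summarize]),
    "metrics_question": ActionPlan("metrics_question", [query_database, summarize]),
}

def route(query: str) -> ActionPlan:
    # In AgentKit a meta agent (an LLM) picks the plan; a keyword check stands in here
    return PLANS["metrics_question" if "revenue" in query else "document_question"]

print(route("What was Q3 revenue?").run("What was Q3 revenue?"))
```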

We have Anthropic’s and OpenAI’s Computer Use models available via API. But here’s a computer use AI agent outperforming both of them, completely open-sourced under Apache 2.0!
Agent S2 is an open-source AI agent framework for autonomous computer and smartphone use. This new agent interacts directly with the GUI by capturing screenshots and translating user instructions into mouse and keyboard actions, eliminating the need for API access or accessibility trees.
Agent S2 claims state-of-the-art performance on key benchmarks, achieving 34.5% accuracy on OSWorld's 50-step evaluation (beating OpenAI's CUA/Operator) and 50% accuracy on AndroidWorld. Importantly, it's built on a modular architecture that's not just about raw power, but also about flexibility and continuous learning.
Key Highlights:
Brain-Inspired Design - Agent S2 combines specialized models for low-level execution with generalist models for high-level planning. This architecture allows different components to handle specific tasks they excel at.
Screenshot-Based Understanding - Agent S2 operates solely on raw screenshots as input. It uses dedicated visual grounding models to accurately identify and interact with UI elements (buttons, text fields, etc.) without requiring structured accessibility data.
Proactive Planning and Agentic Memory - Unlike current models that fix errors only after they occur, Agent S2 proactively updates its plans after each subtask. Its agentic memory mechanism also allows it to learn from past successes and failures, refining its strategies over time.
Open-source and Ready-to-Use - Agent S2 is available for immediate download from GitHub, with straightforward installation via pip. Developers can integrate it into their projects using the gui-agents SDK or run it directly from the command line. The framework supports various LLM providers including OpenAI, Anthropic, and local models (see the loop sketch below).
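To make the screenshots-in, clicks-out idea concrete, here's a generic sketch of the loop such agents run (an illustration of the pattern, not Agent S2's actual code). pyautogui handles the real mouse and keyboard; plan_next_action is a hypothetical stub standing in for Agent S2's grounding and planning models:

```python
# pip install pyautogui pillow
# Generic computer-use loop, illustrating the pattern Agent S2 implements.
import pyautogui

def plan_next_action(screenshot, instruction):
    """Stub: a real agent sends the screenshot to visual grounding and
    planning models and gets back a concrete GUI action. Hardcoded here."""
    return {"type": "click", "x": 100, "y": 200}

def run(instruction: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        shot = pyautogui.screenshot()            # raw pixels are the only input
        action = plan_next_action(shot, instruction)
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"])
        elif action["type"] == "done":           # planner signals completion
            break

run("Open the settings menu")
```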
Quick Bites
Google has released some very cool updates to its models:
Google released Gemma 3, a new family of lightweight, state-of-the-art open models, built using the same technology as Gemini 2.0. Gemma 3 comes in four sizes (1B, 4B, 12B, and 27B parameters); the 27B model ranks as the #2 open-source model on the LMArena leaderboard, right after DeepSeek R1, and the 1B model is optimized for on-device AI.
Multilingual: Supports over 35 languages out-of-the-box and has pretrained support for over 140 languages.
Multimodal (Text & Vision): Can analyze images, text, and short videos.
Context Window: 128k-token context window for handling large amounts of information.
Function Calling: Supports function calling for automation and agentic experiences.
Performance: Outperforms o1-preview, o3-mini, DeepSeek V3, and Llama models on LM Arena.
Availability: Try it in your browser in Google AI Studio, or download the weights from Hugging Face and Kaggle (a quick local-inference sketch follows this list).
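If you'd rather run Gemma 3 locally, here's a minimal text-only sketch using Hugging Face transformers. We're assuming the published "google/gemma-3-<size>-it" model ids (the models are gated, so accept the license on Hugging Face first):

```python
# pip install -U transformers accelerate
from transformers import pipeline

# Smallest instruction-tuned variant; swap in 4b/12b/27b if you have the VRAM
pipe = pipeline("text-generation", model="google/gemma-3-1b-it", device_map="auto")

messages = [{"role": "user", "content": "Explain function calling in one sentence."}]
out = pipe(messages, max_new_tokens=100)
print(out[0]["generated_text"][-1]["content"])  # last turn is the model's reply
```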
Gemini 2.0 Flash now natively supports image generation, available to all via an experimental release in the Gemini API and Google AI Studio. From simple prompts, Gemini 2.0 Flash can generate both images and text together (to create storyboards, for instance), edit specific parts of an image without disturbing other elements, and render long sequences of text within the image.
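Here's a quick sketch of calling the experimental image output from Python with the google-genai SDK. The model name and response_modalities config follow Google's announcement, but since this is an experimental release, double-check both against the current docs:

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # experimental image-output model (name may change)
    contents="Create a 3-panel storyboard of a fox learning to fly",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)
for i, part in enumerate(response.candidates[0].content.parts):
    if part.text:
        print(part.text)
    elif part.inline_data:  # image bytes come back inline alongside the text
        with open(f"panel_{i}.png", "wb") as f:
            f.write(part.inline_data.data)
```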
The Gemini API and AI Studio can now directly process YouTube URLs. You can pass a YouTube video’s link in your prompt, and the model uses its native video understanding capabilities to summarize, translate, or otherwise interact with the video content. You can process up to 8 hours of YouTube video per day, with at most 1 video per request.
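Passing a video is as simple as including its URL as a file part. A minimal sketch with the google-genai SDK (VIDEO_ID is a placeholder; verify the file_data pattern against the Gemini API docs):

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=types.Content(parts=[
        # The YouTube URL goes in as file data, the instruction as plain text
        types.Part(file_data=types.FileData(
            file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
        types.Part(text="Summarize this video in five bullet points."),
    ]),
)
print(response.text)
```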
Agno has open-sourced a Universal Agent Interface for MCP agents, a unified interface for interacting with multiple MCP servers and tools. It supports a range of LLMs from OpenAI, Anthropic, Google, and Groq. The agent analyzes your request and determines which MCP tools to use; it then connects to the appropriate MCP server, executes the necessary tools, and returns the results as a natural-language response.
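Here's roughly what wiring an Agno agent to an MCP server looks like. The MCPTools context-manager pattern and the filesystem server command follow Agno's documented examples as we understand them, so treat the exact imports and arguments as assumptions to verify:

```python
# pip install agno mcp openai
import asyncio
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.mcp import MCPTools

async def main():
    # Spawn a filesystem MCP server over stdio and expose its tools to the agent
    async with MCPTools("npx -y @modelcontextprotocol/server-filesystem .") as mcp:
        agent = Agent(model=OpenAIChat(id="gpt-4o"), tools=[mcp])
        await agent.aprint_response("List the files in this directory", stream=True)

asyncio.run(main())
```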
Perplexity's Sonar API now has an MCP server, letting AI agents and applications like Claude and Cursor run real-time web searches. This integration allows Claude to directly access Perplexity's search capabilities, providing accurate and up-to-date information within conversations.
Tools of the Trade
Coolify: Open-source, self-hostable alternative to Heroku/Netlify/Vercel to deploy and manage applications on your own servers with just an SSH connection. It provides robust DevOps features like Git integration, automatic SSL certificates, and real-time monitoring while ensuring no vendor lock-in.
Skymel OA: AI orchestration layer that automatically selects the optimal model for each request, distributes processing between cloud and device, and continuously optimizes performance, all through a simple API that replaces direct calls to various AI providers. It claims cost reductions of 40-95% and performance improvements of 2-10x.
Tempo: AI-powered visual IDE that combines VS Code-like code editing, Figma-like visual design tools, and AI to help teams build React applications 10x faster.
Awesome LLM Apps: Build awesome LLM apps with RAG, AI agents, and more to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos, and automate complex work.

Hot Takes
Last month I met with 12 AI startups. 11 will fail.
The trap? When I ask "What happens when OpenAI, Meta, or Google releases this as a feature next month?" the room goes silent. Most AI startups are building features, not companies.
Winners are tackling massive problems with solutions that happen to use AI. The best AI companies aren't "AI companies." ~ Itamar Novick

Interview template for those of you hiring software people:
"Here is a problem. Solve it and explain what you did. I don't care how you do it. Use AI, use a book, Google the solution. I don’t care. Go!"
LeetCode interviews in the age of AI are ridiculous. Stop that bullshit.
Candidates, if that's all the company has to offer, find a better company. ~ Santiago
That’s all for today! See you tomorrow with more such AI-filled content.
Don’t forget to share this newsletter on your social channels and tag Unwind AI to support us!
PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉