unwind ai
Build Computer and Mobile-use Agent for Free

PLUS: o1-level multimodal model with web search, DeepSeek R1 reasoning + Claude's coding capabilities

Shubham Saboo & Gargi Gupta
January 28, 2025

Today’s top AI Highlights:

Build an OpenAI Operator-like agent for computers and mobile phones for free
OpenAI o1-level multimodal model with web search - 100% free and unlimited usage
Fully open reproduction of DeepSeek-R1 by Hugging Face
Open-source Qwen model with 1M context window
Combine DeepSeek R1’s reasoning with Claude's creativity and code generation

& so much more!

Read time: 3 mins

AI Tutorials

Sales teams spend countless hours manually searching for and qualifying potential leads. This repetitive task not only consumes time but also results in inconsistent lead quality. Let’s automate this process to help sales teams focus on what matters most - building relationships and closing deals.

In this tutorial, we'll build an AI Lead Generation Agent that automatically discovers and qualifies potential leads from Quora. Using Firecrawl for intelligent web scraping, Phidata for agent orchestration, and Composio for Google Sheets integration, you'll create a system that can continuously generate and organize qualified leads with minimal human intervention.

Our lead generation agent will help sales teams identify potential customers who are actively discussing or seeking solutions in their target market.

We share hands-on tutorials like this 2-3 times a week, designed to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.

Build an AI Lead Generation Agent

Fully functional AI agent app with step-by-step instructions (100% open source)

Don’t forget to share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to support us!

Latest Developments

Qwen’s Vision Models for Computer and Mobile Use Agents 🖥️📱

Alibaba's Qwen team just released Qwen2.5-VL, a powerful vision-language model series that can control computers and phones through simple text commands. Available in 3B, 7B, and 72B sizes on Hugging Face and ModelScope, it lets you build AI agents that can browse the web, fill forms, and handle complex visual tasks.

What makes it stand out is that it's free to use, unlike OpenAI's similar Operator agent, which costs $200 a month. The 72B model shows impressive results in real-world tests, especially when working with documents and acting as a visual agent, and it does this right out of the box without extra training.

Key Highlights:

Interface Control - The model functions as a visual agent that can understand and execute commands to control computers and phones, handling tasks like booking tickets and filling forms. It processes screenshots and GUI elements to navigate interfaces naturally, adapting to different screen layouts and interface changes.
Built for Documents - Implements a custom HTML-based document parsing format for extracting layout information from diverse sources like research papers, magazines, and mobile screenshots. The model excels at understanding complex documents, charts, and structured data.
Handles Long Videos - Can process hour-long videos and pinpoint specific moments or events within them. Perfect for building apps that need to search through video content or extract information from specific time segments.
Performance on Visual Tasks - The 72B model performs on par with or outperforms state-of-the-art models like GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash in visual tasks, including document and diagram reading and visual agentic tasks.
Get Started Easily - Jump right in through Hugging Face or ModelScope. The model is optimized for efficiency while keeping high performance, so you can build and scale your applications without getting bogged down by complexity.
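To make the "Interface Control" idea above concrete, here is a minimal sketch of the observe-decide-act loop that a screenshot-driven agent typically runs. It is not Qwen2.5-VL's actual interface: the JSON action format and the names `Action`, `parse_action`, `run_agent`, and `scripted_model` are all hypothetical, and the model call is replaced by a stub so the example runs without any model or GUI.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Action:
    kind: str       # one of "click", "type", "done" (hypothetical action set)
    x: int = 0
    y: int = 0
    text: str = ""


def parse_action(raw: str) -> Action:
    """Parse a JSON reply from the model into an Action, rejecting unknown kinds."""
    data = json.loads(raw)
    kind = data.get("action")
    if kind not in {"click", "type", "done"}:
        raise ValueError(f"unsupported action: {kind!r}")
    return Action(
        kind=kind,
        x=int(data.get("x", 0)),
        y=int(data.get("y", 0)),
        text=str(data.get("text", "")),
    )


def run_agent(
    task: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes, List[Action]], str],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Loop: capture the screen, ask the model for one action, execute it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = parse_action(ask_model(task, screenshot, history))
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history


def scripted_model(replies: Iterable[str]) -> Callable[[str, bytes, List[Action]], str]:
    """Stand-in for a vision-language model: returns canned replies in order."""
    it = iter(replies)

    def ask(task: str, screenshot: bytes, history: List[Action]) -> str:
        return next(it)

    return ask
```

In a real agent, `take_screenshot` and `execute` would be backed by a desktop or mobile automation layer, and `ask_model` would send the screenshot and task to the vision model. The `max_steps` cap keeps a confused model from looping forever.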

Free OpenAI o1-level Multimodal Model with Web Search 💰🔍

If you thought Chinese companies were taking a pause after DeepSeek R1 and Qwen2.5-VL, you’re mistaken. China’s Moonshot AI has released Kimi 1.5, an ambitious multimodal AI model that achieves OpenAI o1-level performance while being completely free and unlimited to use.

The model introduces sophisticated chain-of-thought (CoT) reasoning, a 128k context window, and novel reinforcement learning techniques, leading to exceptional performance in mathematics, coding, and visual understanding tasks. Plus, it comes with integrated real-time web search functionality. In benchmarks spanning mathematical reasoning, coding challenges, and visual tasks, Kimi 1.5 consistently outperforms GPT-4o and Claude 3.5 Sonnet, while its long-form reasoning capabilities match OpenAI's o1 across the board.

Key Highlights:

New Learning Architecture - Utilizes an innovative partial rollout system that intelligently reuses previous reasoning paths instead of starting from scratch each time, significantly reducing computational overhead while maintaining high-quality outputs. This makes the model both efficient and cost-effective to run.
Multimodal Understanding - Seamlessly processes both text and visual inputs, enabling developers to build applications that reason across modalities. Excels at tasks like converting images to structured code, identifying locations from visuals, and performing detailed object analysis including color and quantity recognition.
Performance Edge - Demonstrates superior performance in short-form reasoning compared to GPT-4o and Claude 3.5 Sonnet, with up to 550% improvement in coding and mathematical tasks. For complex reasoning problems, the model matches o1's capabilities across mathematics, coding competitions, and visual reasoning benchmarks.
Zero-Cost Access - Kimi 1.5 is available completely free with unlimited usage through Kimi.ai.
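The partial rollout idea in the first highlight can be illustrated with a toy sketch. This is not Moonshot's training code: it only shows the bookkeeping, assuming each iteration caps generation at a fixed token budget and parks unfinished reasoning paths in a buffer to resume later. `Trajectory`, `rollout_iteration`, and `toy_next_token` are hypothetical names, and the "model" is a stub that emits five steps and then stops.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    prompt: str
    tokens: List[str] = field(default_factory=list)
    finished: bool = False


def rollout_iteration(
    prompts: List[str],
    unfinished: Dict[str, Trajectory],
    next_token: Callable[[str, List[str]], str],
    budget: int,
) -> List[Trajectory]:
    """Generate at most `budget` new tokens per prompt, resuming saved paths."""
    completed: List[Trajectory] = []
    for prompt in prompts:
        # Reuse the saved partial path if one exists, else start fresh.
        traj = unfinished.pop(prompt, None) or Trajectory(prompt)
        for _ in range(budget):
            tok = next_token(traj.prompt, traj.tokens)
            traj.tokens.append(tok)
            if tok == "<eos>":
                traj.finished = True
                break
        if traj.finished:
            completed.append(traj)
        else:
            unfinished[prompt] = traj  # park it for the next iteration
    return completed


def toy_next_token(prompt: str, tokens: List[str]) -> str:
    """Stub model: emits step0..step4, then an end-of-sequence marker."""
    return "<eos>" if len(tokens) >= 5 else f"step{len(tokens)}"
```

With a budget of 4, the first iteration leaves the path unfinished after four tokens, and the second iteration finishes it with two more. Restarting from scratch in the second iteration would have regenerated the first four tokens, which is the overhead this scheme avoids.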

Quick Bites

Hugging Face is leading an open-source effort to fully reproduce the DeepSeek R1 model’s training pipeline, including training code, evaluation methods, and synthetic data generation. The project makes DeepSeek's powerful reasoning model architecture accessible, with step-by-step instructions for replicating both the distillation process and reinforcement learning pipeline on common GPU setups.

Alibaba's Qwen team has open-sourced their Qwen2.5-1M models, featuring 7B and 14B parameter versions that can handle an impressive 1-million token context length. Alongside the models, they've released an optimized inference framework built on vLLM that processes these lengthy inputs 3-7x faster than standard approaches, making it practical to work with such extensive contexts.

OpenAI's o1 and DeepSeek R1 are gaining attention for their exceptional reasoning capabilities, and now you can bring similar step-by-step reasoning to any LLM with LLM-Reasoner, an open-source tool that adds structured thinking to your models. This MIT-licensed package lets you visualize your LLM's reasoning process in real-time, with features like confidence tracking, custom model registration, and out-of-the-box support for major LLM providers including OpenAI, Anthropic, and Google.

DeepSeek-R1 has demonstrated impressive capabilities in low-level optimization, contributing 99% of the code in a recent llama.cpp PR that doubles WASM execution speed through SIMD instruction optimization. The PR, which optimizes dot product functions for quantized operations, was produced by running carefully crafted prompts, with the model taking 3-5 minutes to generate each code solution.
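For readers unfamiliar with what a "dot product for quantized operations" computes, here is a plain-Python sketch of the arithmetic. It is not the code from the PR and uses no SIMD: it assumes a simplified 8-bit block format (32 values per block sharing one float scale), and `quantize` and `dot_quantized` are hypothetical names. The integer multiply-accumulate in the inner loop is the kind of work that SIMD instructions process many lanes at a time.

```python
from typing import List, Tuple

BLOCK = 32  # values per block (assumed for this sketch)

Block = Tuple[float, List[int]]  # (scale, 8-bit integer quants)


def quantize(values: List[float]) -> List[Block]:
    """Split values into blocks and store each as a scale plus int8 quants."""
    if len(values) % BLOCK != 0:
        raise ValueError("length must be a multiple of the block size")
    blocks: List[Block] = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127 if amax else 1.0  # map the largest value to 127
        quants = [round(v / scale) for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dot_quantized(a: List[Block], b: List[Block]) -> float:
    """Dot product of two quantized vectors, one block at a time."""
    total = 0.0
    for (scale_a, qa), (scale_b, qb) in zip(a, b):
        # Integer multiply-accumulate over the block, then one float rescale.
        int_sum = sum(x * y for x, y in zip(qa, qb))
        total += scale_a * scale_b * int_sum
    return total
```

Because the scales are applied once per block rather than once per element, almost all of the work is integer arithmetic, which is why these functions are a natural target for vectorization.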

Tools of the Trade

DeepClaude: A high-performance LLM inference API that combines DeepSeek R1's CoT reasoning capabilities with Anthropic Claude's creativity and code generation. It is 100% free and you use your own keys. The API wraps both the DeepSeek and Anthropic streaming APIs into one to leverage the strengths of both models.
Open Computer Use: Open-source project to enable secure AI control of virtual Linux computers through E2B's sandboxed environment, allowing LLMs like Llama 3 and OS-Atlas to operate a graphical desktop interface via keyboard, mouse, and shell commands.
LLMule: Open-source desktop application that provides a ChatGPT-like interface for running AI models locally or connecting to a P2P network of shared models. It integrates with tools like Ollama, LM Studio, vLLM and Exo.
Awesome LLM Apps: Build awesome LLM apps with RAG, AI agents, and more to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos, and automate complex work.
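The pattern behind a tool like DeepClaude, one model reasoning and a second model writing the final answer, can be sketched in a few lines. This is not DeepClaude's API: `two_stage_answer` and the prompt wording are hypothetical, and the two model calls are passed in as plain functions so the sketch stays independent of any provider SDK.

```python
from typing import Callable, Dict


def two_stage_answer(
    question: str,
    reason: Callable[[str], str],
    write: Callable[[str], str],
) -> Dict[str, str]:
    """Get reasoning from one model, then hand it to a second model to answer."""
    reasoning = reason(question)
    # The writer sees both the original question and the first model's reasoning.
    writer_prompt = (
        f"Question:\n{question}\n\n"
        f"Reasoning from another model:\n{reasoning}\n\n"
        "Write the final answer."
    )
    return {"reasoning": reasoning, "answer": write(writer_prompt)}
```

In practice, `reason` would wrap a call to a reasoning model and `write` a call to a generation model, each using your own API keys. Returning the reasoning alongside the answer keeps the intermediate step inspectable.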

Hot Takes

The most underrated aspect of DeepSeek is emergence.
Chain-of-thought, reflection, self-verification, and long CoTs all emerging naturally from simple RL over base model.
Reading this gave chills, in the same way when I read about “grokking” in GPT-3 where the model generalized suddenly at certain scale. ~
Amjad Masad
the line between researchers and developers has never been more blurry than in the age of AI
if you’re a researcher, it’s time to build
if you’re a developer, start experiments ~
Logan Kilpatrick

That’s all for today! See you tomorrow with more such AI-filled content.

Unwind AI - X | LinkedIn | Threads | Facebook

Awesome LLM Apps | Sponsor Us

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉
