ModernBERT for Faster RAG

PLUS: OpenAI o3 models, Cursor's YOLO Mode

Today’s top AI Highlights:

  1. Run your AI code in the cloud with just a few Python decorators

  2. This open-source drop-in replacement for BERT makes RAG faster and better

  3. Cursor’s new YOLO mode can autonomously run terminal commands

  4. OpenAI's new o3 models set new benchmarks in reasoning, math and coding

  5. Open-source tool to visually test and analyze AI agent traces 

& so much more!

Read time: 3 mins

AI Tutorials

Building powerful RAG applications has often meant trading off between model performance, cost, and speed. Today, we're changing that by using Cohere's newly released Command R7B model - their most efficient model that delivers top-tier performance in RAG, tool use, and agentic behavior while keeping API costs low and response times fast.

In this tutorial, we'll build a production-ready RAG agent that combines Command R7B's capabilities with Qdrant for vector storage, Langchain for RAG pipeline management, and LangGraph for orchestration. You'll create a system that not only answers questions from your documents but intelligently falls back to web search when needed.

Command R7B brings an impressive 128k context window and leads the HuggingFace Open LLM Leaderboard in its size class. What makes it particularly exciting for our RAG application is its native in-line citation capabilities and strong performance on enterprise RAG use-cases, all with just 7B parameters.
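To give a flavor of the stack before the full walkthrough, here’s a minimal sketch of the retrieval core. It assumes the langchain-cohere and langchain-qdrant packages, a COHERE_API_KEY in your environment, and the "command-r7b-12-2024" model ID (verify the exact name in Cohere’s docs); the full tutorial layers LangGraph orchestration and the web-search fallback on top:

```python
# Minimal sketch, not the full tutorial code. Assumes langchain-cohere,
# langchain-qdrant, and the "command-r7b-12-2024" model ID.
from langchain_cohere import ChatCohere, CohereEmbeddings
from langchain_core.documents import Document
from langchain_qdrant import QdrantVectorStore

llm = ChatCohere(model="command-r7b-12-2024")
embeddings = CohereEmbeddings(model="embed-english-v3.0")

# Index a toy document in an in-memory Qdrant collection
docs = [Document(page_content="Command R7B is Cohere's most efficient R-series model.")]
store = QdrantVectorStore.from_documents(
    docs, embeddings, location=":memory:", collection_name="kb"
)

# Retrieve context and answer with Command R7B
question = "Which Cohere model powers this RAG agent?"
context = "\n".join(d.page_content for d in store.similarity_search(question, k=2))
print(llm.invoke(f"Context:\n{context}\n\nQuestion: {question}").content)
```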

We share hands-on tutorials like this 2-3 times a week, designed to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills, subscribe now and be the first to access our latest tutorials.

Don’t forget to share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to support us!

Latest Developments

Beam is a platform that simplifies running cloud-based AI workloads and introduces a unique approach to building stateful, multi-tasking AI agents. It lets you deploy serverless functions and web endpoints with simple Python decorators, eliminating the overhead of managing complex infrastructure or wrestling with Docker.

Beam also provides global managed GPU access and distributed storage, all accessed through a Python-first interface. What's particularly notable is its new framework for building agents that are concurrent by design, along with built-in support for fast cold starts, making it feasible to move complex, multi-threaded workflows directly into cloud environments.

Key Highlights:

  1. Agent Framework with Concurrency - Beam's agent framework lets you define complex, multi-tasking agent workflows. Using decorators like @bot.transition, you can build agents with stateful behavior that can run multiple tasks in parallel, making them more powerful than traditional DAG-based agent implementations.

  2. Python Agent Development - Like its other features, Beam's agent framework is highly Python-friendly. You define agents using standard Python code and Beam decorators. Focus on the logic and functionality of your agent, without needing to know the underlying infrastructure or container configuration.

  3. Integrated Cloud Infrastructure - Beam manages the underlying infrastructure, including global GPU access and distributed storage. You can specify the necessary resources for each agent transition, and transitions have access to a concurrency-safe distributed queue and dictionary. This lets your agents run reliably and scale automatically without manual intervention.

  4. Simple Deployment and Management - With Beam, you can deploy and test agents using familiar commands like beam serve for development and beam deploy for production, as sketched below. You can monitor and manage your running agents through Beam’s dashboard, gaining insight into the state of your workflow along with access to event logs.
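To make the decorator-first model concrete, here’s a minimal sketch using the endpoint decorator from Beam’s Python SDK. The resource values are illustrative, and the agent framework’s @bot.transition decorator follows the same pattern of declaring infrastructure on plain Python functions (see Beam’s docs for its exact signature):

```python
# Minimal sketch of Beam's decorator-driven deployment, assuming the beam SDK.
# Resource values are illustrative; the agent framework's @bot.transition
# follows the same pattern of attaching infra to plain Python functions.
from beam import Image, endpoint

@endpoint(
    name="summarize",
    cpu=1,
    memory="2Gi",
    image=Image(python_packages=["transformers", "torch"]),
)
def summarize(text: str) -> dict:
    # Plain Python body; Beam provisions the container, scaling, and routing.
    return {"summary": text[:200]}
```

From there, beam serve app.py:summarize spins up a live development endpoint, and beam deploy app.py:summarize promotes it to production.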

The long-standing champion of encoder models, BERT, finally has a successor: ModernBERT. Released by Answer.AI and LightOn, this new family of models promises a significant performance boost across many tasks including retrieval, classification, and code-related understanding, while also boasting a much longer context length of 8192 tokens.

ModernBERT is designed to be a direct, drop-in replacement: you can swap out your existing BERT-based models for it without major overhauls. With both a base (139M params) and large (395M params) version available, it has already shown benchmark-beating speed and efficiency improvements.

Key Highlights:

  1. Drop-in Replacement - ModernBERT is built to easily replace existing BERT or RoBERTa models in your workflows. It isn't just a slightly tweaked model; it offers a distinct jump in performance and speed, especially on long-context and mixed-length input processing, without requiring any architecture change to your existing systems. This will be especially relevant for tasks like RAG, classification, and feature embeddings.

  2. Extended Context Length - The 8192-token context length opens the door to a new class of encoder-based applications: full-document RAG, large-scale code analysis, and smarter AI-powered IDE features all become feasible.

  3. Code Processing Capabilities - ModernBERT is the first encoder model trained on substantial code data, scoring over 80% on the StackOverflow-QA dataset. This enables new applications like enterprise-wide code search and multi-repository feature analysis, with specialized understanding of programming contexts.

  4. Speed and Efficiency Focus - ModernBERT is optimized for consumer-grade GPUs (e.g. RTX 4090) and achieves significant speedups in both inference and training through innovative changes such as alternating attention, unpadding, and sequence packing. This will translate to lower cost and faster performance on local machines or in server environments.

  5. Availability - ModernBERT is available in both base and large sizes as a Masked Language Model (MLM), accessible via the Hugging Face Transformers library. You can use these models directly for fill-mask tasks, as in the sketch below, or fine-tune them for downstream tasks such as classification, retrieval, or Q&A.
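As a quick taste, here’s a minimal fill-mask sketch; it assumes the answerdotai/ModernBERT-base checkpoint ID on Hugging Face and a transformers release recent enough to include ModernBERT support:

```python
# Minimal fill-mask sketch; assumes the answerdotai/ModernBERT-base checkpoint
# and a transformers version recent enough to support ModernBERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for pred in fill_mask("Paris is the [MASK] of France."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```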

Quick Bites

If you were hoping OpenAI would close out its 12-day announcement series with GPT-5, there’s bad news and good news! Not GPT-5, but OpenAI debuted the o3 model series, the successor to its o1 reasoning models. o3 and o3-mini are frontier models that build on o1: o3 is designed for complex tasks that require advanced reasoning, while o3-mini balances high performance with lower cost.

  1. Performance Improvements - o3 achieves a state-of-the-art 71.7% accuracy on SWE-bench Verified and an Elo of 2727 on Codeforces competition code, both representing over 20% improvement over o1.

  2. Flexible Reasoning - o3-mini will ship with a new reasoning-effort setting (low, medium, and high) that lets you tune the model’s computation for the right performance-cost tradeoff; a speculative sketch of how this might look in the API follows this list.

  3. Early Access - The o3 family is opening up first for public safety testing, with o3-mini expected to fully launch around the end of January and o3 shortly after.
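Since o3-mini hasn’t shipped yet, here’s a speculative sketch of how the reasoning-effort setting might surface, modeled on OpenAI’s existing chat completions interface; the model name and the reasoning_effort parameter are assumptions until the actual release:

```python
# Speculative sketch only: "o3-mini" and reasoning_effort are assumptions
# modeled on OpenAI's existing chat completions API, not a released interface.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3-mini",                 # assumed model ID
    reasoning_effort="high",         # assumed values: "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)
```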

Cursor’s AI agent just got a massive upgrade. It can now do background command processing and exit-code monitoring, and it has new commands like @docs and @git for expanded development workflows. There’s a new YOLO mode in which the AI agent can autonomously execute terminal commands. Rounding out the release are welcome improvements including a snappier bug-finding model, persistent Composer states, and parallel editing capabilities that make the IDE feel more responsive.

Supabase released version 2 of database.build, their AI-powered Postgres sandbox that runs directly in your browser, now featuring a bring-your-own-LLM option that lets you connect your preferred OpenAI-compatible language models. The update removes the need for GitHub login and rate limits when using your own LLM, giving you more control and ensuring your chat messages go only to providers you trust.

Tools of the Trade

  1. Explorer: Open-source observability tool to analyze AI agent traces through a visual interface, making it easier to debug agent behavior and identify failure points. It allows you to visualize traces step-by-step, add annotations, filter content, and share traces with your team for collaborative debugging.

  2. Sidekick: CLI tool that automates deploying multiple applications on a single VPS, handling everything from initial server setup to zero-downtime deployments with built-in features like SSL certificates, load balancing, and secret management - imagine running your own fly.io-like platform but on your personal VPS.

  3. Codegen by Groq: Generates complete JavaScript micro-applications in milliseconds with simple natural language prompts, powered by Llama 3.3 70B with speculative decode. You can use voice, vision, or text input to instantly generate, modify, and share JavaScript widgets.

  4. Awesome LLM Apps: Build awesome LLM apps with RAG, AI agents, and more to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos, and automate complex work.

Hot Takes

  1. Feels like more of the same with LLMs for over a year now... Small improvement here and there but trade off somewhere else (it got worse) ~
    anton

  2. If you are not planning for the price of intelligence to go to zero, the next 3-5 years are going to be incredibly disruptive to your business / life.
    This is the main idea for the rest of the decade, buckle up. ~
    Logan Kilpatrick

That’s all for today! See you tomorrow with more such AI-filled content.

Don’t forget to share this newsletter on your social channels and tag Unwind AI to support us!

Unwind AI - X | LinkedIn | Threads | Facebook

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉 
