RAG for Videos

PLUS: Lightweight multi-agent framework, o1-preview-level model for under $450

Today’s top AI Highlights:

  1. RAG on video content directly, not just transcripts

  2. Lightweight open-source framework for multi-agent apps

  3. Mistral releases a better, faster version of its Codestral model

  4. Train your own o1-preview-level model for under $450

  5. Open-source version of Perplexity

& so much more!

Read time: 3 mins

AI Tutorials

The demand for AI-powered data visualization tools is surging as businesses seek faster, more intuitive ways to understand their data. We can tap into this growing market by building our own AI-powered visualization tools that integrate seamlessly with existing data workflows.

In this tutorial, we'll build an AI Data Visualization Agent using Together AI's powerful language models and E2B's secure code execution environment. This agent will understand natural language queries about your data and automatically generate appropriate visualizations, making data exploration intuitive and efficient.

E2B is an open-source infrastructure that provides secure sandboxed environments for running AI-generated code. Using E2B's Python SDK, we can safely execute code generated by language models, making it perfect for creating an AI-powered data visualization tool.
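
Here's a minimal sketch of the core loop, assuming E2B's e2b_code_interpreter Python SDK and Together's chat completions client; the model choice, prompts, and file paths are illustrative, not prescriptive:

```python
# pip install together e2b-code-interpreter
import os
from together import Together
from e2b_code_interpreter import Sandbox

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def visualize(question: str, csv_path: str) -> None:
    # 1) Ask the LLM to write plotting code for the user's question
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative choice
        messages=[
            {"role": "system", "content": (
                "Write Python using pandas and matplotlib that loads /data.csv "
                "and answers the user's question with a chart. Return only code."
            )},
            {"role": "user", "content": question},
        ],
    )
    code = response.choices[0].message.content
    # (production code would also strip markdown fences from the reply)

    # 2) Execute the generated code inside an isolated E2B sandbox,
    #    so nothing the model writes ever runs on your machine
    sandbox = Sandbox()
    try:
        with open(csv_path, "rb") as f:
            sandbox.files.write("/data.csv", f.read())
        execution = sandbox.run_code(code)
        print(execution.logs)  # stdout/stderr; charts come back in execution.results
    finally:
        sandbox.kill()

visualize("Plot monthly revenue as a bar chart", "revenue.csv")
```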

We share hands-on tutorials like this 2-3 times a week, designed to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.

Don’t forget to share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to support us!

Latest Developments

Creating a RAG system for videos can be a development nightmare: meticulous frame sampling, visual feature extraction, careful time-based chunking, and complex transcript alignment—all before you even get to indexing. The diverse video formats and the need to handle visual, temporal, and textual information further amplify this challenge.

VideoRAG is an end-to-end framework that alleviates this burden. Instead of building everything from scratch, it uses Large Video Language Models (LVLMs) to handle video understanding, retrieval, and answer generation. It dynamically retrieves videos based on their relevance to the query, then uses both the visual and textual information in those videos to generate the answer (a minimal pipeline sketch follows the highlights below).

Key Highlights:

  1. On-the-Fly Video Selection - VideoRAG uses LVLMs to analyze both the query and the video content, selecting only the most relevant videos from a large corpus, so your app surfaces the right information automatically.

  2. Unified Modality Encoding - With VideoRAG you don't have to worry about creating separate components or algorithms for analyzing different parts of video data. It uses the pre-trained capabilities of LVLMs to handle and process visual frames and textual data with a single encoding process.

  3. Model Integration - Supports different LVLMs for retrieval and generation tasks, so you can optimize for your specific requirements: for example, you can use models specialized in semantic alignment for retrieval while employing more advanced models for generation.

  4. Integration with ASR - No existing video transcripts? No problem. The framework uses an automatic speech recognition (ASR) model to generate transcripts from the audio, so your app can handle any video whether or not it has subtitles.
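
To make the flow concrete, here's a rough sketch of the retrieve-then-generate loop; embed and lvlm_generate are hypothetical stand-ins for LVLM calls, not VideoRAG's actual API:

```python
import numpy as np

# Placeholder stubs: a real system would call an LVLM here (hypothetical names)
def embed(text: str) -> np.ndarray:
    """Stand-in for an LVLM encoder producing a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def lvlm_generate(query: str, context: list) -> str:
    """Stand-in for LVLM generation over retrieved frames + transcripts."""
    return f"[answer to {query!r} grounded in {len(context)} video(s)]"

def retrieve(query: str, corpus: list[dict], k: int = 2) -> list[dict]:
    """Rank every video by relevance to the query and keep the top-k."""
    q = embed(query)
    return sorted(corpus, key=lambda v: float(np.dot(q, v["embedding"])),
                  reverse=True)[:k]

def answer(query: str, corpus: list[dict]) -> str:
    videos = retrieve(query, corpus)
    # Pass both modalities to the generator: sampled frames (visual) and the
    # transcript (textual), produced by ASR when no subtitles exist.
    context = [(v["frames"], v["transcript"]) for v in videos]
    return lvlm_generate(query, context)

corpus = [
    {"embedding": embed("whisking eggs"), "frames": ["f1.jpg"], "transcript": "..."},
    {"embedding": embed("bike chain repair"), "frames": ["f2.jpg"], "transcript": "..."},
]
print(answer("How do I whisk eggs properly?", corpus))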

Orchestra is a new open-source agentic framework that puts tasks at the center of AI workflows instead of conversation patterns. Built with a modular architecture, it lets you create LLM-driven pipelines and multi-agent teams while keeping things lightweight with minimal dependencies.

What makes Orchestra stand out is its task-centric design that mirrors how real organizations work - you can define discrete units of work with clear inputs and outputs, similar to standard operating procedures. The framework exposes all its prompts and maintains a flat hierarchy, giving you full visibility and control over how your AI systems operate (a minimal starter sketch follows the highlights below).

Key Highlights:

  1. Tool Integration - Orchestra's tool system lets agents dynamically select and use tools to solve problems. Agents can make parallel tool calls to gather different pieces of information simultaneously, execute tools in sequence where one tool's output feeds into another, or use tools recursively to refine results.

  2. Multi-Agent Orchestration - The framework handles complex agent interactions. Agents can delegate tasks, maintain conversation histories, and pass data between each other while keeping track of dependencies. You can build hierarchical structures where agents coordinate across departments - perfect for scaling up from simple workflows to enterprise-grade systems.

  3. Production-Ready Features - Orchestra comes with built-in tools for everything from file operations to API integrations. Error handling, retries, and maximum iteration limits are baked in to prevent issues like infinite loops. The system supports streaming responses and maintains clear audit trails of agent actions.

  4. Developer-First Design - Getting started is straightforward with minimal boilerplate code. Orchestra works with popular models from OpenAI, Anthropic, Groq and others through a consistent interface. The modular architecture lets you swap components and add new tools without touching core functionality.
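
A starter along these lines, adapted from Orchestra's task-centric pattern; treat the package name, model handle, and tool names below as assumptions to verify against the repo:

```python
# pip install mainframe-orchestra   (install name is an assumption; see the repo)
from mainframe_orchestra import Agent, Task, OpenaiModels, WebTools

# Agents are defined by a role and goal rather than a chat persona
researcher = Agent(
    agent_id="researcher",
    role="research assistant",
    goal="answer questions with sourced, up-to-date information",
    llm=OpenaiModels.gpt_4o,       # swappable for Anthropic, Groq, etc.
    tools={WebTools.exa_search},   # assumed tool name; agents pick tools dynamically
)

# A task is a discrete unit of work with a clear instruction and output,
# like a standard operating procedure
def research(topic: str) -> str:
    return Task.create(
        agent=researcher,
        instruction=f"Research {topic} and summarize the key findings.",
    )

print(research("task-centric multi-agent frameworks"))
```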

Quick Bites

Mistral AI just released Codestral 25.01, a significantly faster coding model available now in Continue.dev and soon on other platforms. The new version boasts an improved architecture and tokenizer, doubling its speed for code generation and completion tasks, and achieving SOTA performance in fill-in-the-middle (FIM) scenarios.

  • Supports 80+ programming languages, making it a versatile option for varied projects.

  • Has a 256k context window to understand and generate code based on larger codebases.

  • Currently ranked #1 on the LMSYS Copilot Arena leaderboard.
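
FIM means the model fills in code between an existing prefix and suffix rather than only continuing left-to-right. A quick sketch with Mistral's Python client (the model alias is an assumption):

```python
# pip install mistralai
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Fill-in-the-middle: the model writes only the code between prompt and suffix
response = client.fim.complete(
    model="codestral-latest",   # assumed alias pointing at Codestral 25.01
    prompt="def fibonacci(n: int) -> int:\n",
    suffix="\n\nprint(fibonacci(10))",
    max_tokens=128,
)
print(response.choices[0].message.content)
```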

LlamaIndex released vdr-2b-multi-v1, an embedding model for visual document retrieval across multiple languages and domains. It encodes screenshots of document pages into dense single-vector representations, letting you search and query visually rich multilingual documents without OCR, data extraction pipelines, or chunking. The model is available on Hugging Face.

UC Berkeley researchers released Sky-T1, a 32B o1-preview-comparable reasoning model trained for under $450. Built by fine-tuning Qwen2.5-32B-Instruct on 17k high-quality examples, the model performs on par with o1-preview on popular reasoning and coding benchmarks. All the details (data, code, and model weights) have been open-sourced.

OpenAI has released an updated function calling guide, now 50% shorter and clearer. The guide includes new best practices emphasizing software engineering principles for defining functions, such as making them obvious and using code for tasks instead of relying on the model. It also has in-doc function generation and a complete example using a weather API.
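
In that spirit, here's a minimal weather-function definition in the guide's vein; the schema below is our sketch, not the doc's verbatim example:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()

# Per the guide: name and describe the function so its intent is obvious,
# and constrain arguments with enums where possible
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # the model's proposed function call
```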

Tools of the Trade

  1. Open WebUI: Self-hosted UI for running LLMs (like Ollama models) and interacting with OpenAI-compatible APIs. It supports RAG and has features for function calling, web browsing, and code execution, all configured via a pipeline system.

  2. Scira: Open-source, AI-powered search engine built with Next.js and the Vercel AI SDK. You can search the web, specific URLs, and get information on current weather, maps, YouTube videos, and more, using models like Grok 2.0.

  3. Lopus AI: A React SDK that produces custom front-end components on the fly, using developer-defined tools. Instead of forcing everyone into the same layout, Lopus lets users ask for the experience they want—no more digging through pages and menus.

  4. Awesome LLM Apps: A curated collection of LLM apps built with RAG and AI agents that interact with data sources like GitHub, Gmail, PDFs, and YouTube videos, and automate complex work.

Hot Takes

  1. Do yourself a favor: don't use chat LLMs for learning directly from them. You will learn falsehoods about all topics, and this is what will shape who you are because we are the information we consume. ~
    Andriy Burkov

  2. If you have “AGI” why are you selling it through an API? ~
    anton

That’s all for today! See you tomorrow with more such AI-filled content.

Don’t forget to share this newsletter on your social channels and tag Unwind AI to support us!

Unwind AI - X | LinkedIn | Threads | Facebook

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉 
