
Open-Source AI Model for Video Understanding

PLUS: GPT-4-level code assistance locally, chat with AI to build web apps

Today’s top AI Highlights:

  1. Alibaba open-sources new Qwen vision-language models with video understanding

  2. Get GPT-4-level code assistance locally in VS Code with KTransformers

  3. California State Assembly passes SB 1047 AI safety bill

  4. Apple and Nvidia might invest in OpenAI’s next funding round

  5. AI pair programmer right in your terminal

& so much more!

Read time: 3 mins

Latest Developments

Alibaba Cloud has released Qwen2-VL, a new suite of vision-language models built on its existing Qwen2 language models. Qwen2-VL boasts significant improvements over its predecessor, Qwen-VL, in image and video understanding, multilingual support, and even agent-based capabilities for device control. This release includes open-source versions of the 2B and 7B parameter models, along with API access to the powerful 72B parameter model.

Key Highlights:

  1. Performance - The 72B parameter model surpasses even closed-source models like GPT-4o and Claude 3.5 Sonnet in many visual understanding tasks, particularly in document comprehension.

  2. Multi-format and Multilingual - Qwen2-VL can handle images of various resolutions, videos up to 20 minutes long, and even multi-image inputs for comprehensive visual analysis. The model also understands text within images across multiple languages, including various European languages, Japanese, Korean, Arabic, and Vietnamese.

  3. Visual Agent Features - Qwen2-VL can function as an agent, using function calling to access external tools based on visual cues, enabling more interactive applications.

  4. Access - The 2B and 7B parameter models are open-sourced under the Apache 2.0 license, and API access is available for the 72B model through Alibaba Cloud's DashScope platform. A quick local-inference sketch follows these highlights.
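If you want to try the open weights yourself, here's a minimal sketch of video question-answering based on the model card's Hugging Face quickstart. It assumes the Qwen/Qwen2-VL-7B-Instruct checkpoint, the qwen-vl-utils helper package, and a transformers build recent enough to include Qwen2-VL support; the video path is a placeholder for your own clip:

```python
# pip install qwen-vl-utils (plus a transformers build with Qwen2-VL support)
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the open 7B instruct checkpoint (the 2B model works the same way)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Chat-style message pairing a video with a question;
# the file path is a placeholder for your own clip
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4",
         "max_pixels": 360 * 420, "fps": 1.0},
        {"type": "text", "text": "Describe this video."},
    ],
}]

# Build the prompt and pack the image/video tensors
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```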

KTransformers is a new framework for speeding up LLM inference in hardware-constrained environments. By injecting an optimized module with a single line of code, you get a Transformers-compatible interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI. This Python-centric framework offers extensive kernel optimizations and placement strategies for running high-performance models locally (a client sketch follows the highlights below).

Key Highlights:

  1. Speed Improvements with Sparse Attention - The latest updates to KTransformers enable the InternLM2.5-7B model to manage 1 million tokens using just 24GB of VRAM and 150GB of DRAM. With sparse attention, the model achieves a 5.65x speedup in token generation, reaching 27.49 tokens/second. This makes it highly efficient for running extensive LLM tasks locally on a desktop.

  2. Local Code Assistance - The DeepSeek-Coder-V2 model, optimized to run on as little as 11GB of VRAM, brings GPT-4-level coding support directly into VS Code. This lets you use powerful LLMs locally for coding assistance, integrated smoothly with tools like Tabby.

  3. Broad Model Support - KTransformers supports more models, including Mixtral 8×22B and Qwen2 57B, so you can choose the best one for your specific needs.
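Since the KTransformers server speaks the OpenAI API, any standard OpenAI client can talk to it. Here's a minimal sketch, assuming you've already started the server locally; the base URL, port, and model name are placeholders to adjust to your setup:

```python
# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at the local KTransformers server.
# The port and model name below are assumptions; match them to however
# you launched the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="DeepSeek-Coder-V2",  # whichever model your server is hosting
    messages=[{
        "role": "user",
        "content": "Write a Python function that checks whether a string is a palindrome.",
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```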

Quick Bites

  1. Google has rolled out prompt auto-saving and auto-naming in Google AI Studio. The feature is enabled by default and can be turned off in your settings.

  2. The U.S. AI Safety Institute has entered into a first-of-its-kind agreement with OpenAI and Anthropic to get access to their major new AI models before and after public release. The Institute plans to provide feedback to these companies on safety improvements.

  3. California’s legislature has passed the state’s first significant AI safety bill, SB 1047, which requires AI companies to implement strict safety measures before training advanced models. The bill is now with Governor Newsom for approval by the end of September.

  4. Apple and Nvidia are reportedly considering investing in OpenAI’s upcoming funding round, which could value the company at over $100 billion.

Tools of the Trade

  1. GPT Engineer: Build web apps by chatting with AI, which generates the code and provides a live preview. It integrates with GitHub for version control and lets you make changes using simple language.

  2. Edit by Resemble AI: Modify recordings as easily as you would a text document, using a chat interface. It also includes features like AI voice cloning, filler-word removal, and automatic audio enhancement.

  3. Aider: AI pair programming that lets you edit code directly in your terminal, working seamlessly with your local git repository. It supports various LLMs like GPT-4o and Claude 3.5 Sonnet (a quick scripting sketch follows this list).

  4. Awesome LLM Apps: Build awesome LLM apps using RAG to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos through simple text prompts. These apps let you retrieve information, chat with your data, and extract insights directly from content on these platforms.
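As a quick taste of Aider: beyond the interactive terminal chat, it also ships a small Python scripting API. A minimal sketch, assuming you're inside a git repo with an app.py and an OpenAI key in your environment (check aider's scripting docs for your installed version):

```python
# pip install aider-chat  (run inside a git repo, with OPENAI_API_KEY set)
from aider.coders import Coder
from aider.models import Model

# Pick the backing LLM; GPT-4o here, Claude 3.5 Sonnet also works
model = Model("gpt-4o")

# Attach aider to a file in the repo ("app.py" is a placeholder)
coder = Coder.create(main_model=model, fnames=["app.py"])

# Each run() applies the requested edit and commits it to git
coder.run("add type hints to every function in app.py")
```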

Hot Takes

  1. Big issue in organizations: They have put together elaborate rules for AI use focused on negative use cases.
    As a result, employees are too scared to talk about how they use AI, or to use corporate LLMs. They just become secret cyborgs, using their own AI & not sharing knowledge ~
    Ethan Mollick

  2. can someone who understands business economics better than i explain why oai doesn’t release q* for like $50 per prompt and why anthropic won’t release claude-3.5-deus for $500/1M tokens?
    upsides:
    >update public on capabilities
    >make a lot of money
    downsides:
    >? ~
    Aidan McLau

Meme of the Day

That’s all for today! See you tomorrow with more such AI-filled content.

Real-time AI Updates 🚨

⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!

Unwind AI - Twitter | LinkedIn | Instagram | Facebook

PS: We curate this AI newsletter every day for FREE. Your support is what keeps us going. If you find value in what you read, share it with at least one (or 20) of your friends!
