AI Agent for Writing CUDA Kernels
PLUS: Foundation model for multimodal AI agents, Infinite context for your LLM apps
Today’s top AI Highlights:
AI Agent writes CUDA kernels that run up to 100x faster than PyTorch
China’s new research makes LLMs and Agents process entire codebases without context limits
Google’s PaliGemma 2 for multi-task vision capabilities out-of-the-box
Microsoft’s foundation model for multimodal AI agentic tasks
No-code tool to generate production-ready backend APIs in seconds
& so much more!
Read time: 3 mins
AI Tutorials
Finding the perfect property involves sifting through countless listings across multiple websites, analyzing location trends, and making informed investment decisions. For developers and real estate professionals, automating this process can save hours of manual work while providing deeper market insights.
In this tutorial, we'll build an AI Real Estate Agent that automates property search and market analysis. It helps users find properties matching their criteria while providing detailed location trends and investment recommendations. This agent streamlines the property search process by combining data from multiple real estate websites and offering intelligent analysis.
Tech Stack:
Firecrawl's Extract Endpoint to collect structured data from websites
Agno (formerly Phidata) for building the AI agent
OpenAI GPT-4o as the LLM
Streamlit for a clean, interactive web interface
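If you want a feel for how these pieces snap together before the full tutorial, here's a minimal sketch. The listing URL, prompts, and key are placeholders, and the exact Agno/Firecrawl call signatures may differ slightly between library versions:

```python
# Minimal sketch of the agent's wiring (hypothetical URL and prompts;
# verify the Agno and firecrawl-py APIs against your installed versions).
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key="fc-...")  # your Firecrawl API key

# 1) Pull structured listings from a property site with the Extract endpoint.
listings = firecrawl.extract(
    ["https://www.example-realty.com/city/listings"],  # hypothetical listing page
    {"prompt": "Extract each property's address, price, size, and listing URL."},
)

# 2) Hand the structured data to an Agno agent for analysis.
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    instructions="You are a real estate analyst. Compare listings, "
                 "summarize location trends, and flag good investments.",
    markdown=True,
)
agent.print_response(f"Analyze these listings and recommend the best buys:\n{listings}")
```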
We share hands-on tutorials like this 2-3 times a week, designed to help you stay ahead in the world of AI. If you're serious about leveling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.
Latest Developments

China’s Moonshot AI has released a new attention mechanism called Mixture of Block Attention (MoBA) that lets LLMs handle incredibly long contexts. This open-source innovation allows models to process entire codebases or documents in one go, effectively eliminating the context window limitations we face today.
MoBA works by splitting content into blocks and letting each query token focus only on relevant sections, similar to how Mixture of Experts operates. The framework processes 10M tokens 16x faster than traditional attention while maintaining comparable performance, and it can seamlessly switch between sparse and full attention modes during operation.
Key Highlights:
Massive Context Handling - MoBA processes huge inputs (entire codebases, long documents) by dividing them into blocks and using a parameter-less gating mechanism. This selects only the relevant blocks for each query, slashing computational needs.
Direct Replacement for Standard Attention - MoBA is a drop-in replacement for standard attention in Transformers. Integrate it into existing models without major architectural changes.
Hybrid Attention Flexibility - Seamlessly switch between MoBA's sparse attention and full attention during training or inference. This lets you optimize for speed or precision – use full attention on critical tokens/layers, MoBA elsewhere.
Comparable Performance, Huge Speed Gains - MoBA matches full attention's performance on benchmarks, but is much faster. Moonshot AI reports up to 16x speedups for 10M tokens, reducing compute costs and enabling previously impossible tasks.
Open-Source - The code is available here. Its parameter-less gating system keeps models lightweight while automatically finding the best block selection patterns for your data.
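To make the block-gating idea concrete, here's a toy, non-causal PyTorch sketch of attention with parameter-less block selection. It is a simplification in the spirit of MoBA, not Moonshot's implementation (the real thing adds causal masking, always includes the current block, and runs on optimized kernels):

```python
# Toy block attention with parameter-less gating: each query scores blocks
# by dotting with mean-pooled block keys, then attends only to the top-k blocks.
import torch
import torch.nn.functional as F

def moba_like_attention(q, k, v, block_size=64, top_k=3):
    # q, k, v: (seq_len, dim); seq_len assumed divisible by block_size
    seq_len, dim = k.shape
    n_blocks = seq_len // block_size
    k_blocks = k.view(n_blocks, block_size, dim)
    v_blocks = v.view(n_blocks, block_size, dim)

    # Parameter-less gating: score each block by q . mean-pooled block keys.
    block_keys = k_blocks.mean(dim=1)              # (n_blocks, dim)
    gate = q @ block_keys.T                        # (seq_len, n_blocks)
    top_blocks = gate.topk(top_k, dim=-1).indices  # (seq_len, top_k)

    out = torch.empty_like(q)
    for i in range(seq_len):
        sel_k = k_blocks[top_blocks[i]].reshape(-1, dim)  # only selected blocks
        sel_v = v_blocks[top_blocks[i]].reshape(-1, dim)
        attn = F.softmax(q[i] @ sel_k.T / dim ** 0.5, dim=-1)
        out[i] = attn @ sel_v
    return out

q = k = v = torch.randn(256, 32)
print(moba_like_attention(q, k, v).shape)  # torch.Size([256, 32])
```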

Sakana AI has developed a new agentic AI system, The AI CUDA Engineer, that creates and optimizes CUDA kernels. This AI agent specializes in GPU optimization, tackling one of the biggest challenges in machine learning: making models run faster on NVIDIA hardware.
The system leverages evolutionary algorithms to not only convert PyTorch code to CUDA but also significantly improve its performance, often achieving speedups of 10-100x. The company is also open-sourcing a massive dataset of over 17,000 verified CUDA kernels it generated, which will save you a lot of development time or, at the very least, give you inspiration for interesting kernel optimizations.
Key Highlights:
Agent-Driven Optimization - The AI agent breaks down complex CUDA optimization into manageable steps - first translating PyTorch code to CUDA, then continuously improving performance through evolutionary learning. This means you can focus on building your ML models while the agent handles the low-level GPU optimization.
Production-Ready Performance - The system consistently delivers speedups that matter, outperforming PyTorch native runtimes in 81% of operations tested. Many of the generated kernels run at least 2x faster than standard implementations, with some achieving up to 5x speedups over existing production CUDA kernels.
Smart Learning System - The agent doesn't just optimize - it learns and adapts. Using evolutionary techniques, it combines successful kernels to create even better ones, while maintaining an archive of proven optimizations. This means the system gets smarter with each optimization task it tackles.
Developer Resources - Jump start your optimization work with access to 17,000+ verified CUDA kernels through the open dataset. Each kernel comes with reference implementations, detailed profiling data, and clear performance metrics so you can evaluate and integrate them into your projects.
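This is not Sakana's code, but the inner loop of any kernel-evolving agent is easy to picture: verify each candidate against the PyTorch reference for correctness, then benchmark both. A minimal sketch, using torch.compile as a stand-in for an agent-generated kernel (requires a CUDA GPU):

```python
# Verify-then-benchmark loop: the gatekeeping step before any candidate
# kernel is allowed into the archive of proven optimizations.
import torch

def reference(x):
    return torch.relu(x) * 2.0          # stand-in PyTorch op to optimize

candidate = torch.compile(reference)    # stand-in for an agent-generated kernel

def bench(fn, x, iters=100):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn(x)                               # warmup (also triggers compilation)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

x = torch.randn(4096, 4096, device="cuda")
assert torch.allclose(reference(x), candidate(x), atol=1e-4)  # correctness first
print(f"speedup: {bench(reference, x) / bench(candidate, x):.2f}x")
```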
Quick Bites
Google has released PaliGemma 2 mix, a pre-tuned version of their vision-language model that handles multiple tasks out-of-the-box, including image captioning, OCR, object detection, and segmentation - all within a single model. Available in three sizes (3B, 10B, and 28B parameters), these models integrate seamlessly with Hugging Face Transformers, Keras, and PyTorch.
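Getting started takes a few lines of Transformers code. A minimal captioning sketch follows; the checkpoint id and prompt format are our best reading of the release, so verify them on the Hugging Face model page:

```python
# Caption an image with a PaliGemma 2 mix checkpoint (assumed model id;
# swap "caption en" for "ocr" or "detect <object>" to change tasks).
import requests
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-mix-448"   # 3B mix variant, 448px input
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = PaliGemmaProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
inputs = processor(text="<image>caption en", images=image,
                   return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```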
Figure Robotics, after ending its partnership with OpenAI, has unveiled Helix, a first-of-its-kind "System 1, System 2" Vision-Language-Action (VLA) model for humanoid robots, built fully in-house. Helix is impressively capable: it controls a humanoid's full upper body, performs collaborative tasks between multiple robots, and grasps a wide variety of novel objects using only natural language commands.
System 2 is a surprisingly small 7B-parameter VLM which, when combined with System 1 (a fast, reactive visuomotor policy), enables real-time, adaptable control directly on the robot's hardware.
Microsoft Research has released Magma, a groundbreaking foundation model for multimodal AI agents that bridges the gap between understanding and action in both digital and physical environments. The model can interpret multimodal inputs and execute complex tasks ranging from UI navigation to robot manipulation, demonstrating state-of-the-art performance. The MSR team will be releasing the code, model, and UI navigation demo on the 25th of this month.
DeepAuto.ai has developed InfiniteHiP, an inference framework that allows LLMs to process up to 3 million tokens on a single 48GB GPU - triple the typical capacity - through innovative token pruning and memory management techniques. The framework achieves nearly 19x faster attention processing for million-token contexts without requiring additional training, while preserving full context information.
Tools of the Trade
Devv Builder: A no-code platform that converts natural language descriptions into production-ready backend APIs, handling everything from code generation and database setup to testing and deployment in a serverless environment.
DeepEval: Open-source framework for evaluating and unit-testing LLM outputs, similar to Pytest but specialized for LLMs. It incorporates the latest research to score outputs on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, and can use LLMs running locally on your machine as the evaluator (see the sketch after this list).
apple-mcp: A Model Context Protocol server that gives LLMs access to Apple's native applications like Contacts, Notes, and iMessages. With a simple configuration, it lets models like Claude retrieve notes and send messages; the server itself runs with a single bun command.
TensorPool: CLI tool that lets you deploy ML training jobs directly from your IDE to cloud GPUs across multiple providers, handling all infrastructure orchestration. It saves ~50% of your cost through spot instance management and real-time price optimization.
Awesome LLM Apps: Build awesome LLM apps with RAG, AI agents, and more to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos, and automate complex work.
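Here's the DeepEval sketch promised above: a minimal Pytest-style unit test, adapted from the project's documented usage (run it with deepeval test run):

```python
# A single LLM unit test: the metric is LLM-scored, and threshold
# is the pass/fail cutoff for the assertion.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```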

Hot Takes
For better & worse:
The labs are making AI models, they do not have use cases for the models; that is for users (big & small) to discover
The labs are making AI models, they do not have full policy recommendations on how to adapt to them; that is for governments to figure out ~ Ethan Mollick

all backend engineers should make at least 500k.
all frontend engineers should make at most 500k. ~ Kevin Naughton Jr.

Disappointing to see the incentives for the grok team to cheat and deceive in evals.
Tl;dr o3-mini is better in every eval compared to grok 3.
Grok 3 is genuinely a decent model, but no need to over sell. ~ Boris Power
That’s all for today! See you tomorrow with more such AI-filled content.
Don’t forget to share this newsletter on your social channels and tag Unwind AI to support us!
PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉