
Run Multiple LLMs on Shared GPUs

PLUS: Zero-trust dev environment, Fastest API gateway

Today’s top AI Highlights:

  1. Zero-trust development environments now run on your local machine

  2. Deploy and serve multiple LLMs 10x faster with this open-source framework

  3. Run AI agents and Code Interpreter together in your IDE

  4. This high-performance API Gateway routes requests at blazing-fast speed

  5. SDK to run various GGML and ONNX models locally

& so much more!

Read time: 3 mins

AI Tutorials

We’re always looking for ways to automate complex workflows. Building tools that can search, synthesize, and summarize information is a key part of this, especially when dealing with ever-changing data like news.

For this tutorial, we’ll create a multi-agent AI news assistant using OpenAI’s Swarm framework along with Llama 3.2. You’ll be able to run everything locally, using multiple agents to break down the task into manageable, specialized roles—all without cost.

We will use:

  • Swarm to manage the interactions between agents,

  • DuckDuckGo for real-time news search, and

  • Llama 3.2 for processing and summarizing news.

Each agent will handle a specific part of the workflow, resulting in a modular and flexible app that’s easy to adapt or expand.
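To make this concrete, here is a minimal sketch of the core loop, assuming Ollama is serving Llama 3.2 locally through its OpenAI-compatible endpoint. The agent names, instructions, and search helper are illustrative, not the tutorial's exact code:

    # Minimal sketch: Swarm agents backed by a local Llama 3.2 via Ollama.
    # Assumes Ollama is running at localhost:11434 and the `swarm` and
    # `duckduckgo_search` packages are installed.
    from duckduckgo_search import DDGS
    from openai import OpenAI
    from swarm import Agent, Swarm

    # Point Swarm at Ollama's OpenAI-compatible API instead of OpenAI's cloud
    client = Swarm(client=OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"))

    def search_news(topic: str) -> str:
        """Fetch recent headlines for a topic via DuckDuckGo."""
        results = DDGS().text(f"{topic} news", max_results=5)
        return "\n".join(f"{r['title']} - {r['body']}" for r in results)

    searcher = Agent(
        name="News Searcher",
        model="llama3.2",
        instructions="Find the latest news on the user's topic using search_news.",
        functions=[search_news],
    )
    summarizer = Agent(
        name="News Summarizer",
        model="llama3.2",
        instructions="Summarize the gathered news items in three concise bullets.",
    )

    raw = client.run(agent=searcher, messages=[{"role": "user", "content": "AI chips"}])
    summary = client.run(agent=summarizer, messages=raw.messages)
    print(summary.messages[-1]["content"])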

We share hands-on tutorials like this 2-3 times a week, designed to help you stay ahead in the world of AI. If you're serious about levelling up your AI skills and staying ahead of the curve, subscribe now and be the first to access our latest tutorials.

Don’t forget to share this newsletter on your social channels and tag Unwind AI (X, LinkedIn, Threads, Facebook) to support us!

Latest Developments

Gitpod has launched Gitpod Flex, a new platform that automates your software development lifecycle using zero-trust environments. Your code, data, and secrets stay within your private network, while you still get automated setup, customizable workflows, and integration with popular tools. Flex runs on your laptop, in the cloud, or on-premises, giving you complete control over your development environment, and it supports Dev Containers alongside new automation features.

Key Highlights:

  1. Self-Service Environments with Automations - Automate tedious setup tasks like seeding databases, provisioning infrastructure, and even managing AI agents. You define these automations in YAML files right alongside your code.

  2. Zero-Trust Security Built-In - Gitpod Flex keeps your source code and sensitive data within your own private network, addressing security concerns around remote development and AI coding. Every action is logged and audited, providing enhanced security.

  3. Dev Container - If you’re already using Dev Containers, you can bring your existing configurations directly into Gitpod Flex. This means consistent development environments across your team, whether working locally or in the cloud.

  4. Works Where You Do - Gitpod Flex is designed to run on your laptop (starting with Apple Silicon), in your cloud environment, or on-premises. This flexibility lets you use existing resources and choose the setup that best suits your requirements.

  5. Quick Setup - You can start with Gitpod Desktop for local development with zero infrastructure costs, or deploy a self-hosted AWS runner. Setup requires only a devcontainer.json for environment configuration and an automations.yaml for defining custom workflows and tasks.
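That automations.yaml is where the self-service workflows from point 1 live. As a rough sketch (field names follow Gitpod's published examples, but verify against the current Flex docs), it could look like this:

    # automations.yaml - hedged sketch; check the Gitpod Flex docs for the schema
    services:
      database:
        name: PostgreSQL
        commands:
          start: docker run --rm -p 5432:5432 postgres
    tasks:
      seed:
        name: Seed database
        command: python seed.py
        triggeredBy:
          - postDevcontainerStart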

ServerlessLLM, a new open-source framework, simplifies deploying and serving LLMs. It's designed for cost-effective multi-LLM serving, especially in environments with limited GPU resources. The framework offers optimized performance and easy integration with popular tools like HuggingFace Transformers and an OpenAI-compatible query API. Its serverless architecture dynamically loads and unloads models for better resource utilization.

Key Highlights:

  1. Efficient Resource Sharing - Multiple LLMs can efficiently share GPUs, minimizing costs associated with dedicated GPU setups and maximizing hardware utilization through dynamic loading and live migration of models.

  2. Optimized Performance - Using vLLM and HuggingFace Transformers, ServerlessLLM achieves significantly faster loading speeds and lower latency compared to traditional methods. This results in a snappier user experience and reduced computational overhead.

  3. Simplified Deployment - Deploy with ease using Ray Cluster and Kubernetes via KubeRay. Seamlessly integrate existing HuggingFace Transformers models or deploy custom-built models without complex configurations.

  4. Flexible Storage Options - ServerlessLLM Store optimizes model checkpoint loading from various storage tiers (DRAM, SSD, HDD), offering flexibility in managing model data based on performance and cost requirements. This feature provides fine-grained control over storage utilization.
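Since ServerlessLLM exposes an OpenAI-compatible endpoint, querying a deployed model is an ordinary chat-completions call. Here is a hedged sketch; the URL, port, and model name are assumptions, so substitute the values from your own deployment:

    # Hedged sketch: querying a ServerlessLLM deployment through its
    # OpenAI-compatible endpoint. URL, port, and model name are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8343/v1", api_key="not-needed")
    response = client.chat.completions.create(
        model="facebook/opt-1.3b",
        messages=[{"role": "user", "content": "What is serverless LLM serving?"}],
    )
    print(response.choices[0].message.content)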

Quick Bites

Bolt.new, the browser-based full-stack development tool, now gives you granular control over which files the AI modifies, letting you selectively lock files or folders to prevent unintended changes. This is extremely useful when you want to focus the AI on just one part of your app and avoid unintentional rewrites elsewhere.

AI code assistant CodeGPT now lets you search and use a wide library of AI agents and frameworks like Swarm AI assistant, Pandas Expert and CrewAI expert, and run tools like Code Interpreter directly in the IDE. This eliminates context switching and gives on-demand AI assistance for diverse coding tasks and data analysis within the IDE.

Hugging Face has released smol-tools, a collection of lightweight AI tools powered by llama.cpp and small language models that run locally without a GPU. Built on SmolLM2-1.7B Instruct, the suite features a summarizer, a rewriter, and an AI agent that can perform tasks through tool integrations like web browsing and weather lookup.
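smol-tools wires these pieces together for you, but to get a feel for the underlying stack, here is a hedged sketch of running SmolLM2-1.7B-Instruct on CPU with llama-cpp-python; the repo id and GGUF filename pattern are assumptions:

    # Hedged sketch: loading SmolLM2-1.7B-Instruct via llama-cpp-python,
    # the CPU-friendly stack smol-tools builds on. Repo id and filename
    # pattern are assumptions; adjust to the quantization you download.
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF",
        filename="*q4_k_m.gguf",
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "In one line, why do local models matter?"}]
    )
    print(out["choices"][0]["message"]["content"])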

Mistral AI has released a content moderation API that detects undesirable text across multiple policy dimensions using the same tech powering Le Chat. This multilingual API offers two endpoints (raw text and conversational) and detailed technical documentation to help you build safer applications.
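Trying it is a short call through the official Python client. Below is a hedged sketch for the raw-text endpoint; the model name follows Mistral's launch docs, so verify against the current API reference:

    # Hedged sketch: classifying raw text with Mistral's moderation API via
    # the official `mistralai` client. Model name per launch docs; verify.
    from mistralai import Mistral

    client = Mistral(api_key="YOUR_MISTRAL_API_KEY")
    response = client.classifiers.moderate(
        model="mistral-moderation-latest",
        inputs=["Some user-generated text to screen."],
    )
    # Each result maps policy categories (violence, hate, etc.) to flags
    print(response.results[0].categories)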

Tools of the Trade

  1. Nexa SDK: A toolkit to run various AI models (including text, image, vision, and speech models) locally. It supports both ONNX and GGML model formats while offering an OpenAI-compatible API server and user interface.

  2. MagicAPI AI Gateway: The fastest AI gateway proxy, written in Rust and optimized for maximum performance. This high-performance API gateway routes requests to various AI providers (OpenAI, Groq) with streaming support, making it perfect for those who need reliable and blazing-fast AI API access.

  3. LoRA Garden: A web application that lets you search for LoRA models, organize them into personalized containers (gardens), and generate optimized prompts from selected models and user input. It uses the Civitai API to fetch model data and the OpenAI API to generate prompts.

  4. Awesome LLM Apps: Build awesome LLM apps using RAG to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos through simple text. These apps will let you retrieve information, engage in chat, and extract insights directly from content on these platforms.

Hot Takes

  1. Dating as a founder in SF is never knowing if they’re interested in you or investing in your hot startup ~
    Peggy Wang


  2. We are just not used to abundant "intelligence" (of a sort), which leads people to miss a huge value of AI.
    Don't ask for an idea, ask for 30. Don't ask for a suggestion on how to end a sentence, ask for 20 in different styles. Don't ask for advice, ask for many strategies. Pick.

    Humans curate.

    And modify. And combine. And reject. And use the AI as inspiration. And know when to avoid using it for inspiration.
    To be clear, working with AI well is not a passive process, but it also isn't limited to the patience of a human editor or co-author. ~
    Ethan Mollick

That’s all for today! See you tomorrow with more such AI-filled content.

Don’t forget to share this newsletter on your social channels and tag Unwind AI to support us!

Unwind AI - X | LinkedIn | Threads | Facebook

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one, two (or 20) of your friends 😉 
