Fastest AI Inference for AI Agents
PLUS: OpenAI working on GPT-5, Rerank Model for RAG Apps
Today’s top AI Highlights:
Cerebras launches AI inference service; 20x faster at 1/5th cost
Together AI and Salesforce team up to supercharge RAG
OpenAI is developing new flagship AI model with Strawberry as the secret sauce
Google releases experimental model Gemini 1.5 Flash 8B
Claude’s Artifacts are now available for free users
& so much more!
Read time: 3 mins
Latest Developments
Cerebras has launched its new AI inference solution, boasting speeds 20x faster than NVIDIA GPU-based hyperscale clouds. Using their third-generation Wafer Scale Engine, they achieve 1,800 tokens per second with the Llama3.1 8B model and an impressive 450 tokens per second with the Llama3.1 70B model. This performance comes at 1/5th the price of comparable solutions. You can access the service today through their API, which has generous rate limits and uses 16-bit weights for maximum accuracy.
Key Highlights:
Blazing fast speeds - Cerebras achieves 20x faster inference than hyperscale clouds and is 2.4x faster than Groq on Llama3.1 8B, making it ideal for applications requiring real-time responses.
Cost-effective - Cerebras makes large-scale LLM deployment more affordable at $0.10 per million tokens for the 8B model and $0.60 per million tokens for the 70B model, compared to $2.90 per million tokens on comparable cloud platforms.
Maintains Accuracy - Cerebras uses native 16-bit weights, ensuring the highest accuracy responses without compromising performance.
Easy Integration - You can easily integrate Cerebras inference into existing applications using their API, which follows the familiar OpenAI Chat Completions format.
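Because the API speaks the OpenAI Chat Completions format, moving an existing app over is mostly a matter of pointing the client at a different base URL. Here is a minimal sketch in Python; the base URL, environment variable, and model id are assumptions, so check Cerebras' docs for the exact values:

```python
# Minimal sketch: calling Cerebras inference through the OpenAI-compatible
# Chat Completions format. The base URL and model id below are assumptions;
# verify them in the Cerebras docs and set CEREBRAS_API_KEY yourself.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.chat.completions.create(
    model="llama3.1-8b",                      # assumed model id
    messages=[{"role": "user", "content": "Explain wafer-scale inference in one line."}],
)
print(response.choices[0].message.content)
```

Since only the client configuration changes, an existing OpenAI-based app can switch backends without touching its prompt or response-handling code.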
Salesforce and Together AI have partnered to enhance enterprise search and RAG systems. Salesforce has developed LlamaRank, a new state-of-the-art reranker model that outperforms existing models in accuracy and efficiency. Together AI is offering exclusive access to LlamaRank through its new serverless Together Rerank API, making it easy for developers to integrate this technology. This collaboration aims to improve information retrieval by filtering out irrelevant documents and boosting the accuracy of search results.
Key Highlights:
Performance - LlamaRank surpasses other leading reranker models like Cohere Rerank v3 and Mistral-7B QLM, particularly excelling in code search.
Efficiency - LlamaRank enables faster and more cost-effective RAG systems by reducing the processing of irrelevant documents before they reach the language model.
Integration - The Together Rerank API lets developers incorporate LlamaRank and other compatible reranker models (like Cohere Rerank) into their applications with just a few lines of code (see the sketch after this list).
Extended Capabilities - LlamaRank supports larger document sizes (up to 8,000 tokens) and can handle semi-structured data like JSON, email, tables, and code, increasing its versatility in enterprise environments.
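For a sense of what that integration looks like, here is a minimal reranking sketch against the Together Rerank API. The model id and response field names are assumptions based on the announcement, so verify them against Together AI's docs:

```python
# Minimal sketch: reranking candidate documents for a query with the
# Together Rerank API. Model id and response fields are assumptions;
# requires TOGETHER_API_KEY in the environment.
from together import Together

client = Together()

query = "How do I rotate an API key?"
documents = [
    "Billing is handled at the end of each month.",
    "To rotate an API key, revoke the old key and generate a new one in settings.",
    "Our office is closed on public holidays.",
]

response = client.rerank.create(
    model="Salesforce/Llama-Rank-V1",   # assumed model id
    query=query,
    documents=documents,
    top_n=2,                            # keep only the two most relevant docs
)

for result in response.results:
    print(result.index, result.relevance_score, documents[result.index])
```

In a RAG pipeline, the top-ranked documents would then be passed to the language model, so irrelevant chunks never consume context or tokens.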
Quick Bites
OpenAI is reportedly developing two new AI models: Strawberry and Orion. Strawberry focuses on solving complex math and programming problems, while Orion aims to surpass GPT-4.
Strawberry will tackle previously unseen math problems and optimize programming tasks.
Strawberry’s technology will help create advanced AI agents that can plan ahead and take actions on behalf of the user.
Orion will be OpenAI’s new flagship model designed to outperform GPT-4. Strawberry will contribute by generating data for Orion.
It's uncertain whether Strawberry will launch this year. If released, it would be a distilled version of the original model.
Anthropic is making Artifacts available to all Claude.ai users across its Free, Pro, and Team plans. You can now also create and view Artifacts in Claude's iOS and Android apps.
Google has launched three new experimental AI models: Gemini 1.5 Flash-8B, Gemini 1.5 Pro, and an improved Gemini 1.5 Flash, now ready for developer testing.
Gemini 1.5 Flash-8B: An optimized 8-billion-parameter variant suited to high-volume multimodal tasks and long-context summarization, available via Google AI Studio and the Gemini API as gemini-1.5-flash-8b-exp-0827 (see the sketch after this list).
Improved Gemini 1.5 Pro and Flash: Significant performance upgrades across benchmarks, with stronger results on coding and complex prompts, accessible under gemini-1.5-pro-exp-0827 and gemini-1.5-flash-exp-0827.
Model updates: Free testing tiers are available, and starting September 3, requests to the older gemini-1.5-pro-exp-0801 model will be automatically redirected to the new version to simplify access.
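Trying the experimental models takes only a few lines through the Gemini API. A minimal sketch, assuming an API key from Google AI Studio exported as GOOGLE_API_KEY (the variable name is an assumption):

```python
# Minimal sketch: prompting the experimental Gemini 1.5 Flash-8B model via
# the Gemini API. The model id comes from the announcement above.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash-8b-exp-0827")
response = model.generate_content("Summarize this newsletter issue in two sentences.")
print(response.text)
```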
Tools of the Trade
Selective Context: Compresses your prompt and context so LLMs can process 2x more content while using 40% less memory. It is especially useful for working with long documents and maintaining long conversations without compromising performance.
Sparkle: Uses AI to automatically organize files in your Downloads, Desktop, and Documents folders by creating a custom folder system based on how you work. It manages both new and existing files, so you never have to manually organize them again.
Dataleap: AI assistant for market research combining ChatGPT-style interface with a curated database of market data. Every answer is firmly linked to its source.
Awesome LLM Apps: Build LLM apps that use RAG to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos through simple text prompts. These apps let you retrieve information, chat, and extract insights directly from content on these platforms.
Hot Takes
Put another way - OAI doesn't want to release Strawberry because they worry that competitors will use their API to create synthetic data to train powerful models. ~ Bindu Reddy
This is a tough call and will make some people upset, but, all things considered, I think California should probably pass the SB 1047 AI safety bill. ~ Elon Musk
Meme of the Day
That’s all for today! See you tomorrow with more such AI-filled content.
Real-time AI Updates 🚨
⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!
PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one (or 20) of your friends!