
LLM for Lightning Fast AI Search

PLUS: Phind-405B matches Claude 3.5 Sonnet, vLLM upgrade for Llama 3.1

Today’s top AI Highlights:

  1. Phind releases faster, more powerful models for coding and AI search

  2. Open-source library for fast Llama 3.1 inference and low latency

  3. Nvidia accused of copying patented data processing unit technology

  4. Open-source implementation of Google DeepMind’s AlphaFold 3

  5. Claude gets long-term memory similar to OpenAI’s ChatGPT

& so much more!

Read time: 3 mins

Latest Developments

AI search engine Phind has launched its new flagship model, Phind-405B. Built on Meta’s Llama 3.1 405B, it excels at programming and technical tasks, supports a 128K-token context, and scores 92% on HumanEval (0-shot), on par with Claude 3.5 Sonnet.
Alongside it, Phind has released Phind Instant, a model based on Meta’s Llama 3.1 8B that makes AI-powered search significantly faster. It runs at up to 350 tokens per second on a customized Nvidia TensorRT-LLM inference server.

Key Highlights:

  1. Phind-405B excels in real-world tasks - Notably strong performance in designing and implementing web apps, like creating landing pages based on research.

  2. FP8 mixed precision training - Phind-405B was trained with FP8 mixed precision, cutting memory usage by 40% compared to BF16 without compromising training quality; a minimal sketch of the technique follows this list.

  3. Phind Instant delivers rapid search - Trained on a similar dataset to Phind-405B, it offers a near-instantaneous search experience compared to traditional AI-powered search.

  4. Enhanced search functionality - Phind now prefetches web results as you type and uses upgraded embeddings for better relevance ranking, making searches faster and more accurate.

  5. Availability - Phind-405B is accessible to all Phind Pro users, while the Phind Instant model is already integrated into the platform.
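
Phind hasn’t published its training code, but as a rough illustration of the technique, here’s a minimal FP8 mixed-precision sketch using NVIDIA’s Transformer Engine. The layer size, recipe settings, and training loop are invented for the example:

```python
# Minimal FP8 mixed-precision sketch with NVIDIA Transformer Engine.
# NOT Phind's training code; shapes and recipe settings are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID recipe: E4M3 for forward activations/weights, E5M2 for gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

# A toy FP8-capable layer standing in for a transformer block.
model = te.Linear(4096, 4096, bias=True).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
target = torch.randn(8, 4096, device="cuda")

# Matmuls inside this context run in FP8; master weights and optimizer
# state stay in higher precision, which is where the memory savings
# relative to BF16 come from.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()
optimizer.step()
```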

vLLM is an open-source library for accelerating LLM inference, including the Llama model family. The latest release, vLLM v0.6.0, delivers significant performance gains for Llama 3.1: it addresses key bottlenecks in the previous version, yielding up to 2.7x higher throughput and 5x faster token generation for Llama 3.1 8B, with comparable gains for Llama 3.1 70B. This makes vLLM 0.6.0 a top contender among LLM inference engines.

Key Highlights:

  1. Throughput Boost - Achieves up to 2.7x higher throughput for Llama 3.1 8B and 1.8x for Llama 3.1 70B compared to the previous version, enabling faster processing of a larger volume of requests.

  2. Latency Reduction - Delivers up to 5x faster time-per-output-token (TPOT) for Llama 3.1 8B and 2x faster TPOT for Llama 3.1 70B, significantly reducing response times for LLM interactions.

  3. Reduced CPU Overhead - Decoupling the API server and inference engine, along with optimized data structures, leads to a significant reduction in CPU overhead, freeing up resources for core inference tasks.

  4. Multi-Step Scheduling - By batching multiple scheduling steps, vLLM minimizes GPU idle time, yielding a 28% throughput improvement for Llama 3.1 70B on 4xH100 GPUs.

  5. Get Started Easily - Install vLLM v0.6.0 with pip and integrate it into existing LLM workflows with minimal code changes. Detailed instructions and examples are in the vLLM documentation, and a minimal usage sketch follows this list.
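
If you want to try it, here’s a minimal offline-inference sketch using vLLM’s Python API. It assumes a CUDA GPU and access to the gated Llama 3.1 8B Instruct weights on Hugging Face; the prompts and sampling settings are just examples:

```python
# Install: pip install "vllm>=0.6.0"
from vllm import LLM, SamplingParams

# Load the model once; vLLM manages KV-cache memory and batching.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = [
    "Explain speculative decoding in two sentences.",
    "Summarize the benefits of paged attention.",
]

# generate() batches the prompts internally for high throughput.
for output in llm.generate(prompts, sampling):
    print(output.prompt)
    print(output.outputs[0].text)
```

For the decoupled API server mentioned above, `vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct` starts an OpenAI-compatible endpoint; per the release notes, multi-step scheduling is controlled with the `--num-scheduler-steps` flag.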

Quick Bites

Texas startup Xockets is suing Nvidia for patent infringement, alleging that Nvidia’s BlueField DPUs and its ConnectX and NVLink Switch products are based on Xockets’ patented technology.
Xockets also accuses Nvidia and Microsoft of operating a buyers’ cartel, using a company called RPX to coordinate and drive down prices for AI server technology, giving them control over the market.

Google Cloud has integrated Anthropic’s Claude models with BigQuery, letting businesses summarize, translate, and analyze data directly within the platform. This puts advanced AI capabilities within reach of anyone who can write SQL, via BigQuery ML.
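
As a sketch of what that looks like from Python, the snippet below creates a remote model over Claude and calls it with BigQuery ML’s ML.GENERATE_TEXT. The dataset, connection, table, and endpoint names are assumptions for illustration; check Google Cloud’s docs for the exact Claude endpoints available in your region:

```python
# Hedged sketch: Claude in BigQuery via a remote model + ML.GENERATE_TEXT.
# Dataset `demo`, connection `us.my_conn`, table `demo.reviews`, and the
# endpoint string are all assumed names for illustration.
from google.cloud import bigquery

client = bigquery.Client()

# One-time setup: a remote model backed by Claude on Vertex AI.
client.query("""
CREATE OR REPLACE MODEL `demo.claude_model`
REMOTE WITH CONNECTION `us.my_conn`
OPTIONS (ENDPOINT = 'claude-3-5-sonnet@20240620')
""").result()

# Summarize rows with plain SQL.
rows = client.query("""
SELECT ml_generate_text_llm_result AS summary
FROM ML.GENERATE_TEXT(
  MODEL `demo.claude_model`,
  (SELECT CONCAT('Summarize this review: ', review_text) AS prompt
   FROM `demo.reviews` LIMIT 5),
  STRUCT(TRUE AS flatten_json_output))
""").result()

for row in rows:
    print(row.summary)
```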

AI robotics startup Weave is building Isaac, a personal robot that helps with household chores. The company has started taking pre-orders; reserving an Isaac costs $1,000. It plans to deliver its first batch to 30 customers in fall 2025.

AI startup Ligo is building an open-source implementation of DeepMind’s frontier model AlphaFold 3. Still in the early research phase, the model currently supports single-chain protein predictions, with ligand, multimer, and nucleic acid predictions coming soon. You can join the waitlist for early beta testing.

Tools of the Trade

  1. Claude Memory: A Chrome extension that gives Claude long-term memory. It lets you store and retrieve important information from your conversations with Claude for more personalized, context-aware outputs.

  2. Illuminate by Google: Converts academic papers to audio content for better understanding. It generates audio with two AI voices in conversation, discussing the key points of papers. It is currently optimized for published computer science academic papers.

  3. Haptic: Open-source markdown note-taking app that is lightweight and designed for local-first storage. It supports desktop and web deployment, offering a minimal and efficient user experience.

  4. Awesome LLM Apps: Build LLM apps that use RAG to interact with data sources like GitHub, Gmail, PDFs, and YouTube videos through simple text prompts. These apps let you retrieve information, chat, and extract insights directly from content on these platforms; a bare-bones RAG sketch follows this list.
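
To give a taste of the pattern these apps share, here’s a bare-bones RAG sketch using the OpenAI SDK: embed documents, retrieve the closest match, and answer with an LLM. The documents, model choices, and question are made-up examples, not code from the repo:

```python
# Bare-bones RAG: embed docs, retrieve by cosine similarity, then answer.
# Assumes OPENAI_API_KEY is set; docs and models are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "Invoices are stored in the billing/ folder of the repo.",
    "The deploy script lives in scripts/deploy.sh and needs sudo.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question: str) -> str:
    q = embed([question])[0]
    # Cosine similarity picks the best-matching document chunk.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = docs[int(np.argmax(sims))]
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Context: {context}\n\nQuestion: {question}"}],
    )
    return chat.choices[0].message.content

print(answer("Where is the deploy script?"))
```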

Hot Takes

  1. I continue to wonder about the plan for Grok
    Grok 2 is a good GPT-4 class model with some odd quirks & which is embedded in Twitter (no one's idea of productivity app). The rate of training suggests Grok 3 may be an early GPT-5 class model, what is the use case xAI sees for it? ~ Ethan Mollick

  2. sf gossip:
    - meta is struggling to unplug all their old A100s to plug their H100s instead
    - (old but still true) the bottleneck is not gpus but parallelization over datacenters with 20k H100s
    - 3.5 opus will be less surprising than sonnet ~ Michaël Trazzi

Meme of the Day

That’s all for today! See you tomorrow with more such AI-filled content.

Real-time AI Updates 🚨

⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!

Unwind AI - Twitter | LinkedIn | Instagram | Facebook

PS: We curate this AI newsletter every day for FREE, your support is what keeps us going. If you find value in what you read, share it with at least one (or 20) of your friends!
