unwind ai
AI Generated Data Leads to AI Model Collapse

PLUS: 10x cheaper RAG, Adobe's new AI tools for generating mockups

Shubham Saboo & Gargi Gupta
July 29, 2024

Today’s top AI Highlights:

AI’s echo chamber: training models on AI-generated data leads to “model collapse”
Opensource model for 10x cheaper knowledge graph construction for RAG
Adobe’s new AI features to quickly create mockups and detailed vector shapes
Driverless cars perform a perfect Tandem Drift, thanks to AI
Opensource, self-hosted AI coding assistant

& so much more!

Read time: 3 mins

Latest Developments 🌍

Long-term Effects of Using AI-Generated Training Data 😵‍💫

We’re increasingly using LLMs to generate content. These models have so far been trained mostly on human-generated data, but this may change. If the training data for most future models (for example, GPT-8) is also scraped from the web, they will inevitably train on data generated by their predecessors (for example, GPT-4 and GPT-5).

What happens when models are increasingly trained on AI-generated data? A study shows that they suffer from “model collapse,” a phenomenon that threatens the long-term viability of training AI models. The models lose sight of the original data’s true complexity and diversity, making their output progressively unreliable.

Key Highlights:

Model collapse progressively corrupts AI understanding - Initially, AI models trained on synthetic data struggle to represent unusual or less frequent events, which are present in the “tails” of the data distribution. As this process continues, the models’ understanding of the data narrows, ultimately leading to a complete misrepresentation of the original data.
LLMs exhibit vulnerability to model collapse - LLMs, fine-tuned over multiple generations on AI-generated text, decline in performance and increasingly produce repetitive and nonsensical outputs.
Even a small amount of real data makes a difference - Incorporating even a small percentage of original, human-generated data during training can significantly counteract the effects of model collapse, preserving the model's accuracy and ability to generalize.
The importance of data provenance - As AI-generated content grows online, distinguishing between human and AI authorship will be crucial for maintaining the integrity of training datasets.
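
The narrowing described in the highlights above can be reproduced in a toy setting. The sketch below is not the study's experiment: it assumes the "model" is just a Gaussian fitted to a small sample, and the function name and parameters are made up for illustration. Each generation is trained only on samples drawn from the previous generation's fit, optionally mixed with a few fresh "real" samples.

```python
import random
import statistics


def simulate(generations, sample_size, real_per_generation=0, seed=0):
    """Return the fitted standard deviation after repeated self-training.

    Toy assumption: the real data follows N(0, 1) and each "model" is a
    Gaussian whose mean and standard deviation are estimated from its
    training sample.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    for _ in range(generations):
        # Synthetic samples come from the previous generation's fit.
        synthetic = [rng.gauss(mu, sigma)
                     for _ in range(sample_size - real_per_generation)]
        # Optional fresh samples from the true distribution.
        real = [rng.gauss(0.0, 1.0) for _ in range(real_per_generation)]
        data = synthetic + real
        # "Train" the next generation by refitting the Gaussian.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
    return sigma


# Pure self-training versus keeping 2 real samples out of every 10.
print("synthetic only:", simulate(1000, 10))
print("with real data:", simulate(1000, 10, real_per_generation=2))
```

In this toy, the synthetic-only spread typically shrinks toward zero because every refit loses a little of the tails and nothing ever restores them, while a handful of real samples per generation keeps the fitted spread from vanishing.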

Opensource LLM for Constructing Knowledge Graphs for RAG 🕸️

Microsoft’s recent paper on Knowledge Graphs for RAG gained wide attention. The approach uses LLMs with tailored prompts to organize information into a structured format of entities and their relationships, making sense of information spread across a large corpus of data.

However, this knowledge graph construction method is expensive: it requires at least one generated output token for every ingested input token, making it impractical. SciPhi, a company focused on deploying and scaling RAG, has released Triplex, a new model that converts unstructured data into structured knowledge graphs. It surpasses GPT-4o at knowledge graph construction for less than 1/10th the cost!
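
To see why a floor of one output token per input token matters, here is a rough back-of-the-envelope sketch. The function name, the `output_ratio` parameter, and both prices are hypothetical placeholders, not real rates for any model.

```python
def extraction_cost(input_tokens, price_in_per_million,
                    price_out_per_million, output_ratio=1.0):
    """Estimate the LLM bill for turning a corpus into a knowledge graph.

    output_ratio is generated tokens per ingested token; the text above
    puts the floor for this method at 1.0.
    """
    output_tokens = input_tokens * output_ratio
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


# Placeholder prices (made up for illustration): 1.0 per million input
# tokens and 2.0 per million output tokens, for a 1-million-token corpus.
print(extraction_cost(1_000_000, 1.0, 2.0))
```

Because output tokens grow in step with input tokens, the estimate scales linearly with corpus size, which leaves the per-token price as the main lever for reducing cost.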

Key Highlights:

Cost Reduction - The 10x cost reduction is made possible by Triplex’s smaller model size and its ability to operate without few-shot context.
High-Performance - Triplex extracts “semantic triples,” the fundamental subject-predicate-object units of a knowledge graph, directly from text. For example, for input “Paris is the capital of France”, Triplex extracts the triples: “CITY: Paris > CAPITAL_OF > COUNTRY: France” and “CITY: Paris > LOCATED_IN > COUNTRY: France.”
Evaluation - Evaluation using Claude 3.5 Sonnet shows that Triplex outperforms GPT-4o in accuracy for knowledge graph construction tasks.
Ready to Use - Triplex is available on HuggingFace and Ollama. SciPhi also provides the R2R RAG engine, designed to work seamlessly with Triplex and Neo4j for local knowledge graph construction.
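
To make the “semantic triples” idea concrete, here is a minimal sketch that parses triples written in the notation of the Paris example above and groups them into a small graph. The `Triple` type and both helper functions are hypothetical, and the string format is assumed from that example only; Triplex's actual output format may differ.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject_type: str
    subject: str
    predicate: str
    object_type: str
    obj: str


def parse_triple(text: str) -> Triple:
    """Parse 'TYPE: name > PREDICATE > TYPE: name' into a Triple."""
    subject_part, predicate, object_part = (p.strip() for p in text.split(">"))
    subject_type, subject = (p.strip() for p in subject_part.split(":", 1))
    object_type, obj = (p.strip() for p in object_part.split(":", 1))
    return Triple(subject_type, subject, predicate, object_type, obj)


def build_graph(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


raw = [
    "CITY: Paris > CAPITAL_OF > COUNTRY: France",
    "CITY: Paris > LOCATED_IN > COUNTRY: France",
]
graph = build_graph(parse_triple(line) for line in raw)
print(graph)
```

Each triple becomes one directed edge, so a graph store only needs the subject, the predicate, and the object to link entities together.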

Quick Bites 🤌

Adobe has supercharged Illustrator and Photoshop with new AI-powered features and workflow enhancements, empowering creatives to design faster and smarter.
- Firefly-powered AI features like Generative Shape Fill in Illustrator and Generate Image in Photoshop let you add intricate vector elements to shapes or create stunning images with simple text prompts.
- Create realistic product mockups with the new Mockup feature in Illustrator, precisely measure dimensions for accurate artwork scaling with the Dimension Tool, and easily select and apply adjustments to specific areas of images using the new Brush tools.
- Experiment with different design options effortlessly using features like text-to-pattern generation in Illustrator and style transfer capabilities.
Stanford Engineering and Toyota Research Institute have created the world’s first autonomous AI-directed Tandem Drift team, with driverless cars guided by AI to perform complex drifting maneuvers. This breakthrough aims to advance AI’s potential to improve safety in automated driving on public roads.
Elon Musk’s X platform is under scrutiny from UK and Irish data regulators for default settings that opt users in to having their posts used to train Grok AI, potentially violating GDPR rules. Regulators demand transparency and proactive user notification regarding data usage for AI training.


Tools of the Trade ⚒️

Tabby: An open-source, self-hosted AI coding assistant. With Tabby, every team can set up its own LLM-powered code completion server with ease.
Supermemory: AI second brain for all your saved stuff. It helps you organize, search, and use your saved information efficiently, with features like a search engine, writing assistant, and canvas. It’s free, integrates with popular apps, and is opensource.
OpenPlexity Pages: Opensource alternative to Perplexity Pages that transforms research into structured articles, allowing for customization and running locally. It cannot produce publication-ready articles but is useful for the initial writing phase and can enhance content with AI-generated visuals.
Awesome LLM Apps: Build awesome LLM apps using RAG for interacting with data sources like GitHub, Gmail, PDFs, and YouTube videos through simple text prompts. These apps will let you retrieve information, engage in chat, and extract insights directly from content on these platforms.

Hot Takes 🔥

Crazy that Llama 3.1 405B was trained on 16k H100s
And by the end of the year, multiple labs are going to be working on/shipping models trained on closer to 100k H100s
We ain’t seen nothing yet ~ Matt Shumer
We're thrilled to announce the latest addition to the GPT-4o family: GPT-4o-Nano. This model takes the lightweight approach of GPT-4o and 4o-mini to a whole new level, providing unparalleled efficiency without compromising performance. ~ Flowers

Meme of the Day 🤡

If you can’t beat them join them

That’s all for today! See you tomorrow with more such AI-filled content.

Real-time AI Updates 🚨

⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!

PS: We curate this AI newsletter every day for FREE, and your support is what keeps us going. If you find value in what you read, share it with your friends!
