Opensource Dataset with 15 Trillion Tokens

PLUS: Microsoft's $1.5B investment in G42, LLMs as the operating system, LLMs as cybersecurity experts

Today’s top AI Highlights:

  1. A dataset of high-quality 15 Trillion tokens now openly available

  2. GPT-4 resolves “one-day” cybersecurity flaws by reading descriptions and advisories

  3. Microsoft to invest $1.5 billion in a UAE-based AI company

  4. Embedding LLM into the operating system for better deployment of AI agents

  5. 3 ways to run Llama-3 locally (100% free and without internet)

& so much more!

Read time: 3 mins

Exciting Opportunity: Share how you use AI and get featured in Unwind AI! Details below.

Latest Developments 🌍

15T Tokens of the Finest Data the Web Has to Offer 🌐

Llama 3 models are trained on over 15 trillion tokens. Assembling a dataset that large may seem out of reach, but it isn't. FineWeb is a dataset of more than 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl dumps since 2013. It not only surpasses the previously available RefinedWeb in its cleaning and deduplication processes but also comes with open access to all related processing tools and models. The dataset is freely available to encourage ongoing research and development in AI.

Key Highlights:

  1. Comprehensive Data: Includes data from 95 CommonCrawl dumps since 2013, spanning a wide array of topics and domains, all processed through the datatrove library to ensure high-quality, usable data for ML applications.

  2. Enhanced Processing and Tools: The release also provides all the necessary code and tools to replicate FineWeb’s processing pipeline.

  3. Performance evaluation: A series of 1.8B parameter models were trained on FineWeb along with other datasets like RefinedWeb, C4, Dolma v1.6, The Pile, and SlimPajama. The models trained on FineWeb outperformed those trained on other datasets on various standard benchmarks.
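
To make "cleaned and deduplicated" concrete, here is a toy sketch of that kind of filtering step. It is not FineWeb's pipeline and does not use the datatrove library: the function names, the `min_words` threshold, and the exact-match hashing are all assumptions chosen for illustration, and a production pipeline may rely on different techniques.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_and_dedup(docs, min_words=5):
    """Drop very short documents and exact duplicates (after normalization).

    Hypothetical helper for illustration only; min_words is an arbitrary threshold.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to be useful training text
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept an identical document
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The  quick brown fox jumps over   the lazy dog.",  # duplicate with extra spaces
    "Click here",  # too short
    "Large language models are trained on web-scale text corpora.",
]
kept = clean_and_dedup(docs)
print(len(kept))
```

Only the first and last documents survive: the second is a whitespace variant of the first, and the third falls below the word threshold.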


GPT-4 Exploits 87% of One-Day Vulnerabilities 🔐

Cybersecurity threats continue to evolve, and defenses need to keep pace. Among these threats, “one-day” vulnerabilities are a major risk. These are security flaws that have been publicly disclosed but not yet patched, creating a window during which attackers can exploit known weaknesses before fixes are applied.

Researchers tested whether LLMs can identify and exploit these vulnerabilities autonomously. The focus was on GPT-4, which far outperformed both other models and traditional cybersecurity tools.

  1. Method: The researchers built a benchmark of 15 real-world one-day vulnerabilities, sourced from the Common Vulnerabilities and Exposures (CVE) database and academic papers. These included vulnerabilities of high or critical severity across various software types, providing a realistic environment to test the LLMs’ effectiveness.

  2. Results: GPT-4 was able to exploit 87% of these vulnerabilities when provided with detailed CVE descriptions, whereas other models like GPT-3.5 and open-source LLMs failed to exploit any of the vulnerabilities.

  3. Comparison with Other Tools: GPT-4 outperformed traditional vulnerability scanners like ZAP and Metasploit, as well as GPT-3.5 and open-source models, all of which had a 0% success rate.

  4. Insights: GPT-4 depended heavily on the CVE descriptions: without them, its success rate plummeted to 7%. This suggests that current LLMs struggle to identify vulnerabilities without explicit guidance, which is an area for further development.
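
For a sense of scale, the percentages above can be translated back into counts on a 15-item benchmark. The counts used below (13 and 1) are inferred from the reported 87% and 7% and are an assumption, not figures quoted in this summary.

```python
TOTAL_VULNERABILITIES = 15  # benchmark size stated above


def success_rate(exploited: int, total: int = TOTAL_VULNERABILITIES) -> int:
    """Return the success rate as a whole-number percentage."""
    return round(100 * exploited / total)


# Assumed counts, inferred from the reported percentages.
with_descriptions = success_rate(13)
without_descriptions = success_rate(1)

print(with_descriptions, without_descriptions)
```

With so few items in the benchmark, each additional exploited vulnerability moves the success rate by almost 7 percentage points.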

Microsoft Expands its AI Influence in the Middle East

Microsoft has announced a $1.5 billion investment in G42, an AI firm based in the United Arab Emirates. Under the deal, G42 will expand its use of Microsoft Cloud and fully integrate its AI services with Microsoft Azure. The two companies will co-develop AI solutions tailored to sectors like healthcare, government, and energy, with a focus on underserved areas of the Middle East, Central Asia, and Africa.

As part of the agreement, G42 will migrate its AI operations, including its Arabic LLM named Jais, to Microsoft Azure. The partnership also includes an investment of $1 billion into a fund to develop AI skills in the region. The alliance positions Microsoft as a first mover in areas where the potential of AI remains largely untapped, while also addressing regional needs for advanced technology solutions.

LLMs as the Brain of the Operating System 🧠

Deploying LLM-based AI agents has always been challenging, with issues like inefficient scheduling, difficulty maintaining context in conversations, and trouble integrating various specialized agents. These challenges hinder the performance and scalability of such systems. To address them, researchers have developed AIOS, an LLM Agent Operating System that embeds LLMs into the operating system as its brain, enabling an operating system “with soul.” This integration can enhance operational capabilities and improve resource management, powering more sophisticated applications.

Key Highlights:

  1. Resource Allocation: AIOS improves the scheduling and allocation of computational resources among agents, which enhances overall system efficiency and reduces bottlenecks.

  2. Context Management: The system includes a context manager that not only maintains but can snapshot and restore interaction contexts, which is critical for ongoing dialogues and complex interactions requiring historical data.

  3. Concurrent Agent Execution: AIOS supports the simultaneous execution of multiple agents without loss of performance, ensuring high reliability and system stability even under varying operational loads.

  4. Comprehensive Access Control: A dedicated access manager module enforces strict privacy and access policies, securing sensitive operations and data within the system.

  5. External APIs: AIOS includes a tool manager that manages the interaction between agents and external APIs, allowing the agents to extend their capabilities by integrating various tools and services like simple data retrieval or complex scientific computing.
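
The scheduling and context-management ideas above can be sketched in a few lines. This is a toy illustration and not AIOS code: `AgentTask`, `ContextManager`, and `run_round_robin` are hypothetical names, round-robin is just one possible scheduling policy, and each "step" stands in for an LLM call.

```python
import copy
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentTask:
    name: str
    steps: int  # hypothetical: number of LLM calls the agent still needs
    context: list = field(default_factory=list)


class ContextManager:
    """Saves and restores each agent's interaction context between turns."""

    def __init__(self):
        self._snapshots = {}

    def snapshot(self, task: AgentTask) -> None:
        self._snapshots[task.name] = copy.deepcopy(task.context)

    def restore(self, task: AgentTask) -> None:
        task.context = copy.deepcopy(self._snapshots.get(task.name, []))


def run_round_robin(tasks, ctx: ContextManager):
    """Give each agent one step per turn until all agents have finished."""
    queue = deque(tasks)
    order = []
    while queue:
        task = queue.popleft()
        ctx.restore(task)  # resume from the last saved context
        task.context.append(f"{task.name}-step{len(task.context) + 1}")
        task.steps -= 1
        order.append(task.name)
        ctx.snapshot(task)  # save context before yielding to the next agent
        if task.steps > 0:
            queue.append(task)
    return order


agents = [AgentTask("planner", steps=2), AgentTask("coder", steps=1)]
order = run_round_robin(agents, ContextManager())
print(order)
```

Interleaving the agents keeps a long-running agent from blocking a short one, while the snapshots let each agent pick up exactly where it left off.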

😍 Enjoying so far, share it with your friends!

Tools of the Trade ⚒️

  1. GPT4All: An open-source project that provides tools and software to run powerful LLMs on your computer, without needing access to expensive GPU hardware or cloud services.

  2. LM Studio: A tool that lets you run Llama-3 and other open-source LLMs offline on your local PC. Download and install the LM Studio application, then use it to download and run any open-source LLM completely offline.

  3. Ollama: An open-source project that provides an easy way to download and run LLMs like Llama-3 on regular PCs. It integrates with the OpenWebUI project, which gives you a web-based user interface to interact with the local LLMs.
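
For readers who prefer a script to a web UI, here is a minimal sketch of querying a locally running Ollama server from Python. It assumes Ollama is installed and running on its default local port (11434) and that a model tagged `llama3` has already been downloaded; the helper names `build_generate_request` and `ask` are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming text generation request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3") -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Why is the sky blue?")` should return the model's answer as a string once the server is up. Nothing leaves your machine, since the request goes to localhost.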

Payman: AI that pays humans. It empowers AI agents with capital to create tasks they cannot complete independently, specifically those requiring human input. Humans can then access these tasks through a dedicated portal, complete them, and receive payment upon successful evaluation. This symbiotic system allows for efficient task completion and unlocks new possibilities for AI advancement through human collaboration.

AllMind AI: Your personal AI-powered financial investment analyst that helps you adjust your portfolio based on the latest news. It can perform sentiment analysis, trend analysis, and fundamental analysis based on the latest data, and give details on the risk attached to a stock. It claims to cut research time by 90% and costs by 98%.

🌟 Spotlight on You: Share Your AI Use Case and Get Featured!

At Unwind AI, we’re all about real-world applications of AI tools. Whether you’re simplifying daily tasks, enhancing your projects, or exploring new possibilities, we want to celebrate how you use AI.

Participate in just a few easy steps. Send the following to Unwind AI’s email [email protected]

We will feature your story in our newsletter as a detailed tutorial. It’s a great way to share your insights and get recognized.

We are eager to showcase your experiences and expand our collective understanding of practical AI applications. Let’s learn from each other and grow together!

Hot Takes 🔥

  1. About 5-10x more robotics companies will be started in the next 12 months. The first robots will be in the market in 2-3 years and will be expensive. ~Bindu Reddy

  2. Sense of urgency for an AI company right now is likely the number 1 predictor of success. ~Logan Kilpatrick

That’s all for today! See you tomorrow with more such AI-filled content.

Real-time AI Updates 🚨

⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!

PS: I curate this AI newsletter every day for FREE, your support is what keeps me going. If you find value in what you read, share it with your friends by clicking the share button below!
