Diffusion Model Beats ChatGPT in Coding 💪

PLUS: 30T Tokens Open Dataset for LLM Training, Biden's New Executive Order on AI Safety

Today’s top AI Highlights:

  1. CodeFusion: A Pre-trained Diffusion Model for Code Generation

  2. RedPajama-Data-v2: Open Dataset with 30 Trillion Tokens for Training LLMs

  3. President Biden Takes Aim at AI Safety with New Executive Order

  4. Zephyr-7B Beats 7B Models, Rivals 70B Model Performance

  5. Transform Your RAG Apps with Voyage AI’s Model and API

& so much more!

Read time: 3 mins

Latest Developments 🌍

75M Parameter Diffusion Model Excels in Coding

Researchers at Microsoft have introduced CodeFusion, a new pre-trained diffusion model for code generation that overcomes limitations in existing auto-regressive models. The model can produce diverse and accurate code across multiple programming languages.

Key Highlights:

  • CodeFusion combines an encoder-decoder architecture with a diffusion process, enabling iterative denoising of complete code programs conditioned on natural language input.

  • CodeFusion performs competitively in top-1 accuracy with much larger auto-regressive models across Python, Bash, and Excel conditional formatting (CF) rules. It outperforms them in top-3 and top-5 accuracy, indicating improved diversity in generated code.

  • The 75M parameter model yields significantly more syntactically valid code generations compared to diffusion models not initially designed for code generation, with an increase of 33.8% over Diffusion-LM and 26.2% over GENIE across the evaluated languages.
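To make the top-1 versus top-3 and top-5 comparison concrete, here is a minimal sketch of how top-k accuracy can be computed. The toy data, the `exact_match` check, and all function names are illustrative assumptions, not the paper's evaluation code, which may score correctness differently.

```python
from typing import Callable, Sequence


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    is_correct: Callable[[int, str], bool],
    k: int,
) -> float:
    """Fraction of tasks where at least one of the first k candidates is correct."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not ranked_candidates:
        return 0.0
    hits = sum(
        1
        for task_index, candidates in enumerate(ranked_candidates)
        if any(is_correct(task_index, c) for c in candidates[:k])
    )
    return hits / len(ranked_candidates)


# Hypothetical toy data: three tasks, each with three ranked generations.
references = ["ls -la", "sorted(xs)", "=A1>10"]
generations = [
    ["ls -la", "ls -l", "ls"],          # correct at rank 1
    ["xs.sort()", "sorted(xs)", "xs"],  # correct at rank 2
    ["=A1<10", "=A1>=10", "=A1=10"],    # never correct
]


def exact_match(task_index: int, candidate: str) -> bool:
    # Stand-in correctness check (assumption): string equality with the reference.
    return candidate == references[task_index]


top1 = top_k_accuracy(generations, exact_match, k=1)
top3 = top_k_accuracy(generations, exact_match, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

A model that produces varied candidates has more chances to land a correct one within the first k, which is why a gain at top-3 and top-5 is read as a sign of diversity.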

An interesting aside: the paper lists GPT-3.5 (ChatGPT) at only 20B parameters, far smaller than most people had assumed. OpenAI has not confirmed this figure.

Open Dataset with 30T Tokens for Training LLMs 🐏

Together AI has released the second version of the RedPajama dataset with 30 trillion tokens, marking a significant stride in training LLMs. The dataset is filtered and deduplicated from 100+ trillion raw tokens across 84 CommonCrawl dumps, covering five languages: English, French, Spanish, German, and Italian.

Key Highlights:

  • RedPajama-V2 provides 40+ quality annotations, including natural language indicators, repetitive text signals, content-based quality signals, ML-based quality signals, and deduplication signals using Minhash signatures.

  • Deduplication cuts the token count by 60% but the number of documents by 71%, which suggests that the removed tail documents tend to be shorter.

  • The data is processed with the CCNet pipeline, which preserves as much of the raw data as possible and yields 113 billion individual text documents in the five targeted languages.
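For readers unfamiliar with MinHash-based deduplication signals, the idea can be sketched in a few lines of standard-library Python. This is a toy illustration only: the shingle size, number of hash functions, threshold, and every function name are assumptions, and it is not the RedPajama-V2 pipeline.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 128   # assumed signature length
SHINGLE_SIZE = 3   # assumed word n-gram size


def shingles(text: str, size: int = SHINGLE_SIZE) -> Set[str]:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def _hash(shingle: str, seed: int) -> int:
    # One seeded 64-bit hash per "permutation", via BLAKE2b's salt parameter.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> List[int]:
    """Keep the minimum hash value of the shingle set under each seed."""
    items = shingles(text)
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The share of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: List[str], threshold: float = 0.4) -> List[int]:
    """Return indices of documents kept, dropping near-duplicates of earlier ones."""
    kept: List[int] = []
    kept_sigs: List[List[int]] = []
    for index, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(index)
            kept_sigs.append(sig)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",   # near-duplicate of the first
    "completely unrelated sentence about training large language models",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate of the first
]
kept_indices = deduplicate(docs)
print(kept_indices)
```

The all-pairs comparison here is only workable for a handful of documents. At web scale, signatures are typically bucketed with locality-sensitive hashing so that only likely duplicates are compared.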

Biden’s landmark Executive Order on AI Safety

U.S. President Biden has issued an executive order establishing new standards for AI safety and security, including a requirement that companies share safety test results with the U.S. government before deploying AI models to the public.

Key Highlights:

  • The order invokes the Defense Production Act and targets any foundation model that might pose a serious risk to national security, economic security, or public health.

  • The National Institute of Standards and Technology (NIST) is tasked with developing new standards for extensive red-team testing before AI systems are publicly released.

  • While the order provides guidelines for AI developers, it's unclear how enforceable it is without further legislative changes. The order calls on Congress to pass bipartisan data privacy legislation to protect Americans' data.

The AI Model That Understands You Better

The Hugging Face team has released Zephyr-7B, a compact language model designed to align with user intent. Trained with distilled direct preference optimization (dDPO) and no human annotation, Zephyr-7B sets a new state of the art for 7B parameter chat models and performs comparably with even 70B parameter models, including Llama 2-Chat.

The research focuses on aligning smaller open language models with user intent and highlights the role of AI Feedback (AIF) data, meaning preference data ranked by a model rather than by human annotators. With dDPO, Zephyr-7B shows gains in both conversational ability and academic task performance.
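As a rough illustration of what preference optimization computes, here is a minimal per-example sketch of the standard DPO loss. It assumes dDPO applies this same objective to AI-ranked preference pairs; the function name, the `beta` value, and the sample log-probabilities are illustrative, not taken from the Zephyr code.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed strength of the pull toward the reference model
) -> float:
    """Standard DPO loss for one (prompt, chosen, rejected) triple.

    Inputs are summed token log-probabilities of each full response under
    the trainable policy and under a frozen reference model.
    """
    # How much more the policy favors chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: loss equals log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy shifts probability toward the chosen response: loss drops below log(2).
aligned = dpo_loss(-4.0, -8.0, -5.0, -5.0)
# Policy shifts probability toward the rejected response: loss rises above log(2).
misaligned = dpo_loss(-8.0, -4.0, -5.0, -5.0)
print(neutral, aligned, misaligned)
```

In practice the loss is averaged over a batch and the log-probabilities come from the models themselves. The point of the sketch is only that training needs ranked pairs, not a separate reward model.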

Tools of the Trade ⚒️

  • Voyage AI: Builds customized vectorization (embedding) models for better retrieval quality, aimed at improving RAG systems and reducing hallucination, with a modular design that adapts across industries.

  • FocusFusion: Productivity tool that automatically rates app usage, provides detailed activity reports, sets daily limits, and sends real-time alerts to help you optimize your workflow and minimize distractions.

  • ConvertMate: AI-powered platform that optimizes product descriptions and continuously improves all aspects of your product page for enhanced conversion rates, revenue growth and time savings.

  • Frequentli: Quickly generate accurate FAQs for websites with AI that utilizes content of the website or uploaded documentation, in just a few simple steps.

  • SightX’s Ada: A generative AI consultant that streamlines market research by providing custom-tailored surveys, recommendations on sample sizes, and generating marketing assets, all fueled by AI-driven insights.

😍 Enjoying it so far? TWEET NOW to share with your friends!

Hot Takes 🔥

  • Announcement: I'll be investing $3B into Anthropic. It seems like the thing to do ~ Naveen Rao

  • Let me get something straight. The folks who have been worried about AI safety consistently since 2015 -- 3 years before GPT and 7 years before ChatGPT -- have been using it this whole time as a tool for regulatory capture? ~ Mark Chen

  • The AI Executive Order is a bit ridiculous and pretty hard to enforce. ~ Bindu Reddy

Meme of the Day 🤡

[Image: meme from r/programmingmemes on C++ and JavaScript]

That’s all for today!

See you tomorrow with more such AI-filled content. Don’t forget to subscribe and give your feedback below 👇

Real-time AI Updates 🚨

⚡️ Follow me on Twitter @Saboo_Shubham for lightning-fast AI updates and never miss what’s trending!!

PS: I curate this AI newsletter every day for FREE, and your support is what keeps me going. If you find value in what you read, share it with your friends by clicking the share button below!
