The course teaches how to fine-tune LLMs using Group Relative Policy Optimization (GRPO)—a reinforcement learning method that improves model reasoning with minimal data. Learn RFT concepts, reward design, LLM-as-a-judge evaluation, and deploy jobs on the Predibase platform.


Welcome to the Reinforcement Fine-Tuning LLMs with GRPO course, built in partnership with Predibase. This course offers a deep dive into how Group Relative Policy Optimization (GRPO) can be used to fine-tune LLMs for enhanced reasoning—even with small datasets.


📘 Course Summary

Fine-tune language models with reinforcement learning (RL) techniques that go beyond traditional supervised methods. Learn how to use GRPO, a powerful and efficient RL algorithm, to guide LLMs in tasks requiring multi-step reasoning like coding and math.

What You’ll Learn

  1. 🚦 Reinforcement Fine-Tuning (RFT): Understand when and why to use RFT over supervised fine-tuning—especially for strategy-based tasks.
  2. 🧮 GRPO Algorithm: Explore how GRPO leverages programmable reward functions and statistical techniques (advantage estimation, clipping, KL-penalty) to steer model behavior; a minimal sketch follows this list.
  3. 🧠 Reward Design: Craft reward functions that align model outputs with task goals, and add penalty terms to discourage reward hacking.
  4. 🤖 LLM-as-a-Judge: Learn how to evaluate subjective tasks by using a language model to rate output quality without human annotators.
  5. 🎮 Wordle with GRPO: See how Wordle is framed as a reinforcement learning task that teaches LLMs planning and adaptation to feedback.
  6. 🚀 Train on Predibase: Learn how to launch and manage fine-tuning jobs using Predibase’s hosted RL infrastructure.
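
To make the GRPO and reward-design items above concrete, here is a minimal, self-contained PyTorch sketch. It is illustrative only: the Wordle-style scoring values, the function names (`guess_reward`, `grpo_advantages`, `grpo_loss`), and the hyperparameters `eps` and `beta` are assumptions for this sketch, not the course's actual code.

```python
import torch

# Toy Wordle-style reward (illustrative values): feedback is a 5-character
# string of G (green), Y (yellow), X (gray).
def guess_reward(guess: str, feedback: str) -> float:
    reward = 0.0
    if len(guess) == 5 and guess.isalpha():
        reward += 1.0                        # well-formed 5-letter guess
    reward += 0.5 * feedback.count("G")      # right letter, right spot
    reward += 0.2 * feedback.count("Y")      # right letter, wrong spot
    if guess != guess.lower():
        reward -= 0.5                        # penalty term against format hacking
    return reward

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # "Group relative": standardize each reward against the group of
    # completions sampled for the same prompt; no learned value model needed.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    # PPO-style clipped surrogate objective over the policy ratio...
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    # ...plus a KL penalty toward the frozen reference model
    # (the unbiased k3 estimator used in the GRPO paper).
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(surrogate - beta * kl).mean()

# Usage: score a group of sampled guesses, then standardize within the group.
rewards = torch.tensor([guess_reward("crane", "GYXXX"),
                        guess_reward("CRANE", "GYXXX"),
                        guess_reward("aaaaa", "XXXXX")])
advantages = grpo_advantages(rewards)
```

Because each advantage is measured against sibling completions for the same prompt, a guess is reinforced only when it beats its group's average; this group-relative baseline is what removes the need for a separate value network.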

🔑 Key Points

  • 🧠 RFT enables strategic learning without requiring millions of examples.
  • 📊 GRPO scales more readily than traditional RLHF: programmable reward functions give deterministic scores, so no separate reward model has to be trained.
  • ⚙️ Token-level control with KL-divergence and clipping enables stable training.
  • 🤖 LLM-as-a-Judge offers scalable evaluation for subjective tasks (see the sketch after this list).
  • 🛠️ Predibase platform support simplifies fine-tuning pipelines.
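
For subjective tasks with no programmable reward, the judge pattern looks roughly like the sketch below. The prompt wording, the 1-to-10 scale, and `query_judge` (a stand-in for whatever chat-completion client you use) are placeholders for this illustration, not the course's actual judge setup.

```python
import re

# Illustrative judge prompt; the exact wording and scale are assumptions.
JUDGE_PROMPT = """Rate the following answer from 1 to 10 for correctness and clarity.
Reply with only the number.

Question: {question}
Answer: {answer}"""

def judge_reward(question: str, answer: str, query_judge) -> float:
    # `query_judge` maps a prompt string to the judge model's text reply.
    reply = query_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\d+", reply)
    # Parse the score and scale to [0, 1]; score 0 if the reply is unparseable.
    return min(int(match.group()), 10) / 10.0 if match else 0.0
```

In training, a judge reward like this plugs into the same pipeline as any programmable reward function, so the judge's scores feed directly into the group-relative advantage computation.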

👨‍🏫 About the Instructors

Travis Addair
Co-Founder and CTO at Predibase

Arnav Garg
Senior Machine Learning Engineer at Predibase

Together, they bring deep expertise in large-scale model training, open-source infrastructure, and reinforcement learning applications in production.


🚀 Get started with reinforcement fine-tuning today:
📚 deeplearning.ai
