ftajwar's Collections

MaxRL

updated 1 day ago

Qwen3-Base post-trained checkpoints for our paper, Maximum Likelihood Reinforcement Learning [https://zanette-labs.github.io/MaxRL/]

  • ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps

    Text Generation • 4B • Updated 1 day ago • 19

  • ftajwar/qwen3_4B_Base_GRPO_Polaris_1000_steps

    Text Generation • 4B • Updated 1 day ago • 19

  • ftajwar/qwen3_1.7B_Base_MaxRL_Polaris_1000_steps

    Text Generation • 2B • Updated 1 day ago • 14

  • ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps

    Text Generation • 2B • Updated 1 day ago • 15
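
The collection page itself gives no loading instructions. As a minimal sketch, assuming these checkpoints load through the standard transformers causal-LM classes (they are tagged Text Generation), one of them could be pulled and sampled as below. The helper names `checkpoint_id` and `generate` and the example prompt are hypothetical; only the repo IDs come from the listing above.

```python
"""Sketch: look up and sample from one of the listed checkpoints."""

# Repo IDs copied verbatim from the collection listing.
CHECKPOINTS = {
    ("4B", "MaxRL"): "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps",
    ("4B", "GRPO"): "ftajwar/qwen3_4B_Base_GRPO_Polaris_1000_steps",
    ("1.7B", "MaxRL"): "ftajwar/qwen3_1.7B_Base_MaxRL_Polaris_1000_steps",
    ("1.7B", "GRPO"): "ftajwar/qwen3_1.7B_Base_GRPO_Polaris_1000_steps",
}


def checkpoint_id(size: str, method: str) -> str:
    """Return the Hub repo ID for a model size and training method."""
    return CHECKPOINTS[(size, method)]


def generate(repo_id: str, prompt: str, max_new_tokens: int = 256) -> str:
    """Download a checkpoint from the Hub and return its completion of `prompt`.

    Assumption: the repos contain standard transformers weights and tokenizer
    files, so the Auto* classes can resolve them.
    """
    # Imported lazily so the lookup helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads several GB of weights on first use):
# print(generate(checkpoint_id("1.7B", "MaxRL"), "Compute 12 * 13. Answer:"))
```

Because the MaxRL and GRPO checkpoints are listed side by side at each model size, calling `generate` with the same prompt on both repo IDs is one simple way to compare them.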