Mellow Gemma 3 1B - Reasoning

A Gemma 3 1B model fine-tuned with GRPO for mathematical reasoning on the GSM8K dataset. The model generates explicit step-by-step reasoning before providing final answers.

Training

Base Model: Gemma 3 1B Instruct (4-bit)
Method: GRPO (Group Relative Policy Optimization)
Dataset: OpenAI GSM8K
LoRA Config: r=8, alpha=8, targeting attention and MLP layers
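
For reference, this configuration corresponds roughly to the following Unsloth setup. This is a sketch only: the base checkpoint name and the target module list (the standard Gemma attention and MLP projections) are assumptions, not taken from the actual training script.

from unsloth import FastModel

# Sketch of the LoRA setup described above; checkpoint and module names
# are assumptions rather than values from the published training script.
base_model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-1b-it",  # assumed instruct base
    max_seq_length=16000,
    load_in_4bit=True,
)
model = FastModel.get_peft_model(
    base_model,
    r=8,
    lora_alpha=8,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
)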

Training used multiple reward functions to enforce structured output format and answer accuracy.
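
The exact reward functions are not published here, but a minimal sketch in the style of TRL's GRPOTrainer (each function scores a batch of sampled completions and returns one float per completion) might look like the following. Function names, reward magnitudes, and the answer column are assumptions, and completions are assumed to be plain strings:

import re

def format_reward(completions, **kwargs):
    # Assumed reward: 1.0 when the completion follows the reasoning template
    # (shown under "Output Format" below), else 0.0.
    pattern = r"<start_working_out>.*?<end_working_out>\s*<SOLUTION>.*?</SOLUTION>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def accuracy_reward(completions, answer, **kwargs):
    # Assumed reward: 2.0 when the text inside <SOLUTION> matches the
    # reference answer from the dataset, else 0.0.
    scores = []
    for completion, reference in zip(completions, answer):
        match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", completion, re.DOTALL)
        extracted = match.group(1).strip() if match else None
        scores.append(2.0 if extracted == str(reference).strip() else 0.0)
    return scores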

Output Format

<start_working_out>
[Step-by-step reasoning]
<end_working_out>
<SOLUTION>[Answer]</SOLUTION>
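
Downstream code can recover the final answer from this template with a small parser; for example (a minimal sketch, with the helper name extract_solution chosen here for illustration):

import re

def extract_solution(text: str) -> str | None:
    # Return the text between <SOLUTION> tags, or None if the tags are absent.
    match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    return match.group(1).strip() if match else None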

Usage

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="colesmcintosh/mellow-gemma-3-reasoning",
    max_seq_length=16000,
)

system_prompt = """You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "If 5 apples cost $10, how much do 8 apples cost?"}
]

# Build the prompt with the chat template, generate, and decode the response.
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

License

Gemma License


Trained with Unsloth
