Mellow Gemma 3 1B - Reasoning
A Gemma 3 1B model fine-tuned with GRPO for mathematical reasoning on the GSM8K dataset. The model generates explicit step-by-step reasoning before providing final answers.
Training
Base Model: Gemma 3 1B Instruct (4-bit)
Method: GRPO (Group Relative Policy Optimization)
Dataset: OpenAI GSM8K
LoRA Config: r=8, alpha=8, targeting attention and MLP layers  
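To make the LoRA numbers concrete, the sketch below works out what r=8 and alpha=8 imply. It assumes standard LoRA scaling (alpha / r, not the rank-stabilized variant), and the 1024x1024 projection size is a hypothetical example, not a dimension taken from Gemma 3.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Standard LoRA scales the low-rank update B @ A by alpha / r."""
    return alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Adapter parameters for one weight matrix: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


scaling = lora_scaling(r=8, alpha=8)

# Hypothetical square projection, used only to illustrate the ratio.
d_in, d_out = 1024, 1024
adapter_params = lora_trainable_params(d_in, d_out, r=8)
full_params = d_in * d_out

print(f"scaling factor: {scaling}")
print(f"adapter params: {adapter_params} vs full matrix: {full_params}")
```

With alpha equal to r, the adapter update is applied unscaled, and each adapted matrix adds only a small fraction of the parameters of the full weight.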
Training combined multiple reward functions that score adherence to the structured output format and the correctness of the final answer.
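The exact reward functions are not listed here, so the following is only a minimal sketch of how a format reward and an answer reward might be combined and turned into group-relative advantages. The function names, the reward values (1.0 and 2.0), and the use of the population standard deviation are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev

# Matches the documented layout: working-out block, then a SOLUTION block at the end.
FORMAT_RE = re.compile(
    r"<start_working_out>.+?<end_working_out>\s*<SOLUTION>(.+?)</SOLUTION>\s*$",
    re.DOTALL,
)


def format_reward(completion: str) -> float:
    """Hypothetical reward: 1.0 if the completion follows the tag layout."""
    return 1.0 if FORMAT_RE.search(completion) else 0.0


def answer_reward(completion: str, reference: str) -> float:
    """Hypothetical reward: 2.0 if the extracted solution equals the reference."""
    match = FORMAT_RE.search(completion)
    if match is None:
        return 0.0
    return 2.0 if match.group(1).strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of sampled completions for one prompt whose reference answer is "16".
group = [
    "<start_working_out>10 / 5 = 2, 2 * 8 = 16<end_working_out><SOLUTION>16</SOLUTION>",
    "<start_working_out>10 + 8 = 18<end_working_out><SOLUTION>18</SOLUTION>",
    "The answer is 16.",
]
rewards = [format_reward(c) + answer_reward(c, "16") for c in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group, a completion is pushed up only when it scores better than its siblings for the same prompt.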
Output Format
<start_working_out>
[Step-by-step reasoning]
<end_working_out>
<SOLUTION>[Answer]</SOLUTION>
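Generated text in this format can be split into its two parts with a pair of regular expressions. This is a minimal sketch; `ParsedOutput` and `parse_output` are hypothetical helper names, not part of the model or of Unsloth.

```python
import re
from typing import NamedTuple, Optional


class ParsedOutput(NamedTuple):
    reasoning: Optional[str]
    solution: Optional[str]


_REASONING_RE = re.compile(r"<start_working_out>(.*?)<end_working_out>", re.DOTALL)
_SOLUTION_RE = re.compile(r"<SOLUTION>(.*?)</SOLUTION>", re.DOTALL)


def parse_output(text: str) -> ParsedOutput:
    """Extract the working-out and the solution; a missing part becomes None."""
    reasoning = _REASONING_RE.search(text)
    solution = _SOLUTION_RE.search(text)
    return ParsedOutput(
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1).strip() if solution else None,
    )


example = (
    "<start_working_out>\n"
    "One apple costs 10 / 5 = 2 dollars, so 8 apples cost 8 * 2 = 16 dollars.\n"
    "<end_working_out>\n"
    "<SOLUTION>16</SOLUTION>"
)
parsed = parse_output(example)
print(parsed.solution)
```

Returning None for a missing part lets calling code distinguish a malformed generation from an empty answer.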
Usage
from unsloth import FastModel
# Load the fine-tuned model and its tokenizer
model, tokenizer = FastModel.from_pretrained(
    model_name="colesmcintosh/mellow-gemma-3-reasoning",
    max_seq_length=16000,
)
# The system prompt asks for the tag format described under Output Format
system_prompt = """You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "If 5 apples cost $10, how much do 8 apples cost?"}
]
# Render the chat template to a string, then tokenize it and generate on the GPU
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
outputs = model.generate(**tokenizer(text, return_tensors="pt").to("cuda"), max_new_tokens=256)
# Decode the generated token IDs back to text (the prompt is included in the output)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
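To check a generated answer against a GSM8K reference, note that GSM8K answer fields end with the final value after a `####` marker. The sketch below compares a predicted solution string with such a field. The helper names and the normalization rules (stripping commas and dollar signs) are assumptions for illustration, not an official evaluation procedure.

```python
from typing import Optional


def gsm8k_gold(answer_field: str) -> str:
    """Return the final answer that follows the '####' marker in a GSM8K answer."""
    return answer_field.split("####")[-1].strip()


def normalize_number(text: str) -> Optional[float]:
    """Parse a numeric string, ignoring thousands separators and dollar signs."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def is_correct(predicted: str, answer_field: str) -> bool:
    """True when the prediction and the gold answer parse to the same number."""
    pred = normalize_number(predicted)
    gold = normalize_number(gsm8k_gold(answer_field))
    return pred is not None and gold is not None and abs(pred - gold) < 1e-6


# Reference text written in the GSM8K style for the apple example above.
reference = "Each apple costs 10 / 5 = 2 dollars. 8 apples cost 8 * 2 = 16 dollars.\n#### 16"
print(is_correct("$16", reference))
```

Comparing parsed numbers rather than raw strings avoids penalizing harmless differences such as "16" versus "16.00".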
License
Gemma License - see terms
Trained with Unsloth
Model tree for stay-mellow-ai/gemma-3-1b-reasoning
Base model: google/gemma-3-1b-pt
Finetuned: google/gemma-3-1b-it
Quantized: unsloth/gemma-3-1b-it-unsloth-bnb-4bit