
Nemotron-H-4B-Instruct-128K

Model Developer: NVIDIA

Model Dates:

October 2024 - March 2025

Data Freshness:

September 2024

The pretraining data has a cutoff date of September 2024.

Model Overview

NVIDIA Nemotron-H-4B-Instruct-128K is a large language model (LLM) developed by NVIDIA, optimized for single and multi-turn chat, instruction following, and tool-calling use-cases. It uses a hybrid model architecture that consists primarily of Mamba-2 and MLP layers combined with just four Attention layers. The model is an aligned version of Nemotron-H-4B-Base-8K, and features a 128K context length. The supported languages include: English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese.

The model underwent a multi-phase post-training process including multiple supervised fine-tuning stages for math, code, science, and then chat, instruction following, and tool-calling, followed by multiple preference tuning stages using Reward-aware Preference Optimization (RPO) for both chat and instruction-following.

The base model was pruned and distilled from Nemotron-H-Base-8K using our hybrid language model compression technique. For more details, please refer to the paper.

The paper has been accepted for publication at NeurIPS 2025.

This model is for research and development only.

License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the NVIDIA Internal Scientific Research and Development Model License

Model Architecture

  • Architecture Type: Transformer
  • Network Architecture: Nemotron-Hybrid

The model has 4 billion parameters.

Deployment Geography: Global

Use Case: This model is intended for developers designing AI Agent systems, chatbots, RAG systems, and other AI-powered applications. This model is also suitable for typical instruction-following tasks.

Release Date:

Hugging Face: 10/23/2025 via https://huggingface.co/

Input

  • Input Type(s): Text
  • Input Format(s): String
  • Input Parameters: One-Dimensional (1D): Sequences
  • Other Properties Related to Input: Context length up to 128K. Supported languages include German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English.

Output

  • Output Type(s): Text
  • Output Format(s): String
  • Output Parameters: One-Dimensional (1D): Sequences

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

  • Runtime Engine(s): NeMo 24.12
  • Supported Hardware Microarchitecture Compatibility: NVIDIA H100-80GB, NVIDIA A100
  • Operating System(s): Linux

Model Version

  • v1.0

References

[2504.11409] Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

Prompt Format

<SPECIAL_10>System\n{system prompt}\n<SPECIAL_11>User\n{user prompt}\n<SPECIAL_11>Assistant\n

Note: A newline must follow the final Assistant tag; it serves as the generation prompt.
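As an illustration, the template above can be assembled in plain Python. This is a minimal sketch; the special-token strings and the trailing newline are taken verbatim from the template:

```python
# Minimal sketch: build a Nemotron-H chat prompt from the template above.
# The <SPECIAL_10>/<SPECIAL_11> token strings are copied from the model card.
def build_prompt(system_prompt: str, user_prompt: str) -> str:
    return (
        f"<SPECIAL_10>System\n{system_prompt}\n"
        f"<SPECIAL_11>User\n{user_prompt}\n"
        "<SPECIAL_11>Assistant\n"  # trailing newline acts as the generation prompt
    )

prompt = build_prompt("You are a helpful assistant.", "What is Mamba-2?")
print(prompt)
```

In practice, `tokenizer.apply_chat_template` (shown in the Example section) applies this same template for you.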

Training, Testing, and Evaluation Datasets

The data for post-training phases is a compilation of supervised fine-tuning and preference tuning data for improving math, code, science, chat, tool-calling, and instruction following capabilities.

Data Collection for Training & Testing Datasets: Hybrid: Automated, Human, Synthetic

Data Labeling for Training & Testing Datasets: Hybrid: Automated, Human, Synthetic

Evaluation Datasets

We used the datasets listed in the next section to evaluate the model.

Data Collection for Evaluation Datasets: Hybrid: Automated, Human, Synthetic

Data Labeling for Evaluation Datasets: Hybrid: Automated, Human, Synthetic

Chat & Instruction Following Evaluations:

  • MT-Bench 0-shot: 7.9
  • IFEval Strict Average 0-shot: 76.24

MT-Bench - A set of 80 multi-turn, open-ended questions for evaluating chat abilities. We use GPT-4-Turbo as the judge model. Dataset & Code

IFEval - Contains 500 verifiable instructions to test instruction following abilities of language models. We report the average of prompt and instruction level scores in the strict category. Dataset

Prompt:

<SPECIAL_10>System

<SPECIAL_11>User
{question}
<SPECIAL_11>Assistant

Coding Evaluations:

  • MBPP 0-shot: 78.6
  • MBPP+ 0-shot: 68.25
  • HumanEval 0-shot: 76.2
  • HumanEval+ 0-shot: 70.85

MBPP - Evaluates ability to generate solutions for Python programming tasks. Dataset

MBPP+ - Extended version of MBPP with additional tests. Dataset

HumanEval - Tests code generation and completion abilities in Python. Dataset

HumanEval+ - Extended version of HumanEval with additional tests. Dataset

Prompt:

<SPECIAL_10>System

<SPECIAL_11>User
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
Here is the given problem and test examples:
{question}
Please use the python programming language to solve this problem.
Please make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples.
Please return all completed codes in one code block.
This code block should be in the following format:
```python
# Your codes here```
<SPECIAL_11>Assistant

Math Evaluations:

  • GSM8K 0-shot: 88.93
  • MATH-500 0-shot: 76.4

GSM8K - Evaluates grade school level mathematical word problem solving. Dataset

MATH-500 - A subset of 500 questions from the MATH benchmark. Dataset

Prompt:

<SPECIAL_10>System

<SPECIAL_11>User
Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.
Question: {question}
<SPECIAL_11>Assistant

Tool-Calling Evaluations:

  • BFCL v2 Live Overall Accuracy 0-shot: 65.88

BFCL v2 Live - Evaluates tool-calling ability of language models over multiple categories in real-world scenarios. Dataset

Prompt:

<SPECIAL_10>System
<AVAILABLE_TOOLS>[{"name": "func_name1", "description": "func_desc1", "parameters": {"type": "dict", "required": ["param1"], "properties": {"param1": {"type": "param_type1", "description": "param_desc1", "default": "default_value1"}}}}, {"name": "func_name2",...]</AVAILABLE_TOOLS>
<SPECIAL_11>User
{question}
<SPECIAL_11>Assistant

Tool Call Response Format:

<TOOLCALL>[{"name": "func_name1", "arguments": {"params_name1": "params_value1", "params_name2": "params_value2"}}, {"name": "func_name2", "arguments": {"params_name1": "params_value1", "params_name2": "params_value2"}}]</TOOLCALL>
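As a sketch, the tool calls can be extracted from a model response with a regex and `json.loads`. This assumes the response wraps a JSON list in `<TOOLCALL>…</TOOLCALL>` exactly as in the format above; the example tool name is hypothetical:

```python
import json
import re

# Sketch: pull the JSON list of tool calls out of a model response.
# Assumes calls are wrapped in <TOOLCALL>...</TOOLCALL> as shown above.
def parse_tool_calls(response: str) -> list:
    match = re.search(r"<TOOLCALL>(.*?)</TOOLCALL>", response, re.DOTALL)
    if match is None:
        return []  # response contains no tool call
    return json.loads(match.group(1))

# Hypothetical example response
response = '<TOOLCALL>[{"name": "get_weather", "arguments": {"city": "Berlin"}}]</TOOLCALL>'
for call in parse_tool_calls(response):
    print(call["name"], call["arguments"])
```

A production parser would additionally validate the call names and arguments against the schema passed in `<AVAILABLE_TOOLS>`.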

General Evaluations:

  • MMLU 0-shot (Generative): 66.96

MMLU - Tests knowledge across 57 subjects including science, humanities, math and more. Dataset

Prompt:

<SPECIAL_10>System

<SPECIAL_11>User
Below is a multi-choice question about {subject}. You must reply with only a single letter (either A, B, C or D).

Question: {question}
<SPECIAL_11>Assistant

Potential Known Risks for Usage

The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. It may therefore amplify those biases and return toxic responses, especially when given toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even when the prompt contains nothing explicitly offensive. Code produced by the model may not reflect real-world contexts and should be reviewed before use. The model is also susceptible to alignment-breaking attacks. Users are advised to deploy language model guardrails alongside this model to prevent potentially harmful outputs.

Inference

  • Engine: NeMo
  • Test Hardware: NVIDIA H100-80GB

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Responsible Use Guide available at http://nvidia.com/nemotron-responsible-use.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-H-4B-Instruct-128K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-H-4B-Instruct-128K", torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

# Use the prompt template
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate a response; cap new tokens so generation is not cut off at the default limit
outputs = model.generate(tokenized_chat, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))