Nemotron-H-4B-Instruct-128K
Model Developer: NVIDIA
Model Dates:
October 2024 - March 2025
Data Freshness:
September 2024
The pretraining data has a cutoff date of September 2024.
Model Overview
NVIDIA Nemotron-H-4B-Instruct-128K is a large language model (LLM) developed by NVIDIA, optimized for single- and multi-turn chat, instruction following, and tool-calling use cases. It uses a hybrid model architecture that consists primarily of Mamba-2 and MLP layers combined with just four Attention layers. The model is an aligned version of Nemotron-H-4B-Base-8K and features a 128K context length. The supported languages are English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese.
The model underwent a multi-phase post-training process including multiple supervised fine-tuning stages for math, code, science, and then chat, instruction following, and tool-calling, followed by multiple preference tuning stages using Reward-aware Preference Optimization (RPO) for both chat and instruction-following.
The base model was pruned and distilled from Nemotron-H-Base-8K using our hybrid language model compression technique. For more details, please refer to the paper.
The paper has been accepted for publication at NeurIPS 2025.
This model is for research and development only.
License/Terms of Use
GOVERNING TERMS: Use of this model is governed by the NVIDIA Internal Scientific Research and Development Model License
Model Architecture
- Architecture Type: Hybrid Mamba-Transformer
- Network Architecture: Nemotron-Hybrid
This model has 4B parameters.
Deployment Geography: Global
Use Case: This model is intended for developers designing AI Agent systems, chatbots, RAG systems, and other AI-powered applications. This model is also suitable for typical instruction-following tasks.
Release Date:
Hugging Face: 10/23/2025 via https://huggingface.co/
Input
- Input Type(s): Text
- Input Format(s): String
- Input Parameters: One-Dimensional (1D): Sequences
- Other Properties Related to Input: Context length up to 128K. Supported languages include German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English.
Output
- Output Type(s): Text
- Output Format: String
- Output Parameters: One-Dimensional (1D): Sequences
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
- Runtime Engine(s): NeMo 24.12
- Supported Hardware Microarchitecture Compatibility: NVIDIA H100-80GB, NVIDIA A100
- Operating System(s): Linux
Model Version
- v1.0
References
[2504.11409] Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning
Prompt Format
<SPECIAL_10>System\n{system prompt}\n<SPECIAL_11>User\n{user prompt}\n<SPECIAL_11>Assistant\n
Note: A newline must follow the final Assistant tag so that it serves as the generation prompt.
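The template above can be rendered with a small helper. This is a minimal sketch for a single turn only: `build_prompt`, `SYSTEM_TAG`, and `TURN_TAG` are illustrative names, not part of any released tooling, and the handling of an empty system prompt (emitting only the System tag line, as the evaluation prompts in this card appear to do) is an assumption.

```python
SYSTEM_TAG = "<SPECIAL_10>System"
TURN_TAG = "<SPECIAL_11>"


def build_prompt(user_prompt: str, system_prompt: str = "") -> str:
    """Render a single-turn prompt in the format described above."""
    lines = [SYSTEM_TAG]
    if system_prompt:
        # Assumption: an empty system prompt leaves only the System tag line.
        lines.append(system_prompt)
    lines.append(f"{TURN_TAG}User")
    lines.append(user_prompt)
    lines.append(f"{TURN_TAG}Assistant")
    # The trailing newline after "Assistant" acts as the generation prompt.
    return "\n".join(lines) + "\n"


prompt = build_prompt("What is the capital of France?", "You are a helpful assistant.")
print(repr(prompt))
```

In practice, `tokenizer.apply_chat_template` with `add_generation_prompt=True` is the safer route, since the tokenizer's own template is the reference for this format.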
Training, Testing, and Evaluation Datasets
The data for post-training phases is a compilation of supervised fine-tuning and preference tuning data for improving math, code, science, chat, tool-calling, and instruction following capabilities.
Data Collection for Training & Testing Datasets: Hybrid: Automated, Human, Synthetic
Data Labeling for Training & Testing Datasets: Hybrid: Automated, Human, Synthetic
Evaluation Datasets
We used the datasets listed in the next section to evaluate the model.
Data Collection for Evaluation Datasets: Hybrid: Automated, Human, Synthetic
Data Labeling for Evaluation Datasets: Hybrid: Automated, Human, Synthetic
Chat & Instruction Following Evaluations:
| MT-Bench 0-shot | IFEval Strict Average 0-shot |
|---|---|
| 7.9 | 76.24 |
MT-Bench - A set of 80 multi-turn, open-ended questions for evaluating chat abilities. We use GPT-4-Turbo as the judge model. Dataset & Code
IFEval - Contains 500 verifiable instructions to test instruction following abilities of language models. We report the average of prompt and instruction level scores in the strict category. Dataset
Prompt:
<SPECIAL_10>System
<SPECIAL_11>User
{question}
<SPECIAL_11>Assistant
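The IFEval figure above is the mean of the prompt-level and instruction-level strict accuracies. The sketch below shows that arithmetic under the usual IFEval definitions (a prompt counts only if all of its instructions are followed; instructions are also counted individually). The function name and the input layout, one list of booleans per prompt, are hypothetical and do not reflect the actual evaluation harness.

```python
from typing import List


def ifeval_strict_average(results: List[List[bool]]) -> float:
    """Average prompt-level and instruction-level strict accuracy, in percent.

    results[i][j] is True when instruction j of prompt i was followed
    under strict checking.
    """
    if not results:
        raise ValueError("results must contain at least one prompt")
    total_instructions = sum(len(r) for r in results)
    if total_instructions == 0:
        raise ValueError("results must contain at least one instruction")
    # Prompt-level: every instruction attached to the prompt must pass.
    prompt_acc = sum(all(r) for r in results) / len(results)
    # Instruction-level: each instruction is scored on its own.
    instruction_acc = sum(sum(r) for r in results) / total_instructions
    return 100.0 * (prompt_acc + instruction_acc) / 2.0


# Three prompts: 2 of 3 prompts pass fully, 4 of 5 instructions pass.
score = ifeval_strict_average([[True, True], [True, False], [True]])
print(round(score, 2))  # 73.33
```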
Coding Evaluations:
| MBPP 0-shot | MBPP+ 0-shot | HumanEval 0-shot | HumanEval+ 0-shot |
|---|---|---|---|
| 78.6 | 68.25 | 76.2 | 70.85 |
MBPP - Evaluates ability to generate solutions for Python programming tasks. Dataset
MBPP+ - Extended version of MBPP with additional tests. Dataset
HumanEval - Tests code generation and completion abilities in Python. Dataset
HumanEval+ - Extended version of HumanEval with additional tests. Dataset
Prompt:
<SPECIAL_10>System
<SPECIAL_11>User
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.
@@ Instruction
Here is the given problem and test examples:
{question}
Please use the python programming language to solve this problem.
Please make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples.
Please return all completed codes in one code block.
This code block should be in the following format:
```python
# Your codes here```
<SPECIAL_11>Assistant
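Because the coding prompt asks for a single fenced python block, a scorer has to pull that block out of the response before running tests. The following is a minimal sketch of such an extraction step; `extract_python_block` is a hypothetical helper, and the actual post-processing used for the reported scores is not described in this card.

```python
import re
from typing import Optional

# Built programmatically so this example contains no literal fence marker.
FENCE = "`" * 3
BLOCK_RE = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_block(response: str) -> Optional[str]:
    """Return the body of the first fenced python block, or None if absent."""
    match = BLOCK_RE.search(response)
    if match is None:
        return None
    # Trim surrounding newlines but keep the code's own indentation.
    return match.group(1).strip("\n")


response = "Sure:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
print(extract_python_block(response))
```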
Math Evaluations:
| GSM8K 0-shot | MATH-500 0-shot |
|---|---|
| 88.93 | 76.4 |
GSM8K - Evaluates grade school level mathematical word problem solving. Dataset
MATH-500 - A subset of 500 questions from the MATH benchmark. Dataset
Prompt:
<SPECIAL_10>System
<SPECIAL_11>User
Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.
Question: {question}
<SPECIAL_11>Assistant
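The math prompt asks for the final answer inside \boxed{}, so scoring needs to recover that span from the response. The sketch below takes the last \boxed{...} and tracks brace depth so that nested LaTeX such as \frac{1}{2} survives intact. `extract_boxed` is an illustrative helper, and taking the last occurrence is an assumption rather than the documented evaluation procedure.

```python
from typing import Optional

BOXED_OPEN = "\\boxed{"


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind(BOXED_OPEN)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED_OPEN):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace of \boxed{ found.
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


print(extract_boxed("Half of one is \\boxed{\\frac{1}{2}}."))
```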
Tool-Calling Evaluations:
| BFCL v2 Live Overall Accuracy 0-shot |
|---|
| 65.88 |
BFCL v2 Live - Evaluates tool-calling ability of language models over multiple categories in real-world scenarios. Dataset
Prompt:
<SPECIAL_10>System
<AVAILABLE_TOOLS>[{"name": "func_name1", "description": "func_desc1", "parameters": {"type": "dict", "required": ["param1"], "properties": {"param1": {"type": "param_type1", "description": "param_desc1", "default": "default_value1"}}}}, {"name": "func_name2",...]</AVAILABLE_TOOLS>
<SPECIAL_11>User
{question}
<SPECIAL_11>Assistant
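A tool-calling prompt of this shape can be assembled by serializing the tool schemas to JSON and placing them between the AVAILABLE_TOOLS tags. This is a sketch under assumptions: `build_tool_prompt` and the `get_weather` schema are hypothetical, and using `json.dumps` with its default separators is a guess at the serialization, chosen because it matches the spacing shown above.

```python
import json
from typing import Any, Dict, List


def build_tool_prompt(tools: List[Dict[str, Any]], question: str) -> str:
    """Render a tool-calling prompt with the tool list in the system turn."""
    tools_json = json.dumps(tools)
    return (
        "<SPECIAL_10>System\n"
        f"<AVAILABLE_TOOLS>{tools_json}</AVAILABLE_TOOLS>\n"
        "<SPECIAL_11>User\n"
        f"{question}\n"
        "<SPECIAL_11>Assistant\n"
    )


# Hypothetical tool schema, following the field layout shown above.
weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "dict",
        "required": ["city"],
        "properties": {"city": {"type": "string", "description": "City name"}},
    },
}

print(build_tool_prompt([weather_tool], "What is the weather in Paris?"))
```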
Tool Call Response Format:
<TOOLCALL>[{"name": "func_name1", "arguments": {"params_name1": "params_value1", "params_name2": "params_value2"}}, {"name": "func_name2", "arguments": {"params_name1": "params_value1", "params_name2": "params_value2"}}]</TOOLCALL>
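On the consuming side, the response can be parsed by locating the TOOLCALL tags and decoding what lies between them. The sketch below assumes the payload is a JSON list of objects with `name` and `arguments` keys; `parse_tool_calls` is a hypothetical helper, and returning an empty list on malformed output is a design choice made here for illustration.

```python
import json
import re
from typing import Any, Dict, List

TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)


def parse_tool_calls(response: str) -> List[Dict[str, Any]]:
    """Extract well-formed tool calls from a model response."""
    match = TOOLCALL_RE.search(response)
    if match is None:
        return []
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # payload was not valid JSON
    if not isinstance(calls, list):
        return []
    # Keep only entries that carry both expected keys.
    return [
        call for call in calls
        if isinstance(call, dict) and "name" in call and "arguments" in call
    ]


reply = '<TOOLCALL>[{"name": "get_weather", "arguments": {"city": "Paris"}}]</TOOLCALL>'
for call in parse_tool_calls(reply):
    print(call["name"], call["arguments"])
```

Treating unparseable output as "no calls" keeps the caller's control flow simple; a stricter harness might prefer to raise instead so that format failures are counted explicitly.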
General Evaluations:
| MMLU 0-shot (Generative) |
|---|
| 66.96 |
MMLU - Tests knowledge across 57 subjects including science, humanities, math and more. Dataset
Prompt:
<SPECIAL_10>System
<SPECIAL_11>User
Below is a multi-choice question about {subject}. You must reply with only a single letter (either A, B, C or D).
Question: {question}
<SPECIAL_11>Assistant
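Since the MMLU prompt demands a single letter, generative scoring reduces to checking that the reply is one of A, B, C, or D. The following is a deliberately strict sketch: `extract_choice` is a hypothetical helper, and tolerating optional parentheses or a trailing period is an assumption, not the documented scoring rule.

```python
import re
from typing import Optional

CHOICE_RE = re.compile(r"\(?([ABCD])\)?[.:]?")


def extract_choice(response: str) -> Optional[str]:
    """Return the answer letter if the reply is a bare choice, else None."""
    match = CHOICE_RE.fullmatch(response.strip())
    return match.group(1) if match else None


def is_correct(response: str, gold: str) -> bool:
    """Score one reply against the gold letter."""
    return extract_choice(response) == gold


print(extract_choice(" (C) "))
```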
Potential Known Risks for Usage
The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive. Code produced by the model may not always model real-world contexts and should be checked. The model demonstrates weakness to alignment-breaking attacks. Users are advised to deploy language model guardrails alongside this model to prevent potentially harmful outputs.
Inference
- Engine: NeMo
- Test Hardware: NVIDIA H100-80GB
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Responsible Use Guide available at http://nvidia.com/nemotron-responsible-use.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Example
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-H-4B-Instruct-128K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-H-4B-Instruct-128K", torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

# Use the prompt template
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Set an explicit generation budget; 128 is an arbitrary illustrative value
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))