---
language:
- en
base_model:
- openai/gpt-oss-20b
pipeline_tag: text-generation
tags:
- gpt_oss
- vllm
- conversational
- text-generation-inference
- 8-bit precision
- mxfp4
license: apache-2.0
license_name: apache-2.0
name: RedHatAI/gpt-oss-20b
description: This model is the smaller version of the gpt-oss series, designed for lower latency and local or specialized use cases.
readme: https://huggingface.co/RedHatAI/gpt-oss-20b/main/README.md
tasks:
- text-to-text
- text-generation
provider: OpenAI
license_link: https://www.apache.org/licenses/LICENSE-2.0
validated_on:
- RHOAI 2.25
- RHAIIS 3.2.2
---

<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
  gpt-oss-20b
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>

<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>

<p>
  <a href="https://gpt-oss.com"><strong>Try gpt-oss</strong></a> ·
  <a href="https://cookbook.openai.com/topic/gpt-oss"><strong>Guides</strong></a> ·
  <a href="https://arxiv.org/abs/2508.10925"><strong>Model card</strong></a> ·
  <a href="https://openai.com/index/introducing-gpt-oss/"><strong>OpenAI blog</strong></a>
</p>

Welcome to the gpt-oss series, [OpenAI’s open-weight models](https://openai.com/open-models) designed for powerful reasoning, agentic tasks, and versatile developer use cases.

We’re releasing two flavors of these open models:
- `gpt-oss-120b` — for production, general purpose, high reasoning use cases that fit into a single 80GB GPU (like NVIDIA H100 or AMD MI300X) (117B parameters with 5.1B active parameters)
- `gpt-oss-20b` — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

Both models were trained on our [harmony response format](https://github.com/openai/harmony) and should only be used with the harmony format; they will not work correctly otherwise.

> [!NOTE]
> This model card is dedicated to the smaller `gpt-oss-20b` model. Check out [`gpt-oss-120b`](https://huggingface.co/RedHatAI/gpt-oss-120b) for the larger model.

# Highlights

* **Permissive Apache 2.0 license:** Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
* **Configurable reasoning effort:** Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
* **Full chain-of-thought:** Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
* **Fine-tunable:** Fully customize models to your specific use case through parameter fine-tuning.
* **Agentic capabilities:** Use the models’ native capabilities for function calling, [web browsing](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#browser), [Python code execution](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#python), and Structured Outputs.
* **MXFP4 quantization:** The models were post-trained with MXFP4 quantization of the MoE weights, making `gpt-oss-120b` run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the `gpt-oss-20b` model run within 16GB of memory. All evals were performed with the same MXFP4 quantization.

---

# Inference examples

## Transformers

You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the [harmony response format](https://github.com/openai/harmony). If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our [openai-harmony](https://github.com/openai/harmony) package.

To get started, install the necessary dependencies to set up your environment:

```
pip install -U transformers kernels torch
```

Once set up, you can run the model with the snippet below:

```py
from transformers import pipeline
import torch

model_id = "openai/gpt-oss-20b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```
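
If you call `model.generate` directly instead of using the pipeline, apply the chat template first so the prompt is rendered in the harmony format. A minimal sketch (the prompt is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# The chat template renders the conversation into the harmony format before tokenizing.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated completion.
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```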
					
						
Alternatively, you can run the model via [`Transformers Serve`](https://huggingface.co/docs/transformers/main/serving) to spin up an OpenAI-compatible web server:

```
transformers serve
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-20b
```

[Learn more about how to use gpt-oss with Transformers.](https://cookbook.openai.com/articles/gpt-oss/run-transformers)

## vLLM

vLLM recommends using [uv](https://docs.astral.sh/uv/) for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

```bash
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b
```
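
With the server running, you can send requests from any OpenAI-compatible client. A minimal sketch using the `openai` Python package (the base URL and model name assume the `vllm serve` command above with its default port 8000; the `EMPTY` API key is a placeholder since a local server needs no real key):

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Explain quantum mechanics clearly and concisely."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```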
					
						

[Learn more about how to use gpt-oss with vLLM.](https://cookbook.openai.com/articles/gpt-oss/run-vllm)

<details>
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>

```bash
podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/gpt-oss-20b
```
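
Once the container is up, the server exposes the usual OpenAI-compatible API on the published port 8000. A minimal sketch of a streaming request with the `openai` Python client (illustrative; the prompt and placeholder API key are assumptions):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream tokens from the locally served model as they are generated.
stream = client.chat.completions.create(
    model="RedHatAI/gpt-oss-20b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```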
					
						
</details>


<details>
<summary>Deploy on <strong>Red Hat Openshift AI</strong></summary>

```yaml
# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.25-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.25-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: gpt-oss-20b # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: gpt-oss-20b # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2'            # this is model specific
          memory: 8Gi         # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests:             # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/gpt-oss-20b:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# make sure first to be in the project where you want to deploy the model
# oc project <project-name>

# apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```

See [Red Hat Openshift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>


## PyTorch / Triton

To learn about how to use this model with PyTorch and Triton, check out our [reference implementations in the gpt-oss repository](https://github.com/openai/gpt-oss?tab=readme-ov-file#reference-pytorch-implementation).

## Ollama

If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after [installing Ollama](https://ollama.com/download).

```bash
# gpt-oss-20b
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
```

[Learn more about how to use gpt-oss with Ollama.](https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama)

#### LM Studio

If you are using [LM Studio](https://lmstudio.ai/) you can use the following command to download the model.

```bash
# gpt-oss-20b
lms get openai/gpt-oss-20b
```

Check out our [awesome list](https://github.com/openai/gpt-oss/blob/main/awesome-gpt-oss.md) for a broader collection of gpt-oss resources and inference partners.

---

# Download the model

You can download the model weights from the [Hugging Face Hub](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4) directly using the Hugging Face CLI:

```shell
# gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
pip install gpt-oss
python -m gpt_oss.chat model/
```

# Reasoning levels

You can adjust the reasoning level to suit your task across three levels:

* **Low:** Fast responses for general dialogue.
* **Medium:** Balanced speed and detail.
* **High:** Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
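
For example, with the Transformers pipeline from earlier, you can pass the reasoning level as a system message; a minimal sketch (reusing the `pipe` object defined in the Transformers section above, with an illustrative prompt):

```python
messages = [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# The chat template folds the system message (including the reasoning level)
# into the harmony-formatted prompt.
outputs = pipe(messages, max_new_tokens=512)
print(outputs[0]["generated_text"][-1])
```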
					
						

# Tool use

The gpt-oss models are excellent for:
* Web browsing (using built-in browsing tools)
* Function calling with defined schemas (see the sketch after this list)
* Agentic operations like browser tasks
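
As an illustration of function calling with a defined schema, here is a minimal sketch against an OpenAI-compatible endpoint such as the vLLM server started earlier (the `get_weather` schema, base URL, and prompt are illustrative assumptions; tool-calling support and any required server flags depend on your serving stack):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative tool schema; the model decides whether to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```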
					
						

# Fine-tuning

Both gpt-oss models can be fine-tuned for a variety of specialized use cases.

The smaller `gpt-oss-20b` model can be fine-tuned on consumer hardware, whereas the larger [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) can be fine-tuned on a single H100 node.
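
One common parameter-efficient approach is to attach LoRA adapters and train only those; a minimal sketch with the `peft` library (the target module names and hyperparameters are illustrative assumptions, not an official recipe, and memory requirements will vary with your setup):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", torch_dtype="auto", device_map="auto"
)

# Attach small trainable LoRA matrices to the attention projections;
# the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, train with your preferred trainer on conversations rendered
# through the chat template so they stay in the harmony format.
```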
					
						

# Citation

```bibtex
@misc{openai2025gptoss120bgptoss20bmodel,
  title={gpt-oss-120b & gpt-oss-20b Model Card},
  author={OpenAI},
  year={2025},
  eprint={2508.10925},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.10925},
}
```