Update README.md
README.md CHANGED
@@ -187,8 +187,8 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-h
 
 | Benchmark |  |  |
 |----------------------------------|----------------|---------------------------|
-|  | google/gemma-3-12b-it |
-| mmlu |
+|  | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-FP8 |
+| mmlu | 71.51 | 71.30 |
 
 
 <details>
@@ -204,7 +204,7 @@ lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it --tasks mmlu --
 
 ## FP8
 ```Shell
-export MODEL=
+export MODEL=pytorch/gemma-3-12b-it-FP8
 lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 --batch_size 8
 ```
 </details>
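The `pytorch/gemma-3-12b-it-FP8` checkpoint referenced in this hunk is a float8-quantized variant of the base model. As a rough sketch of how such a checkpoint can be produced with torchao (the `quantize_` call and `Float8DynamicActivationFloat8WeightConfig` below follow torchao's quantization API, but this is an assumption about the recipe, not the exact script behind the published checkpoint):

```Python
# Hypothetical sketch of FP8 quantization with torchao; the published
# checkpoint may have been produced with a different recipe or entry point.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig, PerRow

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it", torch_dtype=torch.bfloat16, device_map="cuda"
)

# Swap linear layers for float8 dynamic-activation / float8-weight
# versions with per-row scales.
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# torchao tensor subclasses are saved as a plain PyTorch state dict,
# hence safe_serialization=False (and the --pt-load-map-location flag
# used when serving the checkpoint with vLLM).
model.save_pretrained("gemma-3-12b-it-FP8", safe_serialization=False)
```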
@@ -218,8 +218,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -
 
 | Benchmark |  |  |
 |------------------|----------------|--------------------------------|
-|  | google/gemma-3-12b-it |
-| Peak Memory (GB) |
+|  | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-FP8 |
+| Peak Memory (GB) | 24.50 | 15.47 (37% reduction) |
 
 
 
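The peak-memory figures in this hunk come from a snippet like the one the next hunk header quotes (`print(f"Peak Memory Usage: {mem:.02f} GB")`). A minimal sketch of that style of measurement using the standard `torch.cuda` accounting APIs (the prompt and generation length are illustrative placeholders, not necessarily the exact settings used):

```Python
# Minimal sketch of a peak-memory measurement; prompt and token counts
# are assumptions, not the exact benchmark configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-12b-it"  # or the FP8 checkpoint

torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to("cuda")
model.generate(**inputs, max_new_tokens=128)

# Highest CUDA memory allocated at any point across load + generate.
mem = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```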
@@ -279,7 +279,8 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 | Benchmark (Latency) |  |  |
 |----------------------------------|----------------|--------------------------|
 |  | google/gemma-3-12b-it | jerryzh168/gemma-3-12b-it-FP8 |
-| latency (batch_size=1) |
+| latency (batch_size=1) | 3.73s | 2.76s (1.35x speedup) |
+| latency (batch_size=256) | 13.63s | 11.49s (1.19x speedup) |
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
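The latency rows above are reproduced with vLLM's `benchmarks/benchmark_latency.py` (commands in the collapsed section below). For a rough offline equivalent, a hand-rolled timing of vLLM's `LLM.generate` looks like this (a sketch, not the benchmark script itself; the warmup pass and token counts are assumptions):

```Python
# Rough stand-in for benchmarks/benchmark_latency.py: time one
# end-to-end generate call after a warmup pass.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="jerryzh168/gemma-3-12b-it-FP8")
params = SamplingParams(max_tokens=256, ignore_eos=True)

llm.generate(["warmup"], params)  # exclude one-time compile/cache costs

start = time.perf_counter()
llm.generate(["What are we having for dinner?"], params)
print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")
```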
@@ -311,48 +312,6 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
 export MODEL=jerryzh168/gemma-3-12b-it-FP8
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
 ```
-
-## benchmark_serving
-
-We benchmarked the throughput in a serving environment.
-
-Download sharegpt dataset:
-
-```Shell
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-```
-
-
-
-Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
-
-Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmark_serving` script.
-
-### baseline
-Server:
-```Shell
-export MODEL=google/gemma-3-12b-it
-vllm serve $MODEL --tokenizer $MODEL -O3
-```
-
-Client:
-```Shell
-export MODEL=google/gemma-3-12b-it
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
-```
-
-### FP8
-Server:
-```Shell
-export MODEL=jerryzh168/gemma-3-12b-it-FP8
-VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer $MODEL -O3 --pt-load-map-location cuda:0
-```
-
-Client:
-```Shell
-export MODEL=jerryzh168/gemma-3-12b-it-FP8
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
-```
 </details>
 
 