# Model Placeholder

This repository is ready to host optimized model variants for the Unicorn Execution Engine.

## Planned Model Files

### Gemma 3n E2B Variants

- `gemma3n-e2b-fp16-npu.safetensors` (MatFormer FP16 optimized)
- `gemma3n-e2b-int8-npu.safetensors` (MatFormer INT8 quantized)
- `gemma3n-e2b-config.json` (Model configuration)
- `gemma3n-e2b-tokenizer.json` (Tokenizer configuration)
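
Once the Gemma files land, they can be inspected with the standard `safetensors` Python API. A minimal sketch, assuming PyTorch and `safetensors` are installed; the file names match the list above, everything else is illustrative:

```python
import json

from safetensors import safe_open

# Read the model configuration that ships alongside the weights.
with open("gemma3n-e2b-config.json") as f:
    config = json.load(f)

# Memory-map the FP16 weights; tensors are materialized one at a time,
# so the ~4GB file never has to fit in RAM all at once.
with safe_open("gemma3n-e2b-fp16-npu.safetensors", framework="pt") as weights:
    for name in weights.keys():
        tensor = weights.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)
```
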
### Qwen2.5-7B Variants

- `qwen25-7b-fp16-hybrid.safetensors` (Hybrid execution FP16)
- `qwen25-7b-int8-hybrid.safetensors` (Hybrid execution INT8)
- `qwen25-7b-config.json` (Model configuration)
- `qwen25-7b-tokenizer.json` (Tokenizer configuration)
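
If the tokenizer files follow the single-file Hugging Face `tokenizers` format (an assumption; only the file names are fixed so far), a quick round-trip check is possible:

```python
from tokenizers import Tokenizer

# Load the tokenizer definition listed above and round-trip a sample string.
tok = Tokenizer.from_file("qwen25-7b-tokenizer.json")
encoding = tok.encode("Hello from the Unicorn Execution Engine")
print(encoding.ids[:8])           # first few token ids
print(tok.decode(encoding.ids))   # should recover the input text
```
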
### NPU Optimization Files

- `npu_attention_kernels.mlir` (MLIR-AIE kernels)
- `igpu_optimization_configs.json` (ROCm configurations)
- `performance_profiles.json` (Turbo mode profiles)
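
The JSON files lend themselves to a small loader. The sketch below is speculative: the schema (a top-level `profiles` mapping with a `turbo` entry) is a hypothetical placeholder, since only the file names are fixed so far.

```python
import json
from pathlib import Path

def load_profile(path: str = "performance_profiles.json",
                 name: str = "turbo") -> dict:
    """Return one named profile from the profiles file listed above.

    The schema used here (a top-level "profiles" mapping) is a
    hypothetical example, not a documented format.
    """
    profiles = json.loads(Path(path).read_text())
    return profiles["profiles"][name]
```
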
## Model Sizes (Estimated)

- **Gemma 3n E2B FP16**: ~4GB
- **Gemma 3n E2B INT8**: ~2GB
- **Qwen2.5-7B FP16**: ~14GB
- **Qwen2.5-7B INT8**: ~7GB
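
These estimates follow from weight-only arithmetic: parameter count times bytes per weight (2 bytes for FP16, 1 for INT8), ignoring config and tokenizer overhead, and taking the E2B name at face value as roughly 2B effective parameters. A quick check:

```python
def estimated_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Weight-only size estimate: parameters x bytes per weight."""
    return params_billions * bits_per_weight / 8

print(estimated_size_gb(7.0, 16))  # Qwen2.5-7B FP16 -> ~14 GB
print(estimated_size_gb(7.0, 8))   # Qwen2.5-7B INT8 -> ~7 GB
print(estimated_size_gb(2.0, 16))  # Gemma 3n E2B FP16 -> ~4 GB
print(estimated_size_gb(2.0, 8))   # Gemma 3n E2B INT8 -> ~2 GB
```
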
## Performance Targets

- **Gemma 3n E2B**: 100+ tokens per second (TPS) with turbo mode
- **Qwen2.5-7B**: 60+ TPS with hybrid execution
- **Memory Usage**: <10GB total system budget
- **Latency**: <30ms time to first token (TTFT)
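
Both TPS and TTFT can be measured against any streaming generation loop. A minimal harness, assuming a hypothetical `generate_stream(prompt)` generator that yields one token at a time (it is not part of this repository):

```python
import time
from typing import Callable, Iterable

def measure(generate_stream: Callable[[str], Iterable[str]],
            prompt: str) -> tuple[float, float]:
    """Return (time to first token in seconds, tokens per second)."""
    start = time.perf_counter()
    ttft = 0.0
    count = 0
    for _ in generate_stream(prompt):
        if count == 0:
            ttft = time.perf_counter() - start  # first-token latency
        count += 1
    elapsed = time.perf_counter() - start
    return ttft, count / elapsed
```
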
## Creating the Models

To create the optimized models listed above, run the Unicorn Execution Engine quantization pipeline:

```bash
cd Unicorn-Execution-Engine
python quantization_engine.py --model gemma3n-e2b --precision fp16 --target npu
python quantization_engine.py --model qwen25-7b --precision int8 --target hybrid
```
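
After the pipeline finishes, a quick sanity check is to confirm the artifacts exist and sit near the size estimates above; a sketch using the file names from the lists:

```python
from pathlib import Path

for name in ("gemma3n-e2b-fp16-npu.safetensors",
             "qwen25-7b-int8-hybrid.safetensors"):
    artifact = Path(name)
    if artifact.exists():
        print(f"{name}: {artifact.stat().st_size / 1e9:.1f} GB")
    else:
        print(f"{name}: not yet generated")
```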