add usage for llama.cpp and ollama
README.md
</p>

## What's New

- [2025.09.29] **[InfLLM-V2 paper](https://arxiv.org/abs/2509.24663) is released!** We can train a sparse attention model with only 5B long-text tokens. 🔥🔥🔥
- [2025.09.05] The **MiniCPM4.1** series is released! It is a hybrid reasoning model with trainable sparse attention that supports both a deep reasoning mode and a non-reasoning mode. 🔥🔥🔥
- [2025.06.06] The **MiniCPM4** series is released! It delivers major efficiency gains while maintaining optimal performance at the same scale, achieving over 5x generation speedup on typical end-side chips. You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥
For more details about CPM.cu, please refer to [the CPM.cu repo](https://github.com/OpenBMB/cpm.cu).
### Inference with llama.cpp and Ollama

We also support inference with [llama.cpp](https://github.com/ggml-org/llama.cpp) and [Ollama](https://ollama.com/).

##### llama.cpp

You can download the GGUF format of the MiniCPM4.1-8B model from [Hugging Face](https://huggingface.co/openbmb/MiniCPM4.1-8B-GGUF) and run it with llama.cpp for efficient CPU or GPU inference.
```
# case 1: CLI
./build/bin/llama-cli -m MiniCPM4.1-8B-Q4_K_M.gguf -p "Write an article about Artificial Intelligence." -n 1500

# case 2: server
## launch the server
./build/bin/llama-server -m MiniCPM4.1-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 4096 -fa on &

## send a request
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    "max_tokens": 1500
  }'
```
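The same chat request can also be sent from Python with only the standard library. This is a minimal sketch of the curl call above; the `build_chat_request` helper is our own name, and it assumes llama-server is already listening on 127.0.0.1:8080.

```python
import json
import urllib.request

def build_chat_request(prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    # Builds the same OpenAI-compatible chat-completion request as the curl example.
    payload = {
        "model": "gpt-3.5-turbo",  # llama-server accepts an arbitrary model name here
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1500,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# req = build_chat_request("Write an article about Artificial Intelligence.")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```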
##### Ollama

Please refer to the [model hub](https://ollama.com/openbmb/minicpm4.1) to download the model. After installing the ollama package, you can run MiniCPM4.1 with the following command:

```
ollama run openbmb/minicpm4.1
```
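Besides the CLI, a running Ollama daemon also exposes a local REST API. The sketch below builds the equivalent chat request; the `chat_request` helper name is ours, and it assumes the default `ollama serve` port of 11434 with the model already pulled.

```python
import json
import urllib.request

def chat_request(prompt, model="openbmb/minicpm4.1"):
    # Builds a request to the local Ollama /api/chat endpoint.
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single complete JSON response
    }
    return urllib.request.Request(
        "http://127.0.0.1:11434/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# with urllib.request.urlopen(chat_request("Write an article about Artificial Intelligence.")) as resp:
#     print(json.load(resp)["message"]["content"])
```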
### Hybrid Reasoning Mode

MiniCPM4.1 supports a hybrid reasoning mode and can be used in both deep reasoning mode and non-reasoning mode. To control it, set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode, or `enable_thinking=False` for non-reasoning mode. Alternatively, you can append `/no_think` to the end of the query to enable non-reasoning mode; if no special token is added, or `/think` is appended, the model runs in reasoning mode.
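The query-suffix control described above can be applied programmatically; a small sketch (the `apply_reasoning_marker` helper name is ours):

```python
def apply_reasoning_marker(query, enable_thinking):
    # Appends MiniCPM4.1's reasoning-control marker to a user query.
    # No marker (or "/think") keeps deep reasoning mode; "/no_think" disables it.
    marker = "/think" if enable_thinking else "/no_think"
    return f"{query} {marker}"

messages = [{
    "role": "user",
    "content": apply_reasoning_marker(
        "Write an article about Artificial Intelligence.", enable_thinking=False
    ),
}]
```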
```bibtex
@article{minicpm4,
  title={MiniCPM4: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  journal={arXiv preprint arXiv:2506.07900},
  year={2025}
}
```