xcjthu committed
Commit 29e0d7a · 1 Parent(s): 6170bc9

add usage for llama.cpp and ollama

Files changed (1):
  1. README.md +35 -2
README.md CHANGED
@@ -20,6 +20,7 @@ library_name: transformers
</p>

## What's New
+ - [2025.09.29] **[InfLLM-V2 paper](https://arxiv.org/abs/2509.24663) is released!** We can train a sparse attention model with only 5B long-text tokens. 🔥🔥🔥
- [2025.09.05] **MiniCPM4.1** series is released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
- [2025.06.06] **MiniCPM4** series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥
@@ -483,6 +484,37 @@ python3 -m cpmcu.cli \

For more details about CPM.cu, please refer to [the repo CPM.cu](https://github.com/OpenBMB/cpm.cu).

+ ### Inference with llama.cpp and Ollama
+
+ We also support inference with [llama.cpp](https://github.com/ggml-org/llama.cpp) and [Ollama](https://ollama.com/).
+
+ #### llama.cpp
+
+ You can download MiniCPM4.1-8B in GGUF format from [Hugging Face](https://huggingface.co/openbmb/MiniCPM4.1-8B-GGUF) and run it with llama.cpp for efficient CPU or GPU inference.
+ ```bash
+ # case 1: llama-cli
+ ./build/bin/llama-cli -m MiniCPM4.1-8B-Q4_K_M.gguf -p "Write an article about Artificial Intelligence." -n 1500
+
+ # case 2: server
+ ## launch the server
+ ./build/bin/llama-server -m MiniCPM4.1-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 4096 -fa on &
+
+ ## send a request (the "model" field is a placeholder; llama-server answers with the loaded GGUF)
+ curl -X POST http://127.0.0.1:8080/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "gpt-3.5-turbo",
+ "messages": [{"role": "user", "content": "Write an article about Artificial Intelligence."}],
+ "max_tokens": 1500
+ }'
+ ```
+
+ #### Ollama
+ Please refer to the [model hub](https://ollama.com/openbmb/minicpm4.1) to download the model. After installing Ollama, you can run MiniCPM4.1 with the following command:
+ ```bash
+ ollama run openbmb/minicpm4.1
+ ```
+
### Hybrid Reasoning Mode

MiniCPM4.1 supports a hybrid reasoning mode and can be used in both deep reasoning mode and non-reasoning mode. Users can set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode, and `enable_thinking=False` to enable non-reasoning mode. Alternatively, users can append `/no_think` to the end of the query to enable non-reasoning mode; if no special token is added, or `/think` is appended, the model runs in reasoning mode.
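
A minimal sketch of this switch in `transformers`, assuming the model's chat template forwards the `enable_thinking` kwarg as described above (the generation settings are illustrative, not taken from this repo):

```python
# Minimal sketch of the hybrid reasoning switch described above.
# Assumption: the chat template of openbmb/MiniCPM4.1-8B honors enable_thinking.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openbmb/MiniCPM4.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Why is the sky blue?"}]

# Reasoning mode (default): enable_thinking=True, or append "/think" to the query.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-reasoning mode: enable_thinking=False, or append "/no_think" to the query:
# messages = [{"role": "user", "content": "Why is the sky blue? /no_think"}]

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```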
@@ -518,8 +550,9 @@ prompt_text = tokenizer.apply_chat_template(

```bibtex
@article{minicpm4,
- title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
- author={MiniCPM Team},
+ title={{MiniCPM4}: Ultra-Efficient {LLMs} on End Devices},
+ author={{MiniCPM Team}},
+ journal={arXiv preprint arXiv:2506.07900},
year={2025}
}
```
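
One practical note on the llama.cpp section added above: the GGUF file can also be fetched programmatically before invoking `llama-cli` or `llama-server`. A minimal sketch using `huggingface_hub` (the quant filename is the one used in the commands above; other variants from the repo work the same way):

```python
# Fetch the GGUF weights referenced in the llama.cpp section above.
# hf_hub_download is part of the huggingface_hub package (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="openbmb/MiniCPM4.1-8B-GGUF",
    filename="MiniCPM4.1-8B-Q4_K_M.gguf",  # quant variant used in the commands above
)
print(gguf_path)  # pass this path to llama-cli / llama-server via -m
```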
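
And for the Ollama path added above: besides interactive `ollama run`, a local Ollama instance exposes its REST API on port 11434, which pairs naturally with the `/no_think` switch from the hybrid reasoning section. A minimal sketch (the model tag comes from the commit; the endpoint and payload shape are standard Ollama API usage, not from this repo):

```python
# Query a locally running Ollama server (started via `ollama serve` or the desktop app).
# The /api/chat endpoint and default port 11434 are standard Ollama behavior.
import json
import urllib.request

payload = {
    "model": "openbmb/minicpm4.1",  # tag from the model hub referenced above
    "messages": [
        # /no_think suffix selects non-reasoning mode, as described earlier
        {"role": "user", "content": "Write an article about Artificial Intelligence. /no_think"}
    ],
    "stream": False,  # return a single JSON response instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(body["message"]["content"])
```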