Text Generation
Transformers
Safetensors
minimax_m2
conversational
custom_code
fp8
sriting committed
Commit cad818a · 1 Parent(s): c706fa1

update mlx-deploy-guide

Files changed (2)
  1. README.md +5 -74
  2. docs/mlx_deploy_guide.md +73 -0
README.md CHANGED
@@ -167,80 +167,6 @@ We look forward to your feedback and to collaborating with developers and resear
167
 
168
  Download the model from the HuggingFace repository: https://huggingface.co/MiniMaxAI/MiniMax-M2. We recommend using the following inference frameworks (listed alphabetically) to serve the model:
169
 
170
- Here's an improved, polished, and professional version of your documentation with better structure, clarity, grammar, accuracy, and usability:
171
-
172
-
173
- ### MLX
174
-
175
- Run, serve, and fine-tune **MiniMax-M2** locally on your Mac using the **MLX** framework. This guide gets you up and running quickly.
176
-
177
- > **Requirements**
178
- > - Apple Silicon Mac (M3 Ultra or later)
179
- > - **At least 256GB of unified memory (RAM)**
180
-
181
-
182
- **Installation**
183
-
184
- Install the `mlx-lm` package via pip:
185
-
186
- ```bash
187
- pip install mlx-lm
188
- ```
189
-
190
- **CLI**
191
-
192
- Generate text directly from the terminal:
193
-
194
- ```bash
195
- mlx_lm.generate \
196
- --model mlx-community/MiniMax-M2-4bit \
197
- --prompt "How tall is Mount Everest?"
198
- ```
199
-
200
- > Add `--max-tokens 256` to control response length, or `--temp 0.7` for creativity.
201
-
202
- **Python Script Example**
203
-
204
- Use `mlx-lm` in your own Python scripts:
205
-
206
- ```python
207
- from mlx_lm import load, generate
208
-
209
- # Load the quantized model
210
- model, tokenizer = load("mlx-community/MiniMax-M2-4bit")
211
-
212
- prompt = "Hello, how are you?"
213
-
214
- # Apply chat template if available (recommended for chat models)
215
- if tokenizer.chat_template is not None:
216
- messages = [{"role": "user", "content": prompt}]
217
- prompt = tokenizer.apply_chat_template(
218
- messages,
219
- tokenize=False,
220
- add_generation_prompt=True
221
- )
222
-
223
- # Generate response
224
- response = generate(
225
- model,
226
- tokenizer,
227
- prompt=prompt,
228
- max_tokens=256,
229
- temp=0.7,
230
- verbose=True
231
- )
232
-
233
- print(response)
234
- ```
235
-
236
- **Tips**
237
- - **Model variants**: Check [Hugging Face](https://huggingface.co/collections/mlx-community/minimax-m2) for `MiniMax-M2-4bit`, `6bit`, `8bit`, or `bfloat16` versions.
238
- - **Fine-tuning**: Use `mlx-lm.lora` for efficient parameter-efficient fine-tuning (PEFT).
239
-
240
- **Resources**
241
- - GitHub: [https://github.com/ml-explore/mlx-lm](https://github.com/ml-explore/mlx-lm)
242
- - Models: [https://huggingface.co/mlx-community](https://huggingface.co/mlx-community)
243
-
244
  ### SGLang
245
 
246
  We recommend using [SGLang](https://docs.sglang.ai/) to serve MiniMax-M2. SGLang provides solid day-0 support for the MiniMax-M2 model. Please refer to our [SGLang Deployment Guide](https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/docs/sglang_deploy_guide.md) for more details; many thanks to the SGLang team for the collaboration.
@@ -249,6 +175,11 @@ We recommend using [SGLang](https://docs.sglang.ai/) to serve MiniMax-M2. SGLang
249
 
250
  We recommend using [vLLM](https://docs.vllm.ai/en/stable/) to serve MiniMax-M2. vLLM provides efficient day-0 support for the MiniMax-M2 model; see https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html for the latest deployment guide. We also provide our [vLLM Deployment Guide](https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/docs/vllm_deploy_guide.md).
251
 
178
+ ### MLX
179
+
180
+ We recommend using [MLX-LM](https://github.com/ml-explore/mlx-lm) to serve MiniMax-M2. Please refer to our [MLX Deployment Guide](https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/docs/mlx_deploy_guide.md) for more details.
181
+
182
+
252
  ### Inference Parameters
253
  We recommend using the following parameters for best performance: `temperature=1.0`, `top_p=0.95`, `top_k=40`.
254
 
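For illustration, a minimal sketch of passing the recommended sampling parameters to a locally served OpenAI-compatible endpoint (the base URL, port, and served model name are assumptions, and whether `top_k` is honored depends on the serving framework):

```python
# Sketch: send the recommended sampling parameters to an OpenAI-compatible
# chat completions endpoint exposed by the serving framework.
# The base URL, port, and served model name below are assumptions.
import requests

payload = {
    "model": "MiniMaxAI/MiniMax-M2",
    "messages": [{"role": "user", "content": "How tall is Mount Everest?"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,        # non-standard OpenAI field; vLLM/SGLang-style servers accept it
    "max_tokens": 256,
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```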
 
docs/mlx_deploy_guide.md ADDED
@@ -0,0 +1,73 @@
4
+ ### MLX
5
+
6
+ Run, serve, and fine-tune [**MiniMax-M2**](https://huggingface.co/MiniMaxAI/MiniMax-M2) locally on your Mac using the **MLX** framework. This guide gets you up and running quickly.
7
+
8
+ > **Requirements**
9
+ > - Apple Silicon Mac (M3 Ultra or later)
10
+ > - **At least 256GB of unified memory (RAM)**
11
+
12
+
13
+ **Installation**
14
+
15
+ Install the `mlx-lm` package via pip:
16
+
17
+ ```bash
18
+ pip install mlx-lm
19
+ ```
20
+
21
+ **CLI**
22
+
23
+ Generate text directly from the terminal:
24
+
25
+ ```bash
26
+ mlx_lm.generate \
27
+ --model mlx-community/MiniMax-M2-4bit \
28
+ --prompt "How tall is Mount Everest?"
29
+ ```
30
+
31
+ > Add `--max-tokens 256` to control response length, or `--temp 0.7` for creativity.
32
+
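As a concrete example of those flags (the prompt and values are taken from the tip above; flag names can differ across `mlx-lm` releases, so check `mlx_lm.generate --help` if a flag is rejected):

```bash
# Same command as above, with explicit length and temperature controls.
mlx_lm.generate \
  --model mlx-community/MiniMax-M2-4bit \
  --prompt "How tall is Mount Everest?" \
  --max-tokens 256 \
  --temp 0.7
```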
33
+ **Python Script Example**
34
+
35
+ Use `mlx-lm` in your own Python scripts:
36
+
37
+ ```python
38
+ from mlx_lm import load, generate
39
+
40
+ # Load the quantized model
41
+ model, tokenizer = load("mlx-community/MiniMax-M2-4bit")
42
+
43
+ prompt = "Hello, how are you?"
44
+
45
+ # Apply chat template if available (recommended for chat models)
46
+ if tokenizer.chat_template is not None:
47
+ messages = [{"role": "user", "content": prompt}]
48
+ prompt = tokenizer.apply_chat_template(
49
+ messages,
50
+ tokenize=False,
51
+ add_generation_prompt=True
52
+ )
53
+
54
+ # Generate response
55
+ response = generate(
56
+ model,
57
+ tokenizer,
58
+ prompt=prompt,
59
+ max_tokens=256,
60
+ temp=0.7,
61
+ verbose=True
62
+ )
63
+
64
+ print(response)
65
+ ```
66
+
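Note that recent `mlx-lm` releases route sampling options through a sampler object instead of a `temp` keyword on `generate`. If the call above raises an unexpected-keyword error, here is a minimal sketch of the sampler-based variant (assuming `mlx_lm.sample_utils.make_sampler` is available in your installed version):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler  # present in recent mlx-lm releases

# Load the quantized model
model, tokenizer = load("mlx-community/MiniMax-M2-4bit")

# Build the chat prompt
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Sampling parameters (temperature, top_p, ...) go through the sampler here.
sampler = make_sampler(temp=0.7, top_p=0.95)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    sampler=sampler,
    verbose=True,
)
print(response)
```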
67
+ **Tips**
68
+ - **Model variants**: Check [Hugging Face](https://huggingface.co/collections/mlx-community/minimax-m2) for `MiniMax-M2-4bit`, `6bit`, `8bit`, or `bfloat16` versions.
69
+ - **Fine-tuning**: Use `mlx_lm.lora` for parameter-efficient fine-tuning (PEFT); see the sketch after this list.
70
+
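For the fine-tuning tip above, a minimal LoRA training sketch (the `./data` path is a placeholder for a directory containing `train.jsonl` and `valid.jsonl`, and flag names can vary between `mlx-lm` versions; run `mlx_lm.lora --help` to confirm):

```bash
# Parameter-efficient LoRA fine-tuning sketch; ./data is a placeholder dataset directory.
mlx_lm.lora \
  --model mlx-community/MiniMax-M2-4bit \
  --train \
  --data ./data \
  --iters 600 \
  --batch-size 1
```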
71
+ **Resources**
72
+ - GitHub: [https://github.com/ml-explore/mlx-lm](https://github.com/ml-explore/mlx-lm)
73
+ - Models: [https://huggingface.co/mlx-community](https://huggingface.co/mlx-community)