This model has been quantized using [GPTQModel](https://github.com/ModelCloud/GPTQModel).

- **bits**: 4
- **group_size**: 128
- **desc_act**: true
- **static_groups**: false
- **sym**: true
- **lm_head**: false
- **damp_percent**: 0.01
- **true_sequential**: true
- **model_name_or_path**: ""
- **model_file_base_name**: "model"
- **quant_method**: "gptq"
- **checkpoint_format**: "gptq"
- **meta**:
  - **quantizer**: "gptqmodel:0.9.9-dev0"
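For reference, the configuration above maps onto GPTQModel's quantization workflow roughly as sketched below. This is a hedged illustration, not the exact script used to produce this checkpoint: it assumes GPTQModel exposes a `QuantizeConfig` class whose constructor keywords mirror the `quantize_config.json` keys listed above (older releases may name it `BaseQuantizeConfig`), and that `GPTQModel.from_pretrained` / `quantize` / `save_quantized` follow the library's usual flow. The base model id and calibration text are placeholders.

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

# Placeholder: the full-precision base model this checkpoint was derived from.
base_model_id = "deepseek-ai/DeepSeek-V2-Chat-0628"

tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)

# GPTQ needs a small calibration set; a few representative, tokenized text
# samples suffice for a sketch (use real in-domain data in practice).
calibration_dataset = [
    tokenizer("GPTQ calibrates 4-bit weights against a few text samples like this one.")
]

# Assumption: these keyword names mirror the quantize_config.json keys above.
quantize_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    static_groups=False,
    sym=True,
    lm_head=False,
    damp_percent=0.01,
    true_sequential=True,
)

model = GPTQModel.from_pretrained(base_model_id, quantize_config, trust_remote_code=True)
model.quantize(calibration_dataset)
model.save_quantized("DeepSeek-V2-Chat-0628-gptq-4bit")
```

Note that `desc_act=true` with a group size of 128 generally favors quantization accuracy at some cost in inference speed.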
You can use [GPTQModel](https://github.com/ModelCloud/GPTQModel) for model inference:
```python
import torch
from transformers import AutoTokenizer, GenerationConfig
from gptqmodel import GPTQModel

model_name = "ModelCloud/DeepSeek-V2-Chat-0628-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# `max_memory` should be set based on your devices; here: two GPUs with 75GB each.
max_memory = {i: "75GB" for i in range(2)}
# `device_map` cannot be set to `auto`; use `sequential` so layers fill one GPU
# before spilling onto the next.
model = GPTQModel.from_quantized(
    model_name,
    trust_remote_code=True,
    device_map="sequential",
    max_memory=max_memory,
    torch_dtype=torch.float16,
    attn_implementation="eager",
)

# Load the model's generation defaults and make padding explicit.
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids=input_tensor.to(model.device), max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```