a new model trained entirely from scratch on Turkish data :: vngrs-ai/Kumru-2B
https://huggingface.co/vngrs-ai/Kumru-2B
The main model (Kumru) is based on the Mistral v0.3 and LLaMA-3 architectures, both high-performance, efficient designs widely used in modern decoder-only large language models. In other words, Kumru was trained from scratch on Turkish data, going through a completely new pre-training and instruction-tuning process without any transfer learning.
In addition, the training infrastructure used Nvidia H200 GPUs, and modern optimization techniques (the AdamW optimizer, flash attention, and mixed-precision training) were applied during training.
In short, it was developed as a modern large language model based on the Mistral + LLaMA-3 architectures.
I believe that quantizing this model will benefit many Turkish users. Thank you.
Unfortunately this model seems to have multiple issues that make it impossible to convert to GGUF using llama.cpp:
- The BPE pre-tokenizer is missing from llama.cpp, which probably means a custom tokenizer was used. If we knew that one already supported by llama.cpp is compatible with this model we could use it instead, but especially for a language-specific model with its own architecture we should only do so after the original model author confirms that doing so won't degrade the model's quality.
- `FileNotFoundError: File not found: Kumru-2B/tokenizer.model` — so even once we specify a BPE pre-tokenizer, we likely need a `tokenizer.model` file, as `tokenizer.json` alone does not seem to be enough to convert this specific model.
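As background, llama.cpp's convert script decides which pre-tokenizer rules to apply by fingerprinting the tokenizer: it hashes the token IDs produced for a fixed probe string and looks the hash up in a table of known tokenizers (the `chkhsh` value in the warning). A minimal sketch of the idea, with made-up token IDs rather than Kumru's actual tokenizer output:

```python
import hashlib

def tokenizer_fingerprint(token_ids: list[int]) -> str:
    # Hash the token IDs a tokenizer produces for a fixed probe string;
    # llama.cpp's convert_hf_to_gguf.py uses the same idea (a sha256
    # "chkhsh") to map a tokenizer to a known pre-tokenizer configuration.
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

# Two hypothetical tokenizers that split the same text differently yield
# different fingerprints, so an unseen splitting scheme is detected.
ids_known = [1, 4521, 892, 2]      # illustrative IDs, not Kumru's
ids_unknown = [1, 77, 3012, 55, 2]
print(tokenizer_fingerprint(ids_known) != tokenizer_fingerprint(ids_unknown))  # True
```

A custom Turkish tokenizer would naturally produce a hash that is not in the table, which matches the warning below.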
Here is the error we got when trying to convert this model:
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:** There are 2 possible reasons for this:
WARNING:hf-to-gguf:** - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:** - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:** Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref: https://github.com/ggml-org/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh: ddc13456fcb8e77e7d2d74c34572006552a5ec4b91bd9d2a896934e0c1a6f01c
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:
Traceback (most recent call last):
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 2105, in set_vocab
self._set_vocab_sentencepiece()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 994, in _set_vocab_sentencepiece
tokens, scores, toktypes = self._create_vocab_sentencepiece()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 1011, in _create_vocab_sentencepiece
raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: Kumru-2B/tokenizer.model
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 2108, in set_vocab
self._set_vocab_llama_hf()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 1096, in _set_vocab_llama_hf
vocab = gguf.LlamaHfVocab(self.dir_model)
File "/llmjob/llama.cpp/gguf-py/gguf/vocab.py", line 515, in __init__
raise TypeError('Llama 3 must be converted with BpeVocab')
TypeError: Llama 3 must be converted with BpeVocab
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 9418, in <module>
main()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 9412, in main
model_instance.write()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 440, in write
self.prepare_metadata(vocab_only=False)
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 561, in prepare_metadata
self.set_vocab()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 2111, in set_vocab
self._set_vocab_gpt2()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 930, in _set_vocab_gpt2
tokens, toktypes, tokpre = self.get_vocab_base()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 651, in get_vocab_base
tokpre = self.get_vocab_base_pre(tokenizer)
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 918, in get_vocab_base_pre
raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
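For context, the final `NotImplementedError` comes from a lookup of the computed `chkhsh` against a table of known hashes in `get_vocab_base_pre()`; an unknown hash aborts the conversion. A minimal sketch of that lookup (the hash value and table contents here are illustrative, not llama.cpp's actual data):

```python
# Hypothetical fingerprint table; the entry below is made up, not a real
# hash from convert_hf_to_gguf.py.
KNOWN_PRE_TOKENIZERS = {
    "0ef9807a4087ebef" + "0" * 48: "llama-bpe",
}

def resolve_pre_tokenizer(chkhsh: str) -> str:
    # An unrecognized fingerprint is exactly what aborts the Kumru
    # conversion, until the new tokenizer is added to the table.
    if chkhsh not in KNOWN_PRE_TOKENIZERS:
        raise NotImplementedError(
            "BPE pre-tokenizer was not recognized - update get_vocab_base_pre()"
        )
    return KNOWN_PRE_TOKENIZERS[chkhsh]
```

So the fix would be for someone who knows the tokenizer's behavior (ideally the model authors) to add Kumru's fingerprint and matching pre-tokenizer rules upstream.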