a new model trained entirely from scratch on Turkish data :: vngrs-ai/Kumru-2B
https://huggingface.co/vngrs-ai/Kumru-2B
The main model (Kumru) is based on the Mistral v0.3 and LLaMA-3 architectures, both high-performance, efficient designs widely used in modern decoder-only large language models. In other words, Kumru was trained from scratch on Turkish data, going through a completely new pre-training and instruction-tuning process without any transfer learning.
In addition, the training infrastructure used Nvidia H200 GPUs, and modern optimization techniques (the AdamW optimizer, flash attention, and mixed-precision training) were applied during training.
In short, it was developed as a modern large language model based on the Mistral + LLaMA-3 architectures.
I believe that quantizing this model will benefit many Turkish users. Thank you.
Unfortunately this model seems to have multiple issues that make it impossible to convert to GGUF using llama.cpp:
- The BPE pre-tokenizer is missing from llama.cpp, which probably means a custom tokenizer was used. If we knew that one already supported by llama.cpp is compatible with this model we could use it instead, but especially for a language-specific model with its own architecture we should only do so after the original model author confirms that doing so won't degrade the model's quality.
- `FileNotFoundError: File not found: Kumru-2B/tokenizer.model` — so even once we specify a BPE pre-tokenizer, we likely need a `tokenizer.model` file, as `tokenizer.json` alone does not seem to be enough to convert this specific model.
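As background, llama.cpp's convert script decides which pre-tokenizer rules to apply by fingerprinting the tokenizer: it hashes the token IDs produced for a fixed probe string and looks the hash up in a table of known tokenizers (the `chkhsh` value in the warning). A minimal sketch of the idea, with made-up token IDs rather than Kumru's actual tokenizer output:

```python
import hashlib

def tokenizer_fingerprint(token_ids: list[int]) -> str:
    # Hash the token IDs a tokenizer produces for a fixed probe string;
    # llama.cpp's convert_hf_to_gguf.py uses the same idea (a sha256
    # "chkhsh") to map a tokenizer to a known pre-tokenizer configuration.
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

# Two hypothetical tokenizers that split the same text differently yield
# different fingerprints, so an unseen splitting scheme is detected.
ids_known = [1, 4521, 892, 2]      # illustrative IDs, not Kumru's
ids_unknown = [1, 77, 3012, 55, 2]
print(tokenizer_fingerprint(ids_known) != tokenizer_fingerprint(ids_unknown))  # True
```

A custom Turkish tokenizer would naturally produce a hash that is not in the table, which matches the warning below.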
Here is the error we got when trying to convert this model:
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:** There are 2 possible reasons for this:
WARNING:hf-to-gguf:** - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:** - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:** Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref: https://github.com/ggml-org/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh: ddc13456fcb8e77e7d2d74c34572006552a5ec4b91bd9d2a896934e0c1a6f01c
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:
Traceback (most recent call last):
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 2105, in set_vocab
self._set_vocab_sentencepiece()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 994, in _set_vocab_sentencepiece
tokens, scores, toktypes = self._create_vocab_sentencepiece()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 1011, in _create_vocab_sentencepiece
raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: Kumru-2B/tokenizer.model
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 2108, in set_vocab
self._set_vocab_llama_hf()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 1096, in _set_vocab_llama_hf
vocab = gguf.LlamaHfVocab(self.dir_model)
File "/llmjob/llama.cpp/gguf-py/gguf/vocab.py", line 515, in __init__
raise TypeError('Llama 3 must be converted with BpeVocab')
TypeError: Llama 3 must be converted with BpeVocab
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 9418, in <module>
main()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 9412, in main
model_instance.write()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 440, in write
self.prepare_metadata(vocab_only=False)
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 561, in prepare_metadata
self.set_vocab()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 2111, in set_vocab
self._set_vocab_gpt2()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 930, in _set_vocab_gpt2
tokens, toktypes, tokpre = self.get_vocab_base()
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 651, in get_vocab_base
tokpre = self.get_vocab_base_pre(tokenizer)
File "/llmjob/llama.cpp/convert_hf_to_gguf.py", line 918, in get_vocab_base_pre
raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
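For context, the final `NotImplementedError` comes from a lookup of the computed `chkhsh` against a table of known hashes in `get_vocab_base_pre()`; an unknown hash aborts the conversion. A minimal sketch of that lookup (the hash value and table contents here are illustrative, not llama.cpp's actual data):

```python
# Hypothetical fingerprint table; the entry below is made up, not a real
# hash from convert_hf_to_gguf.py.
KNOWN_PRE_TOKENIZERS = {
    "0ef9807a4087ebef" + "0" * 48: "llama-bpe",
}

def resolve_pre_tokenizer(chkhsh: str) -> str:
    # An unrecognized fingerprint is exactly what aborts the Kumru
    # conversion, until the new tokenizer is added to the table.
    if chkhsh not in KNOWN_PRE_TOKENIZERS:
        raise NotImplementedError(
            "BPE pre-tokenizer was not recognized - update get_vocab_base_pre()"
        )
    return KNOWN_PRE_TOKENIZERS[chkhsh]
```

So the fix would be for someone who knows the tokenizer's behavior (ideally the model authors) to add Kumru's fingerprint and matching pre-tokenizer rules upstream.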