SentenceTransformer based on izdastolga/checkpoint-18-Aixr-Turkish-QA

This is a sentence-transformers model finetuned from izdastolga/checkpoint-18-Aixr-Turkish-QA on the aixr-msmarco-rag-wikirag-gpt-soru_cevap-tur_hist_quad dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: izdastolga/checkpoint-18-Aixr-Turkish-QA
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: aixr-msmarco-rag-wikirag-gpt-soru_cevap-tur_hist_quad

Model Sources

  • Documentation: Sentence Transformers Documentation (https://www.sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
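
The three modules run the input through the XLM-RoBERTa encoder, mean-pool the token embeddings over non-padding tokens, and L2-normalize the result. As a minimal sketch of what each module computes, assuming the checkpoint's transformer weights load with plain transformers (the Sentence Transformers API below is the recommended path):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "tizdas/checkpoint-1925-aixr-msmarco-rag-wikirag-gpt-soru_cevap-tur_hist_quad-0.8030"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

def embed(texts):
    # (0) Transformer: tokenize and encode, truncating at max_seq_length=512
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 1024)
    # (1) Pooling: mean over non-padding tokens (pooling_mode_mean_tokens=True)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
    # (2) Normalize: L2-normalize so dot products equal cosine similarities
    return F.normalize(pooled, p=2, dim=1)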

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tizdas/checkpoint-1925-aixr-msmarco-rag-wikirag-gpt-soru_cevap-tur_hist_quad-0.8030")
# Run inference
sentences = [
    # the query, with the instruction prefix the model was trained with
    'Instruct: Given a Turkish search query, retrieve relevant passages written in Turkish that best answer the query.\nQuery: kiribati vanuatu nerede bulunur',
    # a relevant passage
    "Kiribati'nin en doğudaki adaları, Hawaii'nin güneyindeki güney Line Adaları, Dünya'daki en gelişmiş zamana sahiptir: UTC + 14 saat. Kiribati 1979'da Birleşik Krallık'tan bağımsız oldu. Başkent ve şimdi en kalabalık bölge olan Güney Tarawa, bir dizi geçitle bağlı olan bir dizi adacıktan oluşur.",
    # an unrelated passage
    'Memlük Sultanı Mansur Seyfettin Kalaunun',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
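
Because the model was trained with an instruction prefix on the query side only (see the prompts entry under Training Hyperparameters), a typical retrieval setup prepends that prefix to queries and encodes passages as-is. A minimal semantic-search sketch follows; the query and passages are illustrative:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tizdas/checkpoint-1925-aixr-msmarco-rag-wikirag-gpt-soru_cevap-tur_hist_quad-0.8030")

# the instruction prefix used on the anchor side during training
prompt = (
    "Instruct: Given a Turkish search query, retrieve relevant passages "
    "written in Turkish that best answer the query.\nQuery: "
)

query = "kiribati vanuatu nerede bulunur"
passages = [
    "Kiribati, Büyük Okyanus'ta yer alan bir ada ülkesidir.",  # illustrative relevant passage
    "Memlük Sultanı Mansur Seyfettin Kalaunun",                # unrelated passage
]

# queries get the prefix via `prompt`; passages are encoded as-is
query_emb = model.encode([query], prompt=prompt)
passage_embs = model.encode(passages)

# embeddings are L2-normalized, so cosine similarity ranks the passages
scores = model.similarity(query_emb, passage_embs)  # shape [1, 2]
print(passages[int(scores[0].argmax())])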

Training Details

Training Dataset

aixr-msmarco-rag-wikirag-gpt-soru_cevap-tur_hist_quad

  • Dataset: aixr-msmarco-rag-wikirag-gpt-soru_cevap-tur_hist_quad at 5e80fa0
  • Size: 415,046 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor: type string; min: 36 tokens, mean: 44.21 tokens, max: 131 tokens
    • positive: type string; min: 3 tokens, mean: 97.81 tokens, max: 512 tokens
  • Samples:
    • anchor: Instruct: Given a Turkish search query, retrieve relevant passages written in Turkish that best answer the query.
      Query: Bir balıkta kişi başına ne kadar balık kızartılır
      positive: Kişi başına yaklaşık 4 ons veya 113.4 gram çiğ balık kızartması gerekir. 0.25 pound'a eşdeğerdir.
    • anchor: Instruct: Given a Turkish search query, retrieve relevant passages written in Turkish that best answer the query.
      Query: İnsanın dış görünüşüne bakarak asla hüküm vermemeliyiz. Buna katılıyor musun, neden?
      positive: Dış görünüş tek başına bir şey ifade etmez ama fikir verir. Fakat dış görünüş çok yanıltıcı da olabilmektedir. Bu nedenle her zaman dış görünüşten ziyade insanların karakterlerine odaklanmak önemlidir.
    • anchor: Instruct: Given a Turkish search query, retrieve relevant passages written in Turkish that best answer the query.
      Query: Kürtçe nerede konuşulur
      positive: Kürtçe, Türkiye, Kürdistan, İran, Suriye, Irak, İran, Ermenistan, Gürcistan ve Azerbaycan'ı kapsayan ve Avrupa ve Amerika Birleşik Devletleri'ne yayılmış büyük bir diasporaya sahip büyük bir bölgede konuşulan yakın ilişkili dillerin sürekliliğinden oluşan bir makro dildir. Status.
  • Loss: CachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 50,
        "similarity_fct": "cos_sim"
    }
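
    For reference, this corresponds roughly to the following loss construction in Sentence Transformers; mini_batch_size is an assumption, since the card does not record it:

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
    from sentence_transformers.util import cos_sim

    model = SentenceTransformer("izdastolga/checkpoint-18-Aixr-Turkish-QA")
    loss = CachedMultipleNegativesRankingLoss(
        model,
        scale=50,                # from the parameters above
        similarity_fct=cos_sim,  # "cos_sim" from the parameters above
        mini_batch_size=32,      # assumption: not recorded on the card
    )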
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 144
  • per_device_eval_batch_size: 144
  • learning_rate: 1e-06
  • weight_decay: 0.01
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
  • tf32: True
  • dataloader_num_workers: 4
  • prompts: {'anchor': 'Instruct: Given a Turkish search query, retrieve relevant passages written in Turkish that best answer the query.\nQuery: '}
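
A sketch of how these map onto SentenceTransformerTrainingArguments; output_dir is illustrative:

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # illustrative; not recorded on the card
    per_device_train_batch_size=144,
    per_device_eval_batch_size=144,
    learning_rate=1e-6,
    weight_decay=0.01,
    num_train_epochs=1,
    warmup_ratio=0.1,
    fp16=True,
    tf32=True,
    dataloader_num_workers=4,
    prompts={
        "anchor": (
            "Instruct: Given a Turkish search query, retrieve relevant passages "
            "written in Turkish that best answer the query.\nQuery: "
        )
    },
)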

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 144
  • per_device_eval_batch_size: 144
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-06
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: {'anchor': 'Instruct: Given a Turkish search query, retrieve relevant passages written in Turkish that best answer the query.\nQuery: '}
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch     Step    Training Loss
0.0997    275     0.33
0.1995    550     0.0964
0.2992    825     0.0862
0.3990    1100    0.081
0.4987    1375    0.0764
0.5985    1650    0.0745
0.6982    1925    0.0726

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.49.0
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 2.21.0
  • Tokenizers: 0.21.0
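
To reproduce this environment, pinned installs along these lines should work (the matching PyTorch build depends on your CUDA setup):

pip install sentence-transformers==3.4.1 transformers==4.49.0 accelerate==1.3.0 datasets==2.21.0 tokenizers==0.21.0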

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}