esci-nomic-embed-text-v1_5 / README.md

metadata

language: []
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dataset_size:100K<n<1M
  - loss:CachedMultipleNegativesRankingLoss
base_model: nomic-ai/nomic-embed-text-v1.5
metrics:
  - cosine_accuracy
  - dot_accuracy
  - manhattan_accuracy
  - euclidean_accuracy
  - max_accuracy
widget:
  - source_sentence: 'search_query: shark'
    sentences:
      - 'search_query: skull'
      - 'search_query: car picture frame'
      - 'search_query: cartera de guchi'
  - source_sentence: 'search_query: aolvo'
    sentences:
      - 'search_query: laço homem'
      - 'search_query: vdi to hdmi cable'
      - 'search_query: beads without holes'
  - source_sentence: 'search_query: 赤色のカバン'
    sentences:
      - 'search_query: 結婚式 ガーランド'
      - 'search_query: remaches zapatero'
      - 'search_query: small feaux potted plants'
  - source_sentence: 'search_query: vipkid'
    sentences:
      - 'search_query: ceiling lamps for kids'
      - 'search_query: apple あいふぉんケース 12'
      - 'search_query: zapatos zaragoza mujer'
  - source_sentence: 'search_query: お布団バッグ'
    sentences:
      - 'search_query: 足なしソファー'
      - 'search_query: all color handbag'
      - 'search_query: tundra black out emblems'
pipeline_tag: sentence-similarity
model-index:
  - name: SentenceTransformer based on nomic-ai/nomic-embed-text-v1.5
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: triplet esci
          type: triplet-esci
        metrics:
          - type: cosine_accuracy
            value: 0.787
            name: Cosine Accuracy
          - type: dot_accuracy
            value: 0.22
            name: Dot Accuracy
          - type: manhattan_accuracy
            value: 0.762
            name: Manhattan Accuracy
          - type: euclidean_accuracy
            value: 0.768
            name: Euclidean Accuracy
          - type: max_accuracy
            value: 0.787
            name: Max Accuracy

SentenceTransformer based on nomic-ai/nomic-embed-text-v1.5

This is a sentence-transformers model fine-tuned from nomic-ai/nomic-embed-text-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: nomic-ai/nomic-embed-text-v1.5
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
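
The Pooling module above enables only `pooling_mode_mean_tokens`, so a sentence embedding is the average of the token embeddings at non-padding positions. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the function name `mean_pool` and the toy inputs are hypothetical, and the real module operates on batched tensors.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [s / count for s in summed]


# Toy example: three token positions with 2-dimensional vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```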

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

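# Download from the 🤗 Hub (replace the placeholder with this model's repository ID).
# NomicBertModel is loaded from custom modeling code, so trust_remote_code=True is likely required.
model = SentenceTransformer("sentence_transformers_model_id", trust_remote_code=True)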
# Run inference
sentences = [
    'search_query: お布団バッグ',
    'search_query: 足なしソファー',
    'search_query: all color handbag',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
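
The widget examples and training samples in this card prefix queries with `search_query: ` and product texts with `search_document: `. A rough sketch of ranking documents for a query with those prefixes follows. The helper `rank_documents` and the stand-in encoder `toy_encode` are hypothetical names introduced here so the sketch runs without downloading a model; they are not part of the library.

```python
import math
from typing import Callable, List, Sequence, Tuple

QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "

# Any callable that maps a list of texts to a sequence of vectors (assumption of this sketch).
Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(encode: Encoder, query: str, documents: List[str]) -> List[Tuple[float, str]]:
    """Prefix the query and documents, embed them, and sort documents by cosine similarity."""
    texts = [QUERY_PREFIX + query] + [DOCUMENT_PREFIX + d for d in documents]
    vectors = encode(texts)
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_encode(texts: List[str]) -> List[List[float]]:
    """Stand-in encoder (letter counts), used only to make the sketch executable."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[float(t.lower().count(c)) for c in alphabet] for t in texts]


ranked = rank_documents(toy_encode, "red handbag", ["garden hose", "leather handbag, red"])
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

With the real model, `model.encode` would take the place of `toy_encode`.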

Evaluation

Metrics

Triplet

Dataset: triplet-esci
Evaluated with TripletEvaluator

Metric	Value
cosine_accuracy	0.787
dot_accuracy	0.22
manhattan_accuracy	0.762
euclidean_accuracy	0.768
max_accuracy	0.787
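
As a rough guide to reading these numbers: a triplet counts as correct when the anchor is closer to the positive than to the negative under the given measure. The sketch below follows the usual definitions (higher is better for cosine and dot product, lower is better for Manhattan and Euclidean distance) and is not taken from the `TripletEvaluator` source, whose conventions may differ between library versions. The function name `triplet_accuracies` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[Vector, Vector, Vector]  # (anchor, positive, negative)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def manhattan(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_accuracies(triplets: List[Triplet]) -> Dict[str, float]:
    """Fraction of triplets ranked correctly under each measure, plus the maximum."""
    hits = {"cosine": 0, "dot": 0, "manhattan": 0, "euclidean": 0}
    for anchor, positive, negative in triplets:
        # Similarities: the positive should score higher than the negative.
        hits["cosine"] += int(cosine(anchor, positive) > cosine(anchor, negative))
        hits["dot"] += int(dot(anchor, positive) > dot(anchor, negative))
        # Distances: the positive should be nearer than the negative.
        hits["manhattan"] += int(manhattan(anchor, positive) < manhattan(anchor, negative))
        hits["euclidean"] += int(euclidean(anchor, positive) < euclidean(anchor, negative))
    result = {f"{name}_accuracy": count / len(triplets) for name, count in hits.items()}
    result["max_accuracy"] = max(result.values())
    return result


# Toy triplets with unnormalized vectors: the long negative in the first triplet
# wins on dot product even though its direction is further from the anchor.
toy_triplets = [
    ([1.0, 0.0], [1.0, 0.1], [5.0, 5.0]),
    ([0.0, 1.0], [0.0, 2.0], [1.0, 0.0]),
]
scores = triplet_accuracies(toy_triplets)
print(scores)
```

In this toy data the dot-product accuracy comes out lower than the cosine accuracy, which shows how the two can diverge when vectors are not length-normalized.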

Training Details

Training Dataset

Unnamed Dataset

Size: 100,000 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 7 tokens, mean: 12.11 tokens, max: 47 tokens	min: 17 tokens, mean: 49.91 tokens, max: 166 tokens	min: 20 tokens, mean: 50.64 tokens, max: 152 tokens

Samples:

anchor	positive	negative
`search_query: blー5c`	`search_document: [EnergyPower] TECSUN PL-368 電池2個セット SSB・同期検波・長波 [交換用バッテリーBL-5C付] デジタルDSPポケット短波ラジオ超小型長・中波用外付アンテナ 10キーポータブルBCL受信機 FMステレオ/LW/MW/SW ワールドバンドレシーバー 850局プリセットメモリーシグナルメーター USB充電スリープタイマーアラー, TECSUN, PL-368 電池＋セット [ブラック]`	`search_document: RADIWOWで作る SIHUADON R108 ポータブル BCL短波ラジオAM FM LW SW 航空無線 DSPレシーバー LCD 良好屋内および屋外アクティビティの両親への贈り物, RADIWOW, グレー`
`search_query: かわいいロングtシャツ`	`search_document: レディースロンt 半袖 tシャツオーバーサイズコットンスリット大きいサイズ白シャツビッグシルエットワンピースシャツワンピロングｔシャツおおきいサイズ夏ピンクカジュアルカップ付きカーディガンキラキラキャミソールキャミサテンシンプルシニアシフォンシースルーシ, Sleeping Sheep(スリーピングシープ), ホワイト`	`search_document: Perkisboby スポーツウェアレディースヨガウェア 4点セット上下セット 5点セットウェアフィットネス 2点セットジャージスポーツブラパンツパーカー半袖ハーフパンツ, Perkisboby, 2点セット-グレー`
`search_query: iphone xr otterbox symmetry case`	`search_document: Symmetry Clear Series Case for iPhone XR (ONLY) Symmetry Case for iPhone XR Symmetry Case - Clear, VTSOU, Clear`	`search_document: OtterBox Symmetry Series Case for Apple iPhone XS Max - Tonic Violet / Purple, OtterBox, Tonic Violet / Purple`

Loss: CachedMultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
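
To make the two parameters concrete: with `similarity_fct` set to cosine similarity and `scale` set to 20.0, each anchor is scored against every positive and negative in the batch, the scaled scores are treated as logits, and cross-entropy pushes the anchor's own positive to the top. The sketch below is a plain-Python illustration of that objective under those assumptions. It does not implement the gradient caching that the "Cached" variant adds to reduce memory use, and the function name `ranking_loss` is hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def ranking_loss(
    anchors: List[Vector],
    positives: List[Vector],
    negatives: List[Vector],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    # Candidates: every positive in the batch, followed by every explicit negative.
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, candidate) for candidate in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching positive (index i).
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch: each anchor matches its positive exactly and opposes its negative.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
good = ranking_loss(anchors, positives, negatives)
bad = ranking_loss(anchors, positives[::-1], negatives)  # positives swapped
print(good < bad)  # True
```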

Evaluation Dataset

Unnamed Dataset

Size: 1,000 evaluation samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 7 tokens, mean: 12.13 tokens, max: 49 tokens	min: 15 tokens, mean: 50.76 tokens, max: 173 tokens	min: 18 tokens, mean: 54.25 tokens, max: 161 tokens

Samples:

anchor	positive	negative
`search_query: snack vending machine`	`search_document: Red All Metal Triple Compartment Commercial Vending Machine for 1 inch Gumballs, 1 inch Toy Capsules, Bouncy Balls, Candy, Nuts with Stand by American Gumball Company, American Gumball Company, CANDY RED`	`search_document: Vending Machine Halloween Costume - Funny Snack Food Adult Men & Women Outfits, Hauntlook, Multicolored`
`search_query: slim credit card holder without id window`	`search_document: Banuce Top Grain Leather Card Holder for Women Men Unisex ID Credit Card Case Slim Card Wallet Black, Banuce, 1 ID + 5 Card Slots: Black`	`search_document: Mens Wallet RFID Genuine Leather Bifold Wallets For Men, ID Window 16 Card Holders Gift Box, Swallowmall, Black Stripe`
`search_query: gucci belts for women`	`search_document: Gucci Women's Gg0027o 50Mm Optical Glasses, Gucci, Havana`	`search_document: Gucci G-Gucci Gold PVD Women's Watch(Model:YA125511), Gucci, PVD/Brown`

Loss: CachedMultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 4
per_device_eval_batch_size: 4
gradient_accumulation_steps: 2
learning_rate: 1e-06
lr_scheduler_type: cosine
warmup_ratio: 0.1
dataloader_drop_last: True
dataloader_num_workers: 4
dataloader_prefetch_factor: 2
load_best_model_at_end: True
batch_sampler: no_duplicates
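
These settings imply a small effective batch and a long schedule. The arithmetic below is a sketch under stated assumptions: a single device (consistent with the training log, which reaches epoch 1.0 at step 12500), the 100,000 training samples listed above, the `num_train_epochs: 3` value from the full hyperparameter list, and a linear warmup followed by a half-cosine decay. The function name `learning_rate` is hypothetical, and the exact step count in a real run can differ slightly because of `dataloader_drop_last` and the `no_duplicates` sampler.

```python
import math

TRAIN_SAMPLES = 100_000
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 2
EPOCHS = 3
BASE_LR = 1e-6
WARMUP_RATIO = 0.1
NUM_DEVICES = 1  # assumption: single-device training

# One optimizer step consumes batch_size * accumulation_steps * devices samples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = TRAIN_SAMPLES // effective_batch
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)  # 8 12500 37500 3750
for step in (0, 1875, 3750, 15000, 37500):
    print(step, learning_rate(step))
```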

All Hyperparameters

overwrite_output_dir: False
do_predict: False
prediction_loss_only: True
per_device_train_batch_size: 4
per_device_eval_batch_size: 4
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 2
eval_accumulation_steps: None
learning_rate: 1e-06
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: cosine
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: True
dataloader_num_workers: 4
dataloader_prefetch_factor: 2
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	triplet-esci_cosine_accuracy
0.008	100	0.7191	-	-
0.016	200	0.6917	-	-
0.024	300	0.7129	-	-
0.032	400	0.6826	-	-
0.04	500	0.7317	-	-
0.048	600	0.7237	-	-
0.056	700	0.6904	-	-
0.064	800	0.6815	-	-
0.072	900	0.6428	-	-
0.08	1000	0.6561	0.6741	0.74
0.088	1100	0.6097	-	-
0.096	1200	0.6426	-	-
0.104	1300	0.618	-	-
0.112	1400	0.6346	-	-
0.12	1500	0.611	-	-
0.128	1600	0.6092	-	-
0.136	1700	0.6512	-	-
0.144	1800	0.646	-	-
0.152	1900	0.6584	-	-
0.16	2000	0.6403	0.6411	0.747
0.168	2100	0.5882	-	-
0.176	2200	0.6361	-	-
0.184	2300	0.5641	-	-
0.192	2400	0.5734	-	-
0.2	2500	0.6156	-	-
0.208	2600	0.6252	-	-
0.216	2700	0.634	-	-
0.224	2800	0.5743	-	-
0.232	2900	0.5222	-	-
0.24	3000	0.5604	0.6180	0.765
0.248	3100	0.5864	-	-
0.256	3200	0.5541	-	-
0.264	3300	0.5661	-	-
0.272	3400	0.5493	-	-
0.28	3500	0.556	-	-
0.288	3600	0.56	-	-
0.296	3700	0.5552	-	-
0.304	3800	0.5833	-	-
0.312	3900	0.5578	-	-
0.32	4000	0.5495	0.6009	0.769
0.328	4100	0.5245	-	-
0.336	4200	0.477	-	-
0.344	4300	0.5536	-	-
0.352	4400	0.5493	-	-
0.36	4500	0.532	-	-
0.368	4600	0.5341	-	-
0.376	4700	0.528	-	-
0.384	4800	0.5574	-	-
0.392	4900	0.4953	-	-
0.4	5000	0.5365	0.5969	0.779
0.408	5100	0.4835	-	-
0.416	5200	0.4573	-	-
0.424	5300	0.5554	-	-
0.432	5400	0.5623	-	-
0.44	5500	0.5955	-	-
0.448	5600	0.5086	-	-
0.456	5700	0.5081	-	-
0.464	5800	0.4829	-	-
0.472	5900	0.5066	-	-
0.48	6000	0.4997	0.5920	0.776
0.488	6100	0.5075	-	-
0.496	6200	0.5051	-	-
0.504	6300	0.5019	-	-
0.512	6400	0.4774	-	-
0.52	6500	0.4975	-	-
0.528	6600	0.4756	-	-
0.536	6700	0.4656	-	-
0.544	6800	0.4671	-	-
0.552	6900	0.4646	-	-
0.56	7000	0.5595	0.5853	0.777
0.568	7100	0.4812	-	-
0.576	7200	0.506	-	-
0.584	7300	0.49	-	-
0.592	7400	0.464	-	-
0.6	7500	0.441	-	-
0.608	7600	0.4492	-	-
0.616	7700	0.457	-	-
0.624	7800	0.493	-	-
0.632	7900	0.4174	-	-
0.64	8000	0.4686	0.5809	0.785
0.648	8100	0.4529	-	-
0.656	8200	0.4784	-	-
0.664	8300	0.4697	-	-
0.672	8400	0.4489	-	-
0.68	8500	0.4439	-	-
0.688	8600	0.4063	-	-
0.696	8700	0.4634	-	-
0.704	8800	0.4446	-	-
0.712	8900	0.4725	-	-
0.72	9000	0.3954	0.5769	0.781
0.728	9100	0.4536	-	-
0.736	9200	0.4583	-	-
0.744	9300	0.4415	-	-
0.752	9400	0.4716	-	-
0.76	9500	0.4393	-	-
0.768	9600	0.4332	-	-
0.776	9700	0.4236	-	-
0.784	9800	0.4021	-	-
0.792	9900	0.4324	-	-
0.8	10000	0.4197	0.5796	0.78
0.808	10100	0.4576	-	-
0.816	10200	0.4238	-	-
0.824	10300	0.4468	-	-
0.832	10400	0.4301	-	-
0.84	10500	0.414	-	-
0.848	10600	0.4563	-	-
0.856	10700	0.4212	-	-
0.864	10800	0.3905	-	-
0.872	10900	0.4384	-	-
0.88	11000	0.3474	0.5709	0.788
0.888	11100	0.4396	-	-
0.896	11200	0.3819	-	-
0.904	11300	0.3748	-	-
0.912	11400	0.4217	-	-
0.92	11500	0.3893	-	-
0.928	11600	0.3835	-	-
0.936	11700	0.4303	-	-
0.944	11800	0.4274	-	-
0.952	11900	0.4089	-	-
0.96	12000	0.4009	0.5710	0.786
0.968	12100	0.3832	-	-
0.976	12200	0.3543	-	-
0.984	12300	0.4866	-	-
0.992	12400	0.4531	-	-
1.0	12500	0.3728	-	-
1.008	12600	0.386	-	-
1.016	12700	0.3622	-	-
1.024	12800	0.4013	-	-
1.032	12900	0.3543	-	-
1.04	13000	0.3918	0.5712	0.792
1.048	13100	0.3961	-	-
1.056	13200	0.3804	-	-
1.064	13300	0.4049	-	-
1.072	13400	0.3374	-	-
1.08	13500	0.3746	-	-
1.088	13600	0.3162	-	-
1.096	13700	0.3536	-	-
1.104	13800	0.3101	-	-
1.112	13900	0.3704	-	-
1.12	14000	0.3412	0.5758	0.788
1.128	14100	0.342	-	-
1.136	14200	0.383	-	-
1.144	14300	0.3554	-	-
1.152	14400	0.4013	-	-
1.16	14500	0.3486	-	-
1.168	14600	0.3367	-	-
1.176	14700	0.3737	-	-
1.184	14800	0.319	-	-
1.192	14900	0.3211	-	-
1.2	15000	0.3284	0.5804	0.787

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.0.0
Transformers: 4.38.2
PyTorch: 2.1.2+cu121
Accelerate: 0.27.2
Datasets: 2.19.1
Tokenizers: 0.15.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup}, 
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}