SentenceTransformer based on nomic-ai/nomic-embed-text-v1.5

This is a sentence-transformers model finetuned from nomic-ai/nomic-embed-text-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
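As a quick illustration of the clustering use case, the embeddings can be passed straight to any standard vector-based tool. A minimal sketch, assuming scikit-learn is installed; the example sentences are hypothetical, not from the training data:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("JahnaviKumar/nomic-embed-text1.5-ftcode")
sentences = [
    "SELECT COUNT(*) FROM farms;",
    "SELECT SUM(co2_emission) FROM co2_emission;",
    "def parse_json(line): return json.loads(line)",
    "def read_file(path): return open(path).read()",
]
# Embed, then cluster the 768-dimensional vectors into two groups (SQL vs. Python)
embeddings = model.encode(sentences)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]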

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: nomic-ai/nomic-embed-text-v1.5
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'NomicBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
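The Pooling module above produces one 768-dimensional vector per input by mean-pooling the token embeddings over the attention mask. A minimal PyTorch sketch of that operation (illustrative only, not the library's internal implementation):

import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # sum embeddings of real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens per input
    return summed / counts                         # (batch, 768)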

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
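# Note: NomicBERT is a custom architecture; if the load below fails,
# passing trust_remote_code=True to SentenceTransformer may be required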
model = SentenceTransformer("JahnaviKumar/nomic-embed-text1.5-ftcode")
# Run inference
queries = [
    "What is the total CO2 emission from all aquaculture farms in the year 2021?",
]
documents = [
    'SELECT SUM(co2_emission) FROM co2_emission WHERE year = 2021;',
    '\n\treturn c.postJSON("joberror", args)\n}',
    ' && value.size == value.uniq.size\n      else\n        result\n      end\n    end',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 768] [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.7075, 0.3913, 0.3213]])
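For retrieval over a larger corpus, the library also ships a util.semantic_search helper that returns the top-k most similar documents per query. A minimal sketch reusing the embeddings computed above:

from sentence_transformers import util

# Rank the three documents for the single query and keep the best two
hits = util.semantic_search(query_embeddings, document_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.4f}", documents[hit["corpus_id"]][:60])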

Training Details

Training Dataset

Unnamed Dataset

  • Size: 100 training samples
  • Columns: query and corpus
  • Approximate statistics based on the first 100 samples:

                 query                 corpus
    type         string                string
    details      min: 6 tokens         min: 6 tokens
                 mean: 138.88 tokens   mean: 95.76 tokens
                 max: 1004 tokens      max: 1151 tokens
  • Samples:

    def add_data_file(data_files, target, source):
        """Add an entry to data_files"""
        for t, f in data_files:
            if t == target:
                break
        else:
            data_files.append((target, []))
            f = data_files[-1][1]
        if source not in f:
            f.append(source)

    function verify (token, options) {
      options = options || {}
      options.issuer = options.issuer || this.issuer
      options.client_id = options.client_id || this.client_id
      options.client_secret = options.client_secret || this.client_secret
      options.scope = options.scope || this.scope
      options.key = options.key || this.jwks.sig

      return new Promise(function (resolve, reject) {
        AccessToken.verify(token, options, function (err, claims) {
          if (err) { return reject(err) }
          resolve(claims)
        })
      })
    }

    Verifies a given OIDC token
    @method verify
    @param token {String} JWT AccessToken for OpenID Connect (base64 encoded)
    @param [options={}] {Object} Options hashmap
    @param [options.issuer] {String} OIDC Provider/Issuer URL
    @param [options.key] {Object} Issuer's public key for signatures (jwks.sig)
    @param [options.client_id] {String}
    @param [options.client_secret] {String}
    @param [options.scope] {String}
    @throws {UnauthorizedError} HTTP 401 or 403 errors (invalid tokens etc)
    @return {Promise}

    def _combine_lines(self, lines):
        """
        Combines a list of JSON objects into one JSON object.
        """
        lines = filter(None, map(lambda x: x.strip(), lines))
        return '[' + ','.join(lines) + ']'
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
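
Because the model was trained with MatryoshkaLoss at the dimensions listed above, its embeddings can be truncated to any of those sizes with a modest quality trade-off. A minimal sketch using the truncate_dim argument of SentenceTransformer:

from sentence_transformers import SentenceTransformer

# Load the model so that encode() returns 256-dimensional embeddings
model_256 = SentenceTransformer("JahnaviKumar/nomic-embed-text1.5-ftcode", truncate_dim=256)
embeddings = model_256.encode(["SELECT SUM(co2_emission) FROM co2_emission WHERE year = 2021;"])
print(embeddings.shape)  # (1, 256)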
    

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.1.1
  • Transformers: 4.54.1
  • PyTorch: 2.9.0+cu128
  • Accelerate: 1.10.1
  • Datasets: 4.2.0
  • Tokenizers: 0.21.4

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}