Slow inference performance when using nomic-embed-text-v1.5

#34
by umesh-c - opened

Hello there,
After considering multiple aspects of this model, we decided to give it a shot over bge-large-en. The first observation is that it runs pretty slowly, even on GPU. My code looks like this:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from models.semantic_search_gen import SemanticSearchGen

class NomicEmbedText(SemanticSearchGen):
    """Implementation to use vector embedding model nomic-embed-text-v1.5"""
    def __init__(self):
        """
        Constructor to initialize the model
        """
        super().__init__("nomic-embed-text-v1.5")
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
        self.model = AutoModel.from_pretrained(self.model_path, trust_remote_code=True, safe_serialization=True)
        self.model.eval()
        self.model.max_seq_length = self.model_conf["token_limit"]
        self.token_limit = self.model.max_seq_length

    def mean_pooling(self, model_output, attention_mask):
        """
        Implementation to calculate embeddings after mean pooling
        :param model_output: Output of the transformer model; model_output[0] holds the token embeddings
        :param attention_mask: Attention mask from the tokenizer, used to exclude padding tokens from the mean
        :return: Vector embeddings after mean pooling
        """
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def gen_embeddings(self, text, prompt=False, text_type="search_document"):
        """
        Generate vector embeddings
        :param text: input text to generate the embeddings
        :param prompt: Not applicable here
        :param text_type: Default is "search_document" to generate the embeddings for a search document. If the embedding is being generated for a search query, then "search_query" needs to be passed here
        :return: Vector embeddings as a list of float values
        """

        if text_type == "search_document":
            instruction = "search_document: "
        else:
            instruction = "search_query: "

        encoded_input = self.tokenizer(instruction + text, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = self.model(**encoded_input)
        embeddings = self.mean_pooling(model_output, encoded_input['attention_mask'])
        embeddings = F.normalize(embeddings, p=2, dim=1)
        # Convert embeddings from Tensor to NumPy and return an array of floats
        embeddings = embeddings.numpy()[0]
        return embeddings

    def get_model(self):
        return self.model

Can the slowness be due to the pooling calculation, or due to not utilizing GPUs while calculating the embeddings?
Due to some conversion issue, I was not able to run it via sentence-transformers, as it was giving me a torch conversion related stack trace.

Thanks!

Hello!

I can think of two causes here:

  1. (Most likely) Nomic's tokenizer accepts much longer inputs than bge-large-en-v1.5's: 8192 tokens instead of 512. This means the model may have to process far more tokens, and encoder models slow down sharply as inputs get longer (attention cost grows quadratically with sequence length), so this is a very likely cause. If you don't want the increased inference times, you can set tokenizer.model_max_length = 512 and test whether performance improves.
  2. (Less likely) bge-large-en-v1.5 via SentenceTransformers automatically does batching (32 samples per inference by default, I believe). This is quite a bit faster than doing 32 inferences of 1 sample each. A sketch combining both suggestions follows this list.
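
Below is a minimal sketch of both suggestions with plain transformers, assuming the tokenizer, model, and mean_pooling logic from the code above are available (mean_pooling used here as a standalone function), and that documents is a placeholder list of input strings; this is not the author's exact code:

import torch
import torch.nn.functional as F

# Sketch only: cap the input length and batch the forward passes.
tokenizer.model_max_length = 512              # 1. truncate to 512 tokens like bge-large-en-v1.5
texts = ["search_document: " + t for t in documents]

batch_size = 32                               # 2. batch instead of one text per forward pass
all_embeddings = []
for i in range(0, len(texts), batch_size):
    encoded = tokenizer(texts[i:i + batch_size], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    pooled = mean_pooling(output, encoded["attention_mask"])   # same mean pooling as above
    all_embeddings.append(F.normalize(pooled, p=2, dim=1))

embeddings = torch.cat(all_embeddings)        # shape: (len(texts), hidden_size)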

Also,

Due to some conversion issue, I was not able to run it via sentence-transformers as it was giving me some torch conversion related stacktrace.

This is a shame :/ Could you post the stacktrace when you run:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten']
embeddings = model.encode(sentences)
print(embeddings)
  • Tom Aarsen

Hi Tom,
Thanks for pointing out the likely causes behind the slowness.
I am going to try tokenizer.model_max_length = 512 and will update you shortly.

Regarding the inability to use SentenceTransformer, here is the stack trace I get whenever I use SentenceTransformer for nomic:

/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- configuration_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- modeling_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
<All keys matched successfully>
Fatal Python error: Aborted

Thread 0x00000002a72ff000 (most recent call first):
  File "/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 331 in wait
  File "/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 629 in wait
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/opt/homebrew/Cellar/[email protected]/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1002 in _bootstrap

<Truncated some other threads to show the main culprit, which is below> 

Current thread 0x00000001f59e2500 (most recent call first):
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160 in convert
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 805 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1174 in to
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 215 in __init__
  File "/Users/umesh/git-repos/genai_search/models/impl/nomic_embed_text_v1.py", line 18 in __init__
  File "/Users/umesh/git-repos/genai_search/models/model_factory.py", line 28 in __getattr__
  File "/Users/umesh/git-repos/genai_search/streaming_processor.py", line 315 in apply_cleanup_and_store
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/util.py", line 81 in wrapper
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 1745 in processPartition
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 828 in func
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 820 in process
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 830 in main
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 74 in worker
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193 in manager
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 218 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main 
pip list | grep -ia transformer                                                                                                                  
sentence-transformers                   2.6.1
transformers                            4.37.0

Thanks!

Here is the torch version :

pip list | grep torch                                                                                                                                 
torch                                   2.4.0
torchvision                             0.19.0

Unfortunately setting tokenizer.model_max_length = 512 doesn't seem to give any performance boost in my case.

BTW, thanks for asking about the SentenceTransformer issue. I revisited it and was able to fix it by passing the device parameter. Strangely, the above-mentioned issue in the thread dump is gone when using the lines below; earlier I was not passing the device param:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model = SentenceTransformer(self.model_path, trust_remote_code=True, device=device)

Also, I must say SentenceTransformer is way faster than plain Transformers, though I still don't know why ;)
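
For reference, a hedged sketch of what a batched SentenceTransformers call could look like with the required task prefixes, reusing the model and device from the snippet above (texts is a placeholder list and batch_size=32 is just an example value):

docs = ["search_document: " + t for t in texts]
doc_embeddings = model.encode(docs, batch_size=32, normalize_embeddings=True)

query_embeddings = model.encode(["search_query: what is TSNE?"], normalize_embeddings=True)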

Thanks!

From the code above, it doesn't seem like you are putting either the model or the inputs on the GPU, which could explain the slowness of transformers. IIRC, Sentence Transformers handles a lot of this, especially when passing device. To verify this, you can check the output of nvidia-smi while the transformers code is running; it should show some GPU usage.
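
As a rough illustration (not the author's code), moving both the model and the tokenized inputs to the GPU in the plain-transformers path could look like this, reusing the tokenizer, model, and text from the gen_embeddings method above:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)                                              # move the weights once, e.g. in __init__

encoded_input = tokenizer("search_document: " + text, padding=True, truncation=True, return_tensors="pt")
encoded_input = {k: v.to(device) for k, v in encoded_input.items()}   # move the inputs too
with torch.no_grad():
    model_output = model(**encoded_input)
# ...mean pooling and normalization as before, then .cpu().numpy() before returning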
