SentenceTransformer based on sentence-transformers/all-mpnet-base-v2

This is a sentence-transformers model specifically trained for job title matching and similarity. It's finetuned from sentence-transformers/all-mpnet-base-v2 on a large dataset of job titles and their associated skills/requirements. The model maps job titles and descriptions to a 1024-dimensional dense vector space and can be used for semantic job title matching, job similarity search, and related HR/recruitment tasks.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-mpnet-base-v2
  • Maximum Sequence Length: 64 tokens
  • Output Dimensionality: 1024 tokens
  • Similarity Function: Cosine Similarity
  • Training Dataset: 5.5M+ job title - skills pairs
  • Primary Use Case: Job title matching and similarity
  • Performance: Achieves 0.6457 MAP on TalentCLEF benchmark

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 64, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Asym(
    (anchor-0): Dense({'in_features': 768, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
    (positive-0): Dense({'in_features': 768, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  )
)

Usage

Direct Usage (Sentence Transformers)

First install the required packages:

pip install -U sentence-transformers

Then you can load and use the model with the following code:

import torch
import numpy as np
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import batch_to_device, cos_sim

# Load the model
model = SentenceTransformer("TechWolf/JobBERT-v2")

def encode_batch(jobbert_model, texts):
    features = jobbert_model.tokenize(texts)
    features = batch_to_device(features, jobbert_model.device)
    features["text_keys"] = ["anchor"]
    with torch.no_grad():
        out_features = jobbert_model.forward(features)
    return out_features["sentence_embedding"].cpu().numpy()

def encode(jobbert_model, texts, batch_size: int = 8):
    # Sort texts by length and keep track of original indices
    sorted_indices = np.argsort([len(text) for text in texts])
    sorted_texts = [texts[i] for i in sorted_indices]
    
    embeddings = []
    
    # Encode in batches
    for i in tqdm(range(0, len(sorted_texts), batch_size)):
        batch = sorted_texts[i:i+batch_size]
        embeddings.append(encode_batch(jobbert_model, batch))
    
    # Concatenate embeddings and reorder to original indices
    sorted_embeddings = np.concatenate(embeddings)
    original_order = np.argsort(sorted_indices)
    return sorted_embeddings[original_order]

# Example usage
job_titles = [
    'Software Engineer',
    'Senior Software Developer',
    'Product Manager',
    'Data Scientist'
]

# Get embeddings
embeddings = encode(model, job_titles)

# Calculate cosine similarity matrix
similarities = cos_sim(embeddings, embeddings)
print(similarities)

The output will be a similarity matrix where each value represents the cosine similarity between two job titles:

tensor([[1.0000, 0.8723, 0.4821, 0.5447],
        [0.8723, 1.0000, 0.4822, 0.5019],
        [0.4821, 0.4822, 1.0000, 0.4328],
        [0.5447, 0.5019, 0.4328, 1.0000]])

In this example:

  • The diagonal values are 1.0000 (perfect similarity with itself)
  • 'Software Engineer' and 'Senior Software Developer' have high similarity (0.8723)
  • 'Product Manager' and 'Data Scientist' show lower similarity with other roles
  • All values range between 0 and 1, where higher values indicate greater similarity

Example Use Cases

  1. Job Title Matching: Find similar job titles for standardization or matching
  2. Job Search: Match job seekers with relevant positions based on title similarity
  3. HR Analytics: Analyze job title patterns and similarities across organizations
  4. Talent Management: Identify similar roles for career development and succession planning

Training Details

Training Dataset

generator

  • Dataset: 5.5M+ job title pairs
  • Format: Anchor job titles paired with related skills/requirements
  • Training objective: Learn semantic similarity between job titles and their associated skills
  • Loss: CachedMultipleNegativesRankingLoss with cosine similarity

Training Hyperparameters

  • Batch Size: 2048
  • Learning Rate: 5e-05
  • Epochs: 1
  • FP16 Training: Enabled
  • Optimizer: AdamW

Framework Versions

  • Python: 3.9.19
  • Sentence Transformers: 3.1.0
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu118
  • Accelerate: 0.34.2
  • Datasets: 3.0.0
  • Tokenizers: 0.19.1

Citation

BibTeX

JobBERT-v2 paper

Please cite this paper when using JobBERT-v2:

@article{01K47W55SG7ZRKFG431ESRXC35,
  abstract     = {{Labor market analysis relies on extracting insights from job advertisements, which provide valuable yet unstructured information on job titles and corresponding skill requirements. While state-of-the-art methods for skill extraction achieve strong performance, they depend on large language models (LLMs), which are computationally expensive and slow. In this paper, we propose ConTeXT-match, a novel contrastive learning approach with token-level attention that is well-suited for the extreme multi-label classification task of skill classification. ConTeXT-match significantly improves skill extraction efficiency and performance, achieving state-of-the-art results with a lightweight bi-encoder model. To support robust evaluation, we introduce Skill-XL a new benchmark with exhaustive, sentence-level skill annotations that explicitly address the redundancy in the large label space. Finally, we present JobBERT V2, an improved job title normalization model that leverages extracted skills to produce high-quality job title representations. Experiments demonstrate that our models are efficient, accurate, and scalable, making them ideal for large-scale, real-time labor market analysis.}},
  author       = {{Decorte, Jens-Joris and Van Hautte, Jeroen and Develder, Chris and Demeester, Thomas}},
  issn         = {{2169-3536}},
  journal      = {{IEEE ACCESS}},
  keywords     = {{Taxonomy,Contrastive learning,Training,Annotations,Benchmark testing,Training data,Large language models,Computational efficiency,Accuracy,Terminology,Labor market analysis,text encoders,skill extraction,job title normalization}},
  language     = {{eng}},
  pages        = {{133596--133608}},
  title        = {{Efficient text encoders for labor market analysis}},
  url          = {{http://doi.org/10.1109/ACCESS.2025.3589147}},
  volume       = {{13}},
  year         = {{2025}},
}

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
Downloads last month
358,195
Safetensors
Model size
0.1B params
Tensor type
F32
Β·
Inference Providers NEW

Model tree for TechWolf/JobBERT-v2

Finetuned
(305)
this model

Spaces using TechWolf/JobBERT-v2 2

Collection including TechWolf/JobBERT-v2