---
base_model:
- Qwen/Qwen3-1.7B
datasets:
- codefuse-ai/F2LLM
language:
- en
license: apache-2.0
pipeline_tag: feature-extraction
library_name: transformers
---

# F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

This model is presented in the paper [F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data](https://huggingface.co/papers/2510.02294).

The code for this model is available on [GitHub](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).

F2LLMs (Foundation to Feature Large Language Models) are foundation models finetuned directly on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering tasks, curated solely from open-source datasets without any synthetic data. The models are trained in a single stage with homogeneous macro batches, without sophisticated multi-stage pipelines.
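
As a rough illustration of what training on query-document pairs involves, the sketch below shows a generic InfoNCE-style contrastive loss with in-batch negatives over L2-normalized embeddings. The function name and temperature value are illustrative assumptions, and this is not necessarily the exact F2LLM objective; the actual loss and batching logic are defined in the paper and the released training code.

```python
# Illustrative sketch only (not necessarily the exact F2LLM objective):
# an InfoNCE-style contrastive loss with in-batch negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # query_emb, doc_emb: (batch_size, hidden_dim), L2-normalized; row i of doc_emb is
    # the positive document for row i of query_emb, and all other rows act as negatives.
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)
```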

## Usage

To encode a batch of sentences:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_path = "codefuse-ai/F2LLM-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})

sentences = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.'
]

def encode(sentences):
    batch_size = len(sentences)
    # Append the EOS token; its hidden state is used as the sentence embedding.
    sentences = [s + tokenizer.eos_token for s in sentences]
    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt', add_special_tokens=False).to(model.device)
    last_hidden_state = model(**tokenized_inputs).last_hidden_state
    # Index of each sequence's final (EOS) token, assuming right padding.
    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
    # L2-normalize so that dot products equal cosine similarities.
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

embeddings = encode(sentences)
```
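
The returned embeddings are L2-normalized, so cosine similarity reduces to a dot product. A short follow-up example (not part of the original snippet) scoring the two sentences above:

```python
# Pairwise cosine similarities between the encoded sentences
# (the embeddings are already L2-normalized, so a dot product suffices).
similarity = embeddings @ embeddings.T
print(similarity[0, 1].item())
```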

## Evaluation

To evaluate F2LLMs on MTEB (this currently requires installing MTEB from source):

```python
import mteb
import logging
logging.basicConfig(level=logging.INFO)

# English MTEB tasks to evaluate on.
task_names = [
    'AmazonCounterfactualClassification', 'ArXivHierarchicalClusteringP2P', 'ArXivHierarchicalClusteringS2S',
    'ArguAna', 'AskUbuntuDupQuestions', 'BIOSSES', 'Banking77Classification', 'BiorxivClusteringP2P.v2',
    'CQADupstackGamingRetrieval', 'CQADupstackUnixRetrieval', 'ClimateFEVERHardNegatives', 'FEVERHardNegatives',
    'FiQA2018', 'HotpotQAHardNegatives', 'ImdbClassification', 'MTOPDomainClassification',
    'MassiveIntentClassification', 'MassiveScenarioClassification', 'MedrxivClusteringP2P.v2',
    'MedrxivClusteringS2S.v2', 'SCIDOCS', 'SICK-R', 'STS12', 'STS13', 'STS14', 'STS15', 'STS17', 'STS22.v2',
    'STSBenchmark', 'SprintDuplicateQuestions', 'StackExchangeClustering.v2', 'StackExchangeClusteringP2P.v2',
    'SummEvalSummarization.v2', 'TRECCOVID', 'Touche2020Retrieval.v3', 'ToxicConversationsClassification',
    'TweetSentimentExtractionClassification', 'TwentyNewsgroupsClustering.v2', 'TwitterSemEval2015',
    'TwitterURLCorpus', 'MindSmallReranking'
]

tasks = [
    mteb.get_task(task_name, languages=["eng"], eval_splits=["test"], exclusive_language_filter=True)
    for task_name in task_names
]

model = mteb.get_model("codefuse-ai/F2LLM-1.7B", device="cuda:0")
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, encode_kwargs={"batch_size": 16})
```
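
Capturing the return value of `evaluation.run` makes it easy to print per-task scores. The sketch below assumes a recent MTEB version in which the run returns a list of task results exposing `task_name` and per-split `scores` (field names may differ in older releases); MTEB also writes per-task JSON results to a local output folder by default.

```python
# Sketch: print the mean main score per task from the returned results.
results = evaluation.run(model, encode_kwargs={"batch_size": 16})
for res in results:
    test_scores = [s["main_score"] for s in res.scores.get("test", [])]
    if test_scores:
        print(f"{res.task_name}: {sum(test_scores) / len(test_scores):.4f}")
```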

## Training

Training code is available in our [GitHub repo](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).

## Citation

If you use the F2LLM models, data, or code, please cite the following technical report.

```
@article{2025F2LLM,
  title={F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data},
  author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  journal={CoRR},
  volume={abs/2510.02294},
  year={2025},
  url={https://doi.org/10.48550/arXiv.2510.02294},
  doi={10.48550/ARXIV.2510.02294},
  eprinttype={arXiv},
  eprint={2510.02294}
}
```