Commit 31a500a
Parent: f20d0dd
Update README.md (#6) (1661329696483c2d43fae5f241c862deaa8b9dc6)
Co-authored-by: namespace-Pt <[email protected]>
README.md CHANGED
@@ -1,10 +1,18 @@
----
-license: mit
----
-
-
 <h1 align="center">FlagEmbedding</h1>
-
 
 <h4 align="center">
 <p>
@@ -19,18 +27,18 @@ license: mit
 <p>
 </h4>
 
-More details please refer to our Github: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
-
 
 [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
 
-
-
 
 ************* 🌟**Updates**🌟 *************
-- 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Paper](https://arxiv.org/pdf/2310.07554.pdf)
 - 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
-- 09/15/2023: The [
 - 09/12/2023: New models:
 - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models.
 - **update embedding model**: release `bge-*-v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
@@ -72,29 +80,27 @@ And it also can be used in vector databases for LLMs.
 | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
 
 
-[1\]: If you need to search the relevant passages
 
-[2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
-For
 
 All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
-If you cannot open the Huggingface Hub, you also
 
 
 ## Frequently asked questions
 
-
-<summary>1. How to fine-tune bge embedding model?</summary>
 
-<!-- ### How to fine-tune bge embedding model? -->
 Following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
 Some suggestions:
 - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve the retrieval performance.
 - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity.
 - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker.
 
-
-</details>
 
 <details>
 <summary>2. The similarity score between two dissimilar sentences is higher than 0.5</summary>
@@ -134,7 +140,7 @@ In all cases, the documents/passages do not need to add the instruction.
 
 ### Usage for Embedding Model
 
-Here are some examples
 [FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).
 
 #### Using FlagEmbedding
@@ -366,11 +372,11 @@ See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for
 
 ### BAAI Embedding
 
-We pre-train the models using [retromae](https://github.com/staoxiao/RetroMAE) and train them on large-scale
 **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
 We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
 Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned.
-
 
 
 
@@ -381,8 +387,14 @@ which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming
 Therefore, it can be used to re-rank the top-k documents returned by embedding model.
 We train the cross-encoder on a multilingual pair data,
 The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
-
 
 
 ## Contact
 If you have any question or suggestion related to this project, feel free to open an issue or pull request.
@@ -402,6 +414,15 @@ If you find this repository useful, please consider giving a star :star: and citation
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
 ```
 
 ## License
 <h1 align="center">FlagEmbedding</h1>
+<p align="center">
+    <a href="https://github.com/FlagOpen/FlagEmbedding">
+        <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
+    </a>
+    <a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
+        <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
+    </a>
+    <a href="https://huggingface.co/C-MTEB">
+        <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
+    </a>
+    <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
+        <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.1-red">
+    </a>
+</p>
 
 <h4 align="center">
 <p>
 <p>
 </h4>
 
 
 [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
 
+<span style="#FF69B4;"> **Hiring:** We're seeking experienced NLP researchers and intern students focusing on dense retrieval and retrieval-augmented LLMs. If you're interested, please feel free to reach out to us via email at zhengliu1026@gmail.com.</span>
+
+FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, and semantic search.
+And it can also be used in vector databases for LLMs.
 
 ************* 🌟**Updates**🌟 *************
+- 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Paper](https://arxiv.org/pdf/2310.07554.pdf) :fire:
 - 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
+- 09/15/2023: The [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released
 - 09/12/2023: New models:
 - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models.
 - **update embedding model**: release `bge-*-v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
 | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
 
 
+[1\]: If you need to search the relevant passages in a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
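As a quick illustration of the rule in [1], the query-side instruction prefixing can be sketched in a few lines; the instruction string is the bge-zh prompt from the model table above, while `prepare_inputs` and the sample data are hypothetical helpers, not part of the FlagEmbedding API:

```python
# Prepend the retrieval instruction to queries only; passages are left untouched.
# Instruction string taken from the model table above; helper names are illustrative.
INSTRUCTION = "为这个句子生成表示以用于检索相关文章:"

def prepare_inputs(queries, passages, instruction=INSTRUCTION):
    """Apply the bge query-side instruction for retrieval tasks."""
    encoded_queries = [instruction + q for q in queries]
    encoded_passages = list(passages)  # no instruction on the passage side
    return encoded_queries, encoded_passages
```

For symmetric tasks (similarity comparison, paraphrase mining) the original queries would be passed through unmodified, matching the "no instruction is needed" case above.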
 
+[2\]: Different from the embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
+For example, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 documents to get the final top-3 results.
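The retrieve-then-rerank flow described in [2] can be sketched independently of the actual models; `embed_score` and `rerank_score` below are toy stand-ins for a bge embedding model and bge-reranker, not real library calls:

```python
def two_stage_search(query, corpus, embed_score, rerank_score,
                     recall_k=100, final_k=3):
    """Stage 1: cheap embedding scores select the top `recall_k` candidates.
    Stage 2: the slower cross-encoder re-ranks them down to `final_k`."""
    candidates = sorted(corpus, key=lambda d: embed_score(query, d),
                        reverse=True)[:recall_k]
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:final_k]

# Toy scorers standing in for the two models: token overlap for recall,
# weighted overlap for re-ranking.
def embed_score(q, d):
    return len(set(q.split()) & set(d.split()))

def rerank_score(q, d):
    return sum(2.0 for w in q.split() if w in d.split())
```

The design point is that the expensive scorer only ever sees `recall_k` documents, which is what makes cross-encoders affordable at query time.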
 
 All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
+If you cannot open the Huggingface Hub, you can also download the models at https://model.baai.ac.cn/models .
 
 
 ## Frequently asked questions
 
+**1. How to fine-tune bge embedding model?**
 
 Following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
 Some suggestions:
 - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve the retrieval performance.
+- In general, larger hyper-parameter `per_device_train_batch_size` brings better performance. You can expand it by enabling `--fp16`, `--deepspeed df_config.json` (df_config.json can refer to [ds_config.json](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_config.json)), `--gradient_checkpointing`, etc.
 - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity.
 - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker.
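Hard-negative mining, the first suggestion above, is commonly implemented as: rank the corpus with the current model, skip the very top of the ranking (which often contains unlabeled positives), and sample negatives from a mid-rank window. A library-independent sketch; the `score` callable and the window bounds are illustrative, and the linked example script remains the authoritative recipe:

```python
import random

def mine_hard_negatives(query, positives, corpus, score,
                        range_start=10, range_end=100, num_negs=7, seed=0):
    """Rank the corpus by the current model's score, skip the very top ranks,
    and sample negatives from a mid-rank window, excluding known positives."""
    ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
    pos = set(positives)
    window = [d for d in ranked[range_start:range_end] if d not in pos]
    rng = random.Random(seed)
    return rng.sample(window, min(num_negs, len(window)))
```

Negatives from this window are "hard" because the current model already scores them close to the positives, which gives the contrastive loss a stronger training signal than random negatives.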
 
+
 
 <details>
 <summary>2. The similarity score between two dissimilar sentences is higher than 0.5</summary>
 
 ### Usage for Embedding Model
 
+Here are some examples of using `bge` models with
 [FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).
 
 #### Using FlagEmbedding
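Whichever library is used, the scoring convention is the same: similarity is the inner product of L2-normalized embedding vectors, i.e. cosine similarity. A self-contained, library-agnostic sketch with made-up 4-dimensional vectors (real bge embeddings have hundreds of dimensions):

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm, as embedding models do before scoring."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity(u, v):
    """Inner product of normalized vectors == cosine similarity, in [-1, 1]."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

# Made-up embeddings: q and d1 point in similar directions, d2 does not.
q  = [0.9, 0.1, 0.0, 0.2]
d1 = [0.8, 0.2, 0.1, 0.1]
d2 = [0.0, 0.1, 0.9, 0.0]
```

Because scores live in [-1, 1], ranking by relative score matters more than any absolute threshold, which is also the point of FAQ 2 above.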
 
 
 ### BAAI Embedding
 
+We pre-train the models using [retromae](https://github.com/staoxiao/RetroMAE) and train them on large-scale pair data using contrastive learning.
 **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
 We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
 Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned.
+For more training details for bge see [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
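The contrastive learning step mentioned above is typically an InfoNCE-style objective with in-batch negatives: each query's paired passage is the positive, and every other passage in the batch serves as a negative. A pure-Python sketch; the temperature value and the toy vectors are illustrative, and bge's exact recipe is in the linked training README:

```python
import math

def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """In-batch contrastive loss: passage i is the positive for query i,
    and the remaining passages in the batch act as negatives."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        # Scaled inner-product logits against every passage in the batch.
        logits = [sum(a * b for a, b in zip(q, p)) / temperature
                  for p in passage_vecs]
        # Numerically stable log-sum-exp, then -log softmax of the positive.
        m = max(logits)
        log_denom = math.log(sum(math.exp(l - m) for l in logits)) + m
        total += log_denom - logits[i]
    return total / len(query_vecs)
```

The loss is driven down by pulling each query toward its own passage and pushing it away from the rest of the batch, which is why larger batch sizes (more in-batch negatives) tend to help, as noted in FAQ 1.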
 
 
 
 
 Therefore, it can be used to re-rank the top-k documents returned by embedding model.
 We train the cross-encoder on a multilingual pair data,
 The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
+For more details please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
+
 
+### Our Contributors:
+
+<a href="https://github.com/FlagOpen/FlagEmbedding/graphs/contributors">
+    <img src="https://contrib.rocks/image?repo=FlagOpen/FlagEmbedding" />
+</a>
 
 ## Contact
 If you have any question or suggestion related to this project, feel free to open an issue or pull request.
 
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
+
+@misc{llm_embedder,
+      title={Retrieve Anything To Augment Large Language Models},
+      author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
+      year={2023},
+      eprint={2310.07554},
+      archivePrefix={arXiv},
+      primaryClass={cs.IR}
+}
 ```
 
 ## License