HippolyteP nielsr HF Staff committed on
Commit d430018 · verified · 1 Parent(s): d772c4d

Add pipeline_tag: feature-extraction, Code link, and Usage section (#1)


- Add pipeline_tag: feature-extraction, Code link, and Usage section (cf1733d2958d9e35886865da9ef712853bcc4302)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +19 -4
README.md CHANGED
@@ -1,21 +1,23 @@
  ---
- license: cc-by-4.0
  language:
  - en
+ license: cc-by-4.0
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
+ pipeline_tag: feature-extraction
  ---

- # ARC-Encoder models
+ # ARC-Encoder models

- This page houses `ARC8-Encoder_multi`, one of three released versions of pretrained ARC-Encoders. The architectures and training methods are described in the paper *ARC-Encoder: learning compressed text representations for large language models*, available [here](https://arxiv.org/abs/2510.20535). Code to reproduce the pretraining, further fine-tune the encoders, or evaluate them on downstream tasks is available in the [ARC-Encoder repository](https://github.com/kyutai-labs/ARC-Encoder/tree/main).
+ This page houses `ARC8-Encoder_multi`, one of three released versions of pretrained ARC-Encoders. The architectures and training methods are described in the paper *ARC-Encoder: learning compressed text representations for large language models*, available [here](https://arxiv.org/abs/2510.20535).
+ Code: [ARC-Encoder repository](https://github.com/kyutai-labs/ARC-Encoder)

  ## Model Details

  All the encoders released here are trained on web crawl data filtered using [Dactory](https://github.com/kyutai-labs/dactory), based on a [Llama3.2-3B](https://github.com/meta-llama/llama-cookbook) base backbone. The release consists of two ARC-Encoders, each trained specifically for a single decoder, and one trained for both decoders at the same time:
  - `ARC8-Encoder_Llama`, trained on 2.6B tokens specifically for the [Llama3.1-8B](https://github.com/meta-llama/llama-cookbook) base model, with a pooling factor of 8.
- - `ARC8-Encoder_Mistral`, trained on 2.6B tokens specifically for the [Mistral-7B](https://github.com/mistralai/mistral-finetune?tab=readme-ov-file) base model, with a pooling factor of 8.
+ - `ARC8-Encoder_Mistral`, trained on 2.6B tokens specifically for the [Mistral-7B](https://www.mistralai.com/news/announcing-mistral-7b/) base model, with a pooling factor of 8.
  - `ARC8-Encoder_multi`, trained by sampling between the two decoders, with a pooling factor of 8.
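As a purely schematic aside (not part of the diff, and not the actual ARC-Encoder architecture, which is a learned encoder): a pooling factor of 8 means a context of N tokens is condensed into roughly N / 8 continuous vectors for the decoder to consume. A minimal sketch of that size reduction, using plain mean-pooling as a stand-in:

```python
import torch

# Toy illustration of the 8x sequence-length compression implied by a pooling
# factor of 8. Mean-pooling over groups of 8 tokens is only a stand-in here;
# the released ARC-Encoders use a trained encoder, not this fixed pooling.
pooling_factor = 8
n_tokens, hidden_dim = 1024, 3072  # example sizes, chosen for illustration only

token_embeddings = torch.randn(n_tokens, hidden_dim)
compressed = token_embeddings.reshape(
    n_tokens // pooling_factor, pooling_factor, hidden_dim
).mean(dim=1)

print(tuple(token_embeddings.shape), "->", tuple(compressed.shape))  # (1024, 3072) -> (128, 3072)
```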

  ### Uses
@@ -31,6 +33,19 @@ To reproduce the results presented in the paper, you can use our released fine-t

  Terms of use: As the released models are pretrained from the Llama3.2-3B backbone, ARC-Encoders are subject to the Llama Terms of Use found at the [Llama license](https://www.llama.com/license/).

+ ## Usage
+
+ To load the pre-trained ARC-Encoders, use the following code snippet from the [ARC-Encoder repository](https://github.com/kyutai-labs/ARC-Encoder):
+
+ ```python
+ from embed_llm.models.augmented_model import load_and_save_released_models
+
+ # Model name: "ARC8_Encoder_multi", "ARC8_Encoder_Llama" or "ARC8_Encoder_Mistral";
+ # replace <HF_TOKEN> with your Hugging Face access token.
+ load_and_save_released_models("ARC8_Encoder_Llama", hf_token="<HF_TOKEN>")
+ ```
+
+ ***Remark:*** This code snippet loads the model from Hugging Face and creates the appropriate folders at `<TMP_PATH>`, containing the checkpoint and the additional files needed for fine-tuning or evaluation with the `ARC-Encoder` codebase. To save disk space, you can then delete the model from your Hugging Face cache.
+
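A side note on the cache cleanup mentioned in the remark above (not part of the commit): one way to do it programmatically is with `huggingface_hub`'s cache utilities, as sketched below. The repo id `kyutai/ARC8-Encoder_multi` is an assumption for illustration, so substitute whichever repository the checkpoint was actually downloaded from, and only run this once `load_and_save_released_models` has finished writing the converted files under `<TMP_PATH>`. The interactive `huggingface-cli delete-cache` command is an alternative.

```python
from huggingface_hub import scan_cache_dir

# Assumed repo id, for illustration only; replace with the repository you pulled from.
REPO_ID = "kyutai/ARC8-Encoder_multi"

cache_info = scan_cache_dir()

# Collect every cached revision of that repository.
revisions = [
    revision.commit_hash
    for repo in cache_info.repos
    if repo.repo_id == REPO_ID
    for revision in repo.revisions
]

if revisions:
    strategy = cache_info.delete_revisions(*revisions)
    print(f"Freeing {strategy.expected_freed_size_str} from the Hugging Face cache")
    strategy.execute()
```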
  ## Citations

  If you use one of these models, please cite: