QRWKV in, Qwerky out
#4
by wxgeorge - opened

README.md CHANGED
@@ -7,13 +7,13 @@ library_name: transformers
-- Try out the model on [](https://substack.recursal.ai/p/qwerky-72b-and-32b-training-large)
 - This model was presented in [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).
-Benchmarks is as follows for both …
-| Tasks | Metric | …
 |:---:|:---:|:---:|:---:|:---:|:---:|
 | arc_challenge | acc_norm | **0.5640** | 0.5563 | **0.6382** | 0.6323 |
 | arc_easy | acc_norm | 0.7837 | **0.7866** | **0.8443** | 0.8329 |
@@ -33,7 +33,7 @@ Since this model is not on transformers at the moment you will have to enable re
 ```py
 # ...
-model = AutoModelForCausalLM.from_pretrained("featherless-ai/…
 # ...
 ```
@@ -43,7 +43,7 @@ Other than enabling remote code, you may run the model like a regular model with
 ```py
 from transformers import AutoModelForCausalLM, AutoTokenizer
-model_name = "featherless-ai/…
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
@@ -79,7 +79,7 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 Linear models offer a promising approach to significantly reduce computational costs at scale, particularly for large context lengths, enabling a >1000x improvement in inference cost, o1-style inference-time thinking, and wider AI accessibility.
-As demonstrated with our …
 As with our previous models, the model's inherent knowledge and dataset training are inherited from its "parent" model. Consequently, unlike previous RWKV models trained on over 100 languages, the QRWKV model is limited to the approximately 30 languages supported by the Qwen line of models.
@@ -7,13 +7,13 @@ library_name: transformers
+- Try out the model on [](https://featherless.ai/models/featherless-ai/QRWKV-72B)
+- Model details from our blog post here! [](https://substack.recursal.ai/p/qwerky-72b-and-32b-training-large)
 - This model was presented in [RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale](https://huggingface.co/papers/2505.03005).
+Benchmarks are as follows for both the QRWKV-QwQ-32B and QRWKV-72B models:
+| Tasks | Metric | QRWKV-QwQ-32B | Qwen/QwQ-32B | QRWKV-72B | Qwen2.5-72B-Instruct |
 |:---:|:---:|:---:|:---:|:---:|:---:|
 | arc_challenge | acc_norm | **0.5640** | 0.5563 | **0.6382** | 0.6323 |
 | arc_easy | acc_norm | 0.7837 | **0.7866** | **0.8443** | 0.8329 |
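The point of the new table header is that the converted models track their Qwen parents closely. As a quick sanity check, the gaps implied by the two rows visible in this hunk (acc_norm values copied directly from the table) can be computed:

```python
# acc_norm values copied from the two benchmark rows above.
scores = {
    "arc_challenge": {
        "QRWKV-QwQ-32B": 0.5640, "Qwen/QwQ-32B": 0.5563,
        "QRWKV-72B": 0.6382, "Qwen2.5-72B-Instruct": 0.6323,
    },
    "arc_easy": {
        "QRWKV-QwQ-32B": 0.7837, "Qwen/QwQ-32B": 0.7866,
        "QRWKV-72B": 0.8443, "Qwen2.5-72B-Instruct": 0.8329,
    },
}

# Gap between each converted model and its parent (positive = converted ahead).
deltas = {
    task: {
        "32B": s["QRWKV-QwQ-32B"] - s["Qwen/QwQ-32B"],
        "72B": s["QRWKV-72B"] - s["Qwen2.5-72B-Instruct"],
    }
    for task, s in scores.items()
}

for task, d in deltas.items():
    print(f"{task}: 32B {d['32B']:+.4f}, 72B {d['72B']:+.4f}")
```

On these two tasks the conversion lands within about a point of the parent model in either direction.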
@@ -33,7 +33,7 @@ Since this model is not on transformers at the moment you will have to enable re
 ```py
 # ...
+model = AutoModelForCausalLM.from_pretrained("featherless-ai/QRWKV-72B", trust_remote_code=True)
 # ...
 ```
@@ -43,7 +43,7 @@ Other than enabling remote code, you may run the model like a regular model with
 ```py
 from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "featherless-ai/QRWKV-72B"
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
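The hunk above shows only the head of the README's usage snippet. For reviewers trying the renamed repo locally, a self-contained sketch of the same pattern might look like the following; this is an illustration, not the README's exact code, and the dtype/device settings and the chat-template call are assumptions based on standard transformers usage for Qwen-lineage models:

```python
MODEL_NAME = "featherless-ai/QRWKV-72B"

def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    # Imports kept inside the function so the sketch can be read without
    # transformers installed; in a real script put them at module top.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # trust_remote_code=True is required: the RWKV-variant architecture
    # ships as custom modeling code in the model repo, not in transformers.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype="auto",   # assumption: use the checkpoint's dtype
        device_map="auto",    # assumption: shard across available devices
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # Qwen-lineage models expect a chat template rather than raw text.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    generated = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens before decoding, as the README snippet does.
    new_tokens = generated[0][inputs.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Note that the 72B checkpoint needs substantial GPU memory to actually run; the point here is only the shape of the call sequence.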
@@ -79,7 +79,7 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 Linear models offer a promising approach to significantly reduce computational costs at scale, particularly for large context lengths, enabling a >1000x improvement in inference cost, o1-style inference-time thinking, and wider AI accessibility.
+As demonstrated with our QRWKV-72B-Preview and prior models such as QRWKV6-32B Instruct Preview, we have successfully converted Qwen 2.5 72B into an RWKV variant without pretraining the base model or retraining from scratch, letting us test and validate the more efficient RWKV linear attention on a much smaller budget. Since our preview, we have continued to refine our technique and have improved the model over the preview iteration.
 As with our previous models, the model's inherent knowledge and dataset training are inherited from its "parent" model. Consequently, unlike previous RWKV models trained on over 100 languages, the QRWKV model is limited to the approximately 30 languages supported by the Qwen line of models.
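The ">1000x" figure in the closing paragraph follows from how the two attention families scale with context length T: softmax attention does O(T²) pairwise work at prefill and carries an O(T) KV cache at decode, while an RWKV-style linear-attention layer does O(T) work with a fixed-size recurrent state. A toy operation count (illustrative scaling only, not a measurement of either model):

```python
def softmax_attn_ops(T: int) -> int:
    # Every token attends to every token: ~T^2 score computations.
    return T * T

def linear_attn_ops(T: int) -> int:
    # Fixed-size state updated once per token: ~T state updates.
    return T

for T in (1_024, 32_768, 1_048_576):
    ratio = softmax_attn_ops(T) // linear_attn_ops(T)
    print(f"T={T:>9,}: quadratic/linear ratio = {ratio:,}x")
```

The ratio is simply T, so beyond a few thousand tokens of context the per-layer attention cost gap passes the 1000x the paragraph cites; real end-to-end speedups also depend on the non-attention FLOPs, which scale linearly in both architectures.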
