🪿 RWKV7 — RWKV7 models
- fla-hub/rwkv7-7.2B-g0a • Text Generation • 7B • Updated Aug 30
- fla-hub/rwkv7-7.2B-g0 • Text Generation • 7B • Updated Aug 6
- fla-hub/rwkv7-2.9B-g1 • Text Generation • 3B • Updated Aug 6
- fla-hub/rwkv7-2.9B-world • Text Generation • 3B • Updated May 7
GSA
- fla-hub/gsa-1.3B-100B • Text Generation • 1B • Updated Feb 9
- fla-hub/gsa-2.7B-100B • Text Generation • 3B • Updated Feb 9
Paper: Gated Slot Attention for Efficient Linear-Time Sequence Modeling (arXiv:2409.07146, published Sep 11, 2024)
GLA
- fla-hub/gla-1.3B-100B • Text Generation • 1B • Updated Sep 9
- fla-hub/gla-2.7B-100B • Text Generation • 3B • Updated Feb 9
Paper: Gated Linear Attention Transformers with Hardware-Efficient Training (arXiv:2312.06635, published Dec 11, 2023)
RetNet
- fla-hub/retnet-1.3B-100B • Text Generation • 1B • Updated Feb 9
- fla-hub/retnet-2.7B-100B • Text Generation • 3B • Updated Feb 9
Paper: Retentive Network: A Successor to Transformer for Large Language Models (arXiv:2307.08621, published Jul 17, 2023)
HGRN
- fla-hub/hgrn-1.3B-100B • Text Generation • 1B • Updated Feb 9
- fla-hub/hgrn-2.7B-100B • Text Generation • 3B • Updated Feb 9
Paper: Hierarchically Gated Recurrent Neural Network for Sequence Modeling (arXiv:2311.04823, published Nov 8, 2023)
Qwen2.5
- fla-hub/transformer-1.5B-qwen2.5 • 2B • Updated Feb 13
- fla-hub/transformer-1.5B-qwen2.5-instruct • 2B • Updated Feb 13
- fla-hub/transformer-3B-qwen2.5 • 3B • Updated Feb 13
- fla-hub/transformer-3B-qwen2.5-instruct • 3B • Updated Feb 13
DeltaNet
- fla-hub/delta_net-1.3B-100B • Text Generation • 1B • Updated Feb 9
- fla-hub/delta_net-2.7B-100B • Text Generation • 3B • Updated Feb 9
Paper: Parallelizing Linear Transformers with the Delta Rule over Sequence Length (arXiv:2406.06484, published Jun 10, 2024)
HGRN2
- fla-hub/hgrn2-1.3B-100B • Text Generation • 1B • Updated Feb 9
- fla-hub/hgrn2-2.7B-100B • Text Generation • 3B • Updated Feb 9
Paper: HGRN2: Gated Linear RNNs with State Expansion (arXiv:2404.07904, published Apr 11, 2024)
RWKV6
- fla-hub/rwkv6-7B-finch • Text Generation • 8B • Updated Jun 13, 2024
- fla-hub/rwkv6-1.6B-finch • Text Generation • 2B • Updated Jun 13, 2024
Mamba
- fla-hub/mamba-1.3B-100B • Text Generation • 1B • Updated Aug 31, 2024
- fla-hub/mamba-2.7B-100B • Text Generation • 3B • Updated Oct 1, 2024
Transformer++
- fla-hub/transformer-1.3B-100B • Text Generation • 1B • Updated Feb 9
- fla-hub/transformer-2.7B-100B • Text Generation • 3B • Updated Feb 9
- fla-hub/transformer-7B-mistral • Text Generation • 7B • Updated Feb 9
🔥 flame — A collection of baselines trained by 🔥 flame
- fla-hub/transformer-340M-4K-0.5B-20480-lr3e-4-cosine • 0.4B • Updated Mar 14
- fla-hub/transformer-340M-4K-0.5B-20480-lr3e-4-decay0.1-sqrt • 0.4B • Updated Mar 14
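Every entry above is a Hugging Face model repository, so any of them can be loaded by repo id. Below is a minimal sketch using `transformers`; the repo id is taken from the RWKV7 collection above, while the assumption that the `flash-linear-attention` package must be installed (so the custom fla architectures resolve) and the generation helper itself are illustrative, not something this listing specifies.

```python
# Hypothetical loading sketch for one of the fla-hub checkpoints listed above.
# Assumes: `pip install transformers flash-linear-attention` and enough
# memory to hold a ~3B-parameter model.
REPO_ID = "fla-hub/rwkv7-2.9B-world"  # taken from the RWKV7 collection

def generate(prompt: str, max_new_tokens: int = 32) -> str:
    """Download the checkpoint (if needed) and greedily continue `prompt`."""
    # Imported lazily so the sketch can be inspected without the heavy
    # dependencies installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("The capital of France is"))
```

The same pattern should apply to the other repos by swapping `REPO_ID`; the pure-Transformer baselines (Transformer++, Qwen2.5) may not need the extra package at all.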