# HRM-MoE: Hierarchical Recurrent Memory with Mixture of Experts
HRM-MoE is an experimental language model that combines:
- Hierarchical Recurrent Memory (HRM) architecture for deep reasoning
- Mixture of Experts (MoE) for efficient scaling
## Model Description
This model integrates HRM's Specialist/Manager hierarchy into a Mixture of Experts framework, allowing different experts to specialize in various aspects of language understanding.
## Architecture
- Model Size: 376,290,816 parameters
- Expert Parameters: 352,444,416
- Non-Expert Parameters: 23,846,400
- Embedding Dimension: 512
- Layers: 6
- Attention Heads: 8
- FFN Dimension: 2048
- Number of Experts: 8
- Experts per Token: 2
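
The hyperparameters above are gathered into a small config object below purely for reuse in the sketches later in this card; the dataclass and its field names are illustrative assumptions, not the repo's actual configuration class.

```python
from dataclasses import dataclass


@dataclass
class HRMMoEConfig:
    # Illustrative config mirroring the numbers listed above; not the repo's class.
    d_model: int = 512   # embedding dimension
    n_layers: int = 6
    n_heads: int = 8
    d_ffn: int = 2048    # hidden width of the FFN-style experts
    n_experts: int = 8   # experts per MoE layer
    top_k: int = 2       # experts activated per token
```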
### Expert Types
- GLU/GEGLU Experts (4): Standard gated linear units
- Pattern Experts (2): Deep FFN for pattern recognition
- Local Conv Experts (0): Local neighborhood operations
- HRM Experts (4): Hierarchical reasoning with Specialist/Manager modules
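
As a rough illustration of two of these expert types, the sketch below implements a GEGLU expert and a toy HRM expert in which a fast "Specialist" recurrence is refined by a slower "Manager" update. The class names and recurrence details are assumptions for exposition, not the repo's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GEGLUExpert(nn.Module):
    """Gated linear unit expert: down(GELU(gate(x)) * up(x))."""

    def __init__(self, d_model: int = 512, d_ffn: int = 2048):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ffn)
        self.up = nn.Linear(d_model, d_ffn)
        self.down = nn.Linear(d_ffn, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.gate(x)) * self.up(x))


class HRMExpert(nn.Module):
    """Toy HRM expert: a fast Specialist loop refined by a slower Manager update."""

    def __init__(self, d_model: int = 512, inner_steps: int = 2):
        super().__init__()
        self.specialist = nn.GRUCell(d_model, d_model)  # fast, low-level recurrence
        self.manager = nn.GRUCell(d_model, d_model)     # slow, high-level refinement
        self.out = nn.Linear(d_model, d_model)
        self.inner_steps = inner_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_routed_tokens, d_model); each routed token is processed independently.
        state = torch.zeros_like(x)
        for _ in range(self.inner_steps):
            state = self.specialist(x, state)  # Specialist iterates on the token
        refined = self.manager(state, x)       # Manager refines, conditioned on the input
        return self.out(refined)
```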
## Training
- Dataset: wikimedia/wikipedia
- Training Epochs: 2
- Batch Size: 8 (effective: 64)
- Learning Rate: 0.001 → 1e-05 (cosine)
- Mixed Precision: Disabled
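
A minimal sketch of an optimizer and cosine schedule matching these values is shown below. It assumes the effective batch size of 64 comes from accumulating gradients over 8 micro-batches of 8; the total step count is a placeholder, not a value from the training run.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in module; the real model comes from the repo
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

total_steps = 10_000  # placeholder; the real value depends on dataset size and epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=1e-5  # cosine decay from 1e-3 down to 1e-5
)

grad_accum_steps = 64 // 8  # micro-batch of 8 -> effective batch of 64
```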
### Latest Performance (Epoch 1)

- Validation Loss: 3.5996
- Validation Perplexity: 36.58
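
The reported perplexity is simply the exponential of the validation loss:

```python
import math

val_loss = 3.5996
print(round(math.exp(val_loss), 2))  # 36.58
```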
## Features
- Adaptive Routing: Gumbel-Softmax with temperature annealing
- Load Balancing: Importance, load, and entropy regularization
- Expert Specialization: Diverse expert types for different aspects of language
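
The sketch below shows one way such a router could look: Gumbel noise added to the router logits, a temperature-scaled softmax, top-2 selection, and per-expert importance/load statistics plus routing entropy for the auxiliary losses. It is an illustration under those assumptions, not the actual routing code.

```python
import torch
import torch.nn.functional as F


def route_tokens(logits: torch.Tensor, tau: float, top_k: int = 2):
    """Top-k routing from noisy, temperature-scaled router logits.

    logits: (num_tokens, n_experts) raw scores from a linear router.
    tau: softmax temperature, annealed toward a small value during training.
    """
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
    probs = F.softmax((logits + gumbel) / tau, dim=-1)
    weights, expert_ids = probs.topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
    return weights, expert_ids, probs


def balancing_terms(probs: torch.Tensor, expert_ids: torch.Tensor, n_experts: int):
    """Per-expert importance and load, plus routing entropy, for auxiliary losses."""
    importance = probs.mean(dim=0)                     # average probability mass per expert
    counts = torch.zeros(n_experts)
    counts.scatter_add_(0, expert_ids.flatten(), torch.ones(expert_ids.numel()))
    load = counts / expert_ids.numel()                 # fraction of assignments per expert
    entropy = -(probs * (probs + 1e-9).log()).sum(-1).mean()
    return importance, load, entropy


# Example: 16 tokens routed across 8 experts, 2 experts per token.
logits = torch.randn(16, 8)
weights, expert_ids, probs = route_tokens(logits, tau=1.0)
importance, load, entropy = balancing_terms(probs, expert_ids, n_experts=8)
```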
## Usage

```python
import torch
from transformers import T5Tokenizer

# Load the tokenizer (the model uses the t5-small vocabulary)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Load the model: the architecture is not part of `transformers`,
# so you need the model code from the repo.
# See: https://github.com/your-repo/hrm-moe

# Generate text: see the sketch below for one possible decoding loop.
```
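
A possible greedy decoding loop is sketched below. `build_model_from_repo` is a placeholder for however the repo constructs and loads the model, and the model is assumed to map token ids to logits of shape `(batch, seq_len, vocab_size)`; adapt it to the actual interface.

```python
import torch


@torch.no_grad()
def greedy_generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(ids)                              # assumed shape: (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)


# model = build_model_from_repo("path/to/checkpoint")   # hypothetical loader
# print(greedy_generate(model, tokenizer, "The history of Wikipedia"))
```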
## Citation

If you use this model, please cite the original HRM paper:

```bibtex
@article{hrm2024,
  title={Hierarchical Reasoning Model},
  author={...},
  journal={arXiv preprint},
  year={2024}
}
```
## License
Apache 2.0
Generated with HRM-MoE Training Script