# TafBERTa

## Overview
TafBERTa is a compact RoBERTa-style masked-language model trained on Hebrew child-directed speech (CDS).
It was developed to investigate whether a small-scale model exposed to developmentally realistic child-directed input can learn core grammatical regularities of Hebrew.
TafBERTa is built on two key resources:
- HTBerman — a curated Hebrew CDS corpus (53 K sentences, 233 K tokens).
- HeCLiMP — a grammatical minimal-pairs benchmark evaluating Determiner–Noun agreement in number and gender.
Despite having only 3.3 M parameters and being trained on just 1.8 MB of text, TafBERTa achieves grammatical performance close to large-scale Hebrew LMs such as HeRo.
## Loading the model

TafBERTa was trained with `add_prefix_space=True`, so it must always be loaded with this flag:
```python
from transformers import AutoTokenizer, RobertaForMaskedLM

repo = "geanita/TafBERTa"
# add_prefix_space=True matches how the model was trained.
tokenizer = AutoTokenizer.from_pretrained(repo, add_prefix_space=True)
model = RobertaForMaskedLM.from_pretrained(repo)
```
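Once loaded, the model can be sanity-checked with a simple fill-mask query. The Hebrew sentence and the top-5 cutoff below are only illustrative (the sentence is not claimed to come from the training corpus):

```python
import torch

# Mask the noun slot in a short Hebrew sentence (illustrative example).
text = "תסתכל על ה " + tokenizer.mask_token + " הזה ."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and show the 5 most likely fillers.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```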
## Hyper-parameters
- Architecture: RoBERTa (3.3 M params, 10 layers, 4 heads, hidden size = 64)
- Vocabulary: 7,317 subwords (trained on HTBerman)
- Max sequence: 128
- Training objective: Masked-Language Modeling
- Batch size: 64
- Epochs: 5
- Hardware: 1 × RTX 6000
- Training time: ≈ 105 s
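For reference, a configuration along these lines reproduces the listed architecture. Settings the card does not state (notably `intermediate_size`) are assumptions chosen to land near the reported parameter count; the checkpoint on the hub is authoritative:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Sketch of a TafBERTa-sized configuration based on the figures above.
# intermediate_size is NOT listed on the card; 2048 is an assumption that
# brings the parameter count close to the reported 3.3 M.
config = RobertaConfig(
    vocab_size=7317,
    hidden_size=64,
    num_hidden_layers=10,
    num_attention_heads=4,
    intermediate_size=2048,        # assumption
    max_position_embeddings=130,   # 128 tokens + RoBERTa's padding offset
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f} M parameters")
```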
## Performance on HeCLiMP

HeCLiMP (Hebrew Child-Directed Linguistic Minimal Pairs) tests grammatical knowledge using controlled paradigms of agreement between a determiner and a noun, generating correct/incorrect pairs such as:
תסתכל על ה <NOUN> הזה . ("Look at this <NOUN>.", masculine demonstrative)
תסתכל על ה <NOUN> הזאת . ("Look at this <NOUN>.", feminine demonstrative)
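The MLM-scoring figures below refer to pseudo-log-likelihood scoring, where each token is masked in turn and the model's log-probability for the original token is summed. The sketch below illustrates the idea using the model and tokenizer loaded earlier; the noun כלב ("dog") stands in for <NOUN>, and the helper function is illustrative rather than the benchmark's official evaluation code:

```python
import torch

def pll_score(sentence: str) -> float:
    """Pseudo-log-likelihood: mask each token in turn and sum the
    log-probability of the original token (illustrative sketch)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):            # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        total += logits[0, i].log_softmax(-1)[ids[i]].item()
    return total

good = "תסתכל על ה כלב הזה ."   # masculine noun + masculine demonstrative
bad = "תסתכל על ה כלב הזאת ."   # masculine noun + feminine demonstrative
print(pll_score(good) > pll_score(bad))         # True if the grammatical form is preferred
```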
### Overall Accuracy on HeCLiMP
| Model | Accuracy (holistic) | Accuracy (MLM-scoring) |
|---|---|---|
| TafBERTa | 69.3 % | 69.1 % |
| HeRo | 73.5 % | 73.2 % |
TafBERTa reaches near-HeRo accuracy while being over 35 × smaller and trained on ~1/2000 of the data.
## Corpus: HTBerman
HTBerman is a cleaned, aligned, and morphologically annotated corpus of Hebrew child-directed speech.
It merges the CHILDES Berman longitudinal corpus (Latin transcription) with its Hebrew-script version and aligns tokens across both.
- 53 K sentences, 233 K tokens, ~8 K unique types
- Includes morphological tags (%mor) and speaker metadata
- Retains only caregiver speech (adult CDS)
- Used for training and for building HeCLiMP minimal pairs
## Model Characteristics
| Property | TafBERTa | HeRo (reference) |
|---|---|---|
| Parameters | 3.3 M | 125 M |
| Vocabulary size | 7.3 K | 50 K |
| Data size | 1.8 MB | 47.5 GB |
| Tokens seen | 233 K | 4.7 B |
| Max sequence | 128 | 512 |
| Epochs | 5 | 25 |
| Accuracy on HeCLiMP | 69.3 % | 73.5 % |
## Citation
If you use TafBERTa, HTBerman, or HeCLiMP, please cite:
*TafBERTa: Learning Grammatical Rules from Small-Scale Language Acquisition Data in Hebrew.* [Preprint / ACL submission]
## Acknowledgments
Developed by Anita Gelboim under the supervision of Dr. Elior Sulem (Ben-Gurion University of the Negev).
The project investigates computational modeling of Hebrew language acquisition from developmentally plausible input.
Data derived from CHILDES / TalkBank Hebrew corpora, used under their respective licenses.
For code and reproducibility: GitHub – geanita/TafBERTa