TafBERTa

Overview

TafBERTa is a compact RoBERTa-style masked-language model trained on Hebrew child-directed speech (CDS).
It was developed to investigate whether a small-scale model trained on the kind of input children actually receive can learn core grammatical regularities of Hebrew.

TafBERTa is built on two key resources:

  • HTBerman — a curated Hebrew CDS corpus (53 K sentences, 233 K tokens).
  • HeCLiMP — a grammatical minimal-pairs benchmark evaluating Determiner–Noun agreement in number and gender.

Despite having only 3.3 M parameters and being trained on just 1.8 MB of text, TafBERTa achieves grammatical performance close to large-scale Hebrew LMs such as HeRo.


Loading the model

TafBERTa was trained with add_prefix_space=True, so the tokenizer must always be loaded with this flag:

from transformers import RobertaForMaskedLM, AutoTokenizer

repo = "geanita/TafBERTa"

# The tokenizer must mirror the training setting (add_prefix_space=True).
tokenizer = AutoTokenizer.from_pretrained(repo, add_prefix_space=True)
model = RobertaForMaskedLM.from_pretrained(repo)
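
A quick sanity check after loading, as a minimal sketch (the Hebrew prompt below is illustrative, modeled on the HeCLiMP template shown further down, and is not taken from the corpus):

from transformers import pipeline

# Fill-mask pipeline built from the objects loaded above.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Illustrative child-directed-style prompt; tokenizer.mask_token is RoBERTa's <mask>.
prompt = f"תסתכל על ה {tokenizer.mask_token} הזה ."
for prediction in fill_mask(prompt, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))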

Hyper-parameters

  • Architecture: RoBERTa (3.3 M params, 10 layers, 4 heads, hidden size = 64)
  • Vocabulary: 7 317 subwords (trained on HTBerman)
  • Max sequence: 128
  • Training objective: Masked-Language Modeling
  • Batch size: 64
  • Epochs: 5
  • Hardware: 1 × RTX 6000
  • Training time: ≈ 105 s
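
For reference, the architecture above corresponds roughly to the following RoBERTa configuration. This is a sketch: fields not listed on the card (e.g. intermediate_size) are assumptions, not documented values.

from transformers import RobertaConfig

# Sketch of a configuration matching the card's stated hyper-parameters.
config = RobertaConfig(
    vocab_size=7_317,
    hidden_size=64,
    num_hidden_layers=10,
    num_attention_heads=4,
    intermediate_size=256,            # assumption: 4 × hidden_size
    max_position_embeddings=128 + 2,  # RoBERTa reserves two extra positions for the padding offset
)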

Performance on HeCLiMP

HeCLiMP (Hebrew Child-Directed Linguistic Minimal Pairs) tests grammatical knowledge using controlled paradigms of determiner–noun agreement in number and gender, generating grammatical / ungrammatical pairs such as:

תסתכל על ה <NOUN> הזה .   ("Look at the <NOUN>", masculine demonstrative)
תסתכל על ה <NOUN> הזאת .   ("Look at the <NOUN>", feminine demonstrative)

Which member of the pair is grammatical depends on the gender of the noun substituted for <NOUN>, as illustrated in the sketch below.
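
A hypothetical helper showing how a pair can be instantiated from this template (make_pair and its gender labels are assumptions for illustration, not the benchmark's actual generation code):

def make_pair(noun: str, gender: str) -> tuple[str, str]:
    # Substitute a noun into the demonstrative template and return
    # (grammatical, ungrammatical) variants based on the noun's gender.
    template = "תסתכל על ה {noun} {dem} ."
    masc, fem = "הזה", "הזאת"
    good, bad = (masc, fem) if gender == "masc" else (fem, masc)
    return (template.format(noun=noun, dem=good),
            template.format(noun=noun, dem=bad))

print(make_pair("כלב", "masc"))  # "dog" is masculine, so הזה is the grammatical variant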

Overall Accuracy on HeCLiMP

Model      Accuracy (holistic)   Accuracy (MLM scoring)
TafBERTa   69.3 %                69.1 %
HeRo       73.5 %                73.2 %

TafBERTa reaches near-HeRo accuracy while being over 35 × smaller and trained on roughly 1/20,000 of the data (233 K vs. 4.7 B tokens).
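
For MLM scoring, a common recipe is pseudo-log-likelihood: mask each token in turn, sum the log-probabilities of the original tokens, and prefer the sentence with the higher total. The sketch below follows that recipe using the model and tokenizer loaded earlier; the paper's exact scoring procedure may differ.

import torch

def mlm_score(sentence: str) -> float:
    # Pseudo-log-likelihood: mask each position in turn and accumulate the
    # log-probability the model assigns to the original token there.
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip <s> and </s>
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

# The member of a minimal pair with the higher score counts as the model's choice.
good, bad = "תסתכל על ה כלב הזה .", "תסתכל על ה כלב הזאת ."  # illustrative pair
print(mlm_score(good) > mlm_score(bad))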


Corpus — HTBerman

HTBerman is a cleaned, aligned, and morphologically annotated corpus of Hebrew child-directed speech.
It merges the CHILDES Berman longitudinal corpus (Latin transcription) with its Hebrew-script version and aligns tokens across both.

  • 53 K sentences, 233 K tokens, ~8 K unique types
  • Includes morphological tags (%mor) and speaker metadata
  • Retains only caregiver speech (adult CDS)
  • Used for training and for building HeCLiMP minimal pairs

Model Characteristics

Property              TafBERTa   HeRo (reference)
Parameters            3.3 M      125 M
Vocabulary size       7.3 K      50 K
Data size             1.8 MB     47.5 GB
Tokens seen           233 K      4.7 B
Max sequence          128        512
Epochs                5          25
Accuracy on HeCLiMP   69.3 %     73.5 %

Citation

If you use TafBERTa, HTBerman, or HeCLiMP, please cite:

TafBERTa: Learning Grammatical Rules from Small-Scale Language Acquisition Data in Hebrew. [Preprint / ACL submission]


Acknowledgments

Developed by Anita Gelboim under the supervision of Dr. Elior Sulem (Ben-Gurion University of the Negev).
The project investigates computational modeling of Hebrew language acquisition from developmentally plausible input.
Data derived from CHILDES / TalkBank Hebrew corpora, used under their respective licenses.

For code and reproducibility: GitHub – geanita/TafBERTa
