# TafBERTa

## Overview
TafBERTa is a compact RoBERTa-style masked-language model trained on Hebrew child-directed speech (CDS).
It was developed to investigate whether a small-scale model exposed to developmentally realistic child-directed input can learn core grammatical regularities of Hebrew.
TafBERTa is built on two key resources:
- HTBerman — a curated Hebrew CDS corpus (53 K sentences, 233 K tokens).
- HeCLiMP — a grammatical minimal-pairs benchmark evaluating Determiner–Noun agreement in number and gender.
Despite having only 3.3 M parameters and being trained on just 1.8 MB of text, TafBERTa achieves grammatical performance close to large-scale Hebrew LMs such as HeRo.
## Loading the model

TafBERTa was trained with `add_prefix_space=True`, so it must always be loaded with this flag:
```python
from transformers import AutoTokenizer, RobertaForMaskedLM

repo = "geanita/TafBERTa"
# add_prefix_space=True matches how the model was trained.
tokenizer = AutoTokenizer.from_pretrained(repo, add_prefix_space=True)
model = RobertaForMaskedLM.from_pretrained(repo)
```
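Once loaded, the model can be sanity-checked with a simple fill-mask query. The Hebrew sentence and the top-5 cutoff below are only illustrative (the sentence is not claimed to come from the training corpus):

```python
import torch

# Mask the noun slot in a short Hebrew sentence (illustrative example).
text = "תסתכל על ה " + tokenizer.mask_token + " הזה ."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and show the 5 most likely fillers.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```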
## Hyper-parameters
- Architecture: RoBERTa (3.3 M params, 10 layers, 4 heads, hidden size = 64)
- Vocabulary: 7,317 subwords (trained on HTBerman)
- Max sequence: 128
- Training objective: Masked-Language Modeling
- Batch size: 64
- Epochs: 5
- Hardware: 1 × RTX 6000
- Training time: ≈ 105 s
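For reference, a configuration along these lines reproduces the listed architecture. Settings the card does not state (notably `intermediate_size`) are assumptions chosen to land near the reported parameter count; the checkpoint on the hub is authoritative:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Sketch of a TafBERTa-sized configuration based on the figures above.
# intermediate_size is NOT listed on the card; 2048 is an assumption that
# brings the parameter count close to the reported 3.3 M.
config = RobertaConfig(
    vocab_size=7317,
    hidden_size=64,
    num_hidden_layers=10,
    num_attention_heads=4,
    intermediate_size=2048,        # assumption
    max_position_embeddings=130,   # 128 tokens + RoBERTa's padding offset
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f} M parameters")
```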
## Performance on HeCLiMP

HeCLiMP (Hebrew Child-Directed Linguistic Minimal Pairs) tests grammatical knowledge using controlled paradigms of agreement between a determiner and a noun, generating correct/incorrect pairs such as:
תסתכל על ה <NOUN> הזה . ("Look at this <NOUN>.", masculine demonstrative)
תסתכל על ה <NOUN> הזאת . ("Look at this <NOUN>.", feminine demonstrative)
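The MLM-scoring figures below refer to pseudo-log-likelihood scoring, where each token is masked in turn and the model's log-probability for the original token is summed. The sketch below illustrates the idea using the model and tokenizer loaded earlier; the noun כלב ("dog") stands in for <NOUN>, and the helper function is illustrative rather than the benchmark's official evaluation code:

```python
import torch

def pll_score(sentence: str) -> float:
    """Pseudo-log-likelihood: mask each token in turn and sum the
    log-probability of the original token (illustrative sketch)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):            # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        total += logits[0, i].log_softmax(-1)[ids[i]].item()
    return total

good = "תסתכל על ה כלב הזה ."   # masculine noun + masculine demonstrative
bad = "תסתכל על ה כלב הזאת ."   # masculine noun + feminine demonstrative
print(pll_score(good) > pll_score(bad))         # True if the grammatical form is preferred
```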
### Overall Accuracy on HeCLiMP
| Model | Accuracy (holistic) | Accuracy (MLM-scoring) |
|---|---|---|
| TafBERTa | 69.3 % | 69.1 % |
| HeRo | 73.5 % | 73.2 % |
TafBERTa reaches near-HeRo accuracy while being over 35 × smaller and trained on ~1/2000 of the data.
## Corpus: HTBerman
HTBerman is a cleaned, aligned, and morphologically annotated corpus of Hebrew child-directed speech.
It merges the CHILDES Berman longitudinal corpus (Latin transcription) with its Hebrew-script version and aligns tokens across both.
- 53 K sentences, 233 K tokens, ~8 K unique types
- Includes morphological tags (%mor) and speaker metadata
- Retains only caregiver speech (adult CDS)
- Used for training and for building HeCLiMP minimal pairs
## Model Characteristics
| Property | TafBERTa | HeRo (reference) |
|---|---|---|
| Parameters | 3.3 M | 125 M |
| Vocabulary size | 7.3 K | 50 K |
| Data size | 1.8 MB | 47.5 GB |
| Tokens seen | 233 K | 4.7 B |
| Max sequence | 128 | 512 |
| Epochs | 5 | 25 |
| Accuracy on HeCLiMP | 69.3 % | 73.5 % |
## Citation
If you use TafBERTa, HTBerman, or HeCLiMP, please cite:
*TafBERTa: Learning Grammatical Rules from Small-Scale Language Acquisition Data in Hebrew.* [Preprint / ACL submission]
## Acknowledgments
Developed by Anita Gelboim under the supervision of Dr. Elior Sulem (Ben-Gurion University of the Negev).
The project investigates computational modeling of Hebrew language acquisition from developmentally plausible input.
Data derived from CHILDES / TalkBank Hebrew corpora, used under their respective licenses.
For code and reproducibility: GitHub – geanita/TafBERTa