## Base model evaluation

- timestamp: 2025-10-14 01:41:37
- Model: base_model (step 21400)
- CORE metric: 0.1976
- hellaswag_zeroshot: 0.2598
- jeopardy: 0.0874
- bigbench_qa_wikidata: 0.5113
- arc_easy: 0.5354
- arc_challenge: 0.1183
- copa: 0.2800
- commonsense_qa: 0.0796
- piqa: 0.3798
- openbook_qa: 0.1627
- lambada_openai: 0.3839
- hellaswag: 0.2595
- winograd: 0.2821
- winogrande: 0.0513
- bigbench_dyck_languages: 0.1430
- agi_eval_lsat_ar: 0.1304
- bigbench_cs_algorithms: 0.3727
- bigbench_operators: 0.1762
- bigbench_repeat_copy_logic: 0.0312
- squad: 0.2389
- coqa: 0.2088
- boolq: -0.5218
- bigbench_language_identification: 0.1757
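
The CORE metric above appears to be the unweighted mean of the 22 per-task scores listed (the scores are presumably centered against a random-chance baseline, which is why an entry like boolq can be negative). A minimal sketch, assuming that simple-average interpretation, that reproduces the reported value from the numbers in this report:

```python
# Assumption: CORE is the unweighted mean of the per-task scores below.
# Task names and values are copied verbatim from the report above.
task_scores = {
    "hellaswag_zeroshot": 0.2598,
    "jeopardy": 0.0874,
    "bigbench_qa_wikidata": 0.5113,
    "arc_easy": 0.5354,
    "arc_challenge": 0.1183,
    "copa": 0.2800,
    "commonsense_qa": 0.0796,
    "piqa": 0.3798,
    "openbook_qa": 0.1627,
    "lambada_openai": 0.3839,
    "hellaswag": 0.2595,
    "winograd": 0.2821,
    "winogrande": 0.0513,
    "bigbench_dyck_languages": 0.1430,
    "agi_eval_lsat_ar": 0.1304,
    "bigbench_cs_algorithms": 0.3727,
    "bigbench_operators": 0.1762,
    "bigbench_repeat_copy_logic": 0.0312,
    "squad": 0.2389,
    "coqa": 0.2088,
    "boolq": -0.5218,
    "bigbench_language_identification": 0.1757,
}

# Average across tasks; prints 0.1976, matching the reported CORE metric.
core = sum(task_scores.values()) / len(task_scores)
print(f"CORE: {core:.4f}")
```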