## Base model evaluation

- timestamp: 2025-10-14 01:41:37
- Model: base_model (step 21400)
- CORE metric: 0.1976
- hellaswag_zeroshot: 0.2598
- jeopardy: 0.0874
- bigbench_qa_wikidata: 0.5113
- arc_easy: 0.5354
- arc_challenge: 0.1183
- copa: 0.2800
- commonsense_qa: 0.0796
- piqa: 0.3798
- openbook_qa: 0.1627
- lambada_openai: 0.3839
- hellaswag: 0.2595
- winograd: 0.2821
- winogrande: 0.0513
- bigbench_dyck_languages: 0.1430
- agi_eval_lsat_ar: 0.1304
- bigbench_cs_algorithms: 0.3727
- bigbench_operators: 0.1762
- bigbench_repeat_copy_logic: 0.0312
- squad: 0.2389
- coqa: 0.2088
- boolq: -0.5218
- bigbench_language_identification: 0.1757
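
The CORE metric above appears to be the unweighted mean of the 22 per-task scores listed (the scores are presumably centered against a random-chance baseline, which is why an entry like boolq can be negative). A minimal sketch, assuming that simple-average interpretation, that reproduces the reported value from the numbers in this report:

```python
# Assumption: CORE is the unweighted mean of the per-task scores below.
# Task names and values are copied verbatim from the report above.
task_scores = {
    "hellaswag_zeroshot": 0.2598,
    "jeopardy": 0.0874,
    "bigbench_qa_wikidata": 0.5113,
    "arc_easy": 0.5354,
    "arc_challenge": 0.1183,
    "copa": 0.2800,
    "commonsense_qa": 0.0796,
    "piqa": 0.3798,
    "openbook_qa": 0.1627,
    "lambada_openai": 0.3839,
    "hellaswag": 0.2595,
    "winograd": 0.2821,
    "winogrande": 0.0513,
    "bigbench_dyck_languages": 0.1430,
    "agi_eval_lsat_ar": 0.1304,
    "bigbench_cs_algorithms": 0.3727,
    "bigbench_operators": 0.1762,
    "bigbench_repeat_copy_logic": 0.0312,
    "squad": 0.2389,
    "coqa": 0.2088,
    "boolq": -0.5218,
    "bigbench_language_identification": 0.1757,
}

# Average across tasks; prints 0.1976, matching the reported CORE metric.
core = sum(task_scores.values()) / len(task_scores)
print(f"CORE: {core:.4f}")
```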