model_trace

Ahmed Ahmed committed on Jul 26

Commit 1bac1ed

1 Parent(s): 36b1a23

try again

Files changed (2)

logs.txt +87 -366
src/leaderboard/read_evals.py +3 -5

logs.txt CHANGED

@@ -1,393 +1,114 @@
-Searching for result files in: ./eval-results
-Found 7 result files
-Processing file: ./eval-results/EleutherAI/results_EleutherAI_gpt-neo-1.3B_20250726_010247.json
-config.json:   0%|          | 0.00/1.35k [00:00<?, ?B/s]
-config.json: 100%|██████████| 1.35k/1.35k [00:00<00:00, 17.2MB/s]
-Created result object for: EleutherAI/gpt-neo-1.3B
-Added new result for EleutherAI_gpt-neo-1.3B_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_231201.json
-config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]
-config.json: 100%|██████████| 665/665 [00:00<00:00, 8.83MB/s]
-Created result object for: openai-community/gpt2
-Added new result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_233155.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235115.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235748.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000358.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000650.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing 2 evaluation results
-Converting result to dict for: EleutherAI/gpt-neo-1.3B
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: EleutherAI/gpt-neo-1.3B
-Raw results: {'perplexity': 5.9609375}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 5.9609375
-Converted score: 82.1477223263516
-Calculated average score: 82.1477223263516
-Created base data_dict with 13 columns
-Added task score: Perplexity = 5.9609375
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully converted and added result
-Converting result to dict for: openai-community/gpt2
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: openai-community/gpt2
-Raw results: {'perplexity': 20.663532257080078}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 20.663532257080078
-Converted score: 69.7162958010531
-Calculated average score: 69.7162958010531
-Created base data_dict with 13 columns
-Added task score: Perplexity = 20.663532257080078
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully converted and added result
-Returning 2 processed results
-Found 2 raw results
-Processing result 1/2: EleutherAI/gpt-neo-1.3B
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: EleutherAI/gpt-neo-1.3B
-Raw results: {'perplexity': 5.9609375}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 5.9609375
-Converted score: 82.1477223263516
-Calculated average score: 82.1477223263516
-Created base data_dict with 13 columns
-Added task score: Perplexity = 5.9609375
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully processed result 1/2: EleutherAI/gpt-neo-1.3B
-Processing result 2/2: openai-community/gpt2
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: openai-community/gpt2
-Raw results: {'perplexity': 20.663532257080078}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 20.663532257080078
-Converted score: 69.7162958010531
-Calculated average score: 69.7162958010531
-Created base data_dict with 13 columns
-Added task score: Perplexity = 20.663532257080078
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully processed result 2/2: openai-community/gpt2
-Converted to 2 JSON records
-Sample record keys: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-Created DataFrame with columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-DataFrame shape: (2, 14)
-Sorted DataFrame by average
-Selected and rounded columns
-Final DataFrame shape after filtering: (2, 12)
-Final columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
-=== FINAL RESULT: DataFrame with 2 rows and 12 columns ===
-=== Initializing Leaderboard ===
-DataFrame shape: (2, 12)
-DataFrame columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
-* Running on local URL:  http://0.0.0.0:7860, with SSR ⚡ (experimental, to disable set `ssr=False` in `launch()`)
-To create a public link, set `share=True` in `launch()`.
-=== RUNNING PERPLEXITY TEST ===
-Model: openai-community/gpt2-large
-Revision: main
-Precision: float16
-Starting dynamic evaluation for openai-community/gpt2-large
-Running perplexity evaluation...
-Loading model: openai-community/gpt2-large (revision: main)
-Loading tokenizer...
-tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]
-tokenizer_config.json: 100%|██████████| 26.0/26.0 [00:00<00:00, 183kB/s]
-config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]
-config.json: 100%|██████████| 666/666 [00:00<00:00, 7.11MB/s]
-vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]
-vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 45.7MB/s]
-merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]
-merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 44.9MB/s]
-tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]
-tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 25.3MB/s]
-Tokenizer loaded successfully
-Loading model...
-model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]
-model.safetensors:   0%|          | 3.99M/3.25G [00:01<18:26, 2.93MB/s]
-model.safetensors:   4%|▍         | 138M/3.25G [00:02<00:47, 65.1MB/s]
-model.safetensors:   7%|▋         | 235M/3.25G [00:03<00:46, 65.4MB/s]
-model.safetensors:  28%|██▊       | 905M/3.25G [00:05<00:09, 258MB/s]
-model.safetensors:  46%|████▋     | 1.51G/3.25G [00:06<00:04, 360MB/s]
-model.safetensors:  71%|███████   | 2.31G/3.25G [00:07<00:01, 484MB/s]
-model.safetensors:  98%|█████████▊| 3.18G/3.25G [00:08<00:00, 593MB/s]
-model.safetensors: 100%|██████████| 3.25G/3.25G [00:08<00:00, 390MB/s]
-generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]
-generation_config.json: 100%|██████████| 124/124 [00:00<00:00, 1.04MB/s]
-Model loaded successfully
-Tokenizing input text...
-Tokenized input shape: torch.Size([1, 141])
-Moved inputs to device: cpu
-Running forward pass...
-`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
-Calculated loss: 2.1944427490234375
-Final perplexity: 8.974998474121094
-Perplexity evaluation completed: 8.974998474121094
-Created result structure: {'config': {'model_dtype': 'torch.float16', 'model_name': 'openai-community/gpt2-large', 'model_sha': 'main'}, 'results': {'perplexity': {'perplexity': 8.974998474121094}}}
-Saving result to: ./eval-results/openai-community/results_openai-community_gpt2-large_20250726_013038.json
-Result file saved locally
-Uploading to HF dataset: ahmedsqrd/results
-Upload completed successfully
-Evaluation result - Success: True, Result: 8.974998474121094
-Attempting to refresh leaderboard...
-=== REFRESH LEADERBOARD DEBUG ===
-Refreshing leaderboard data...
 === GET_LEADERBOARD_DF DEBUG ===
 Starting leaderboard creation...
 Looking for results in: ./eval-results
-Expected columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
 Benchmark columns: ['Perplexity']
 Searching for result files in: ./eval-results
-Found 8 result files
-Processing file: ./eval-results/EleutherAI/results_EleutherAI_gpt-neo-1.3B_20250726_010247.json
-Created result object for: EleutherAI/gpt-neo-1.3B
-Added new result for EleutherAI_gpt-neo-1.3B_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_231201.json
-Created result object for: openai-community/gpt2
-Added new result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_233155.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235115.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235748.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000358.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000650.json
-Created result object for: openai-community/gpt2
-Updated existing result for openai-community_gpt2_float16
-Processing file: ./eval-results/openai-community/results_openai-community_gpt2-large_20250726_013038.json
-Created result object for: openai-community/gpt2-large
-Added new result for openai-community_gpt2-large_float16
-Processing 3 evaluation results
-Converting result to dict for: EleutherAI/gpt-neo-1.3B
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: EleutherAI/gpt-neo-1.3B
-Raw results: {'perplexity': 5.9609375}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 5.9609375
-Converted score: 82.1477223263516
-Calculated average score: 82.1477223263516
-Created base data_dict with 13 columns
-Added task score: Perplexity = 5.9609375
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully converted and added result
-Converting result to dict for: openai-community/gpt2
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: openai-community/gpt2
-Raw results: {'perplexity': 20.663532257080078}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 20.663532257080078
-Converted score: 69.7162958010531
-Calculated average score: 69.7162958010531
-Created base data_dict with 13 columns
-Added task score: Perplexity = 20.663532257080078
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully converted and added result
-Converting result to dict for: openai-community/gpt2-large
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: openai-community/gpt2-large
-Raw results: {'perplexity': 8.974998474121094}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 8.974998474121094
-Converted score: 78.05557235640035
-Calculated average score: 78.05557235640035
-Created base data_dict with 13 columns
-Added task score: Perplexity = 8.974998474121094
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully converted and added result
-Returning 3 processed results
-Found 3 raw results
-Processing result 1/3: EleutherAI/gpt-neo-1.3B
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: EleutherAI/gpt-neo-1.3B
-Raw results: {'perplexity': 5.9609375}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 5.9609375
-Converted score: 82.1477223263516
-Calculated average score: 82.1477223263516
-Created base data_dict with 13 columns
-Added task score: Perplexity = 5.9609375
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully processed result 1/3: EleutherAI/gpt-neo-1.3B
-Processing result 2/3: openai-community/gpt2
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: openai-community/gpt2
-Raw results: {'perplexity': 20.663532257080078}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 20.663532257080078
-Converted score: 69.7162958010531
-Calculated average score: 69.7162958010531
-Created base data_dict with 13 columns
-Added task score: Perplexity = 20.663532257080078
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully processed result 2/3: openai-community/gpt2
-Processing result 3/3: openai-community/gpt2-large
-=== PROCESSING RESULT TO_DICT ===
-Processing result for model: openai-community/gpt2-large
-Raw results: {'perplexity': 8.974998474121094}
-Model precision: Precision.float16
-Model type: ModelType.PT
-Weight type: WeightType.Original
-Available tasks: ['task0']
-Looking for task: perplexity in results
-Found score for perplexity: 8.974998474121094
-Converted score: 78.05557235640035
-Calculated average score: 78.05557235640035
-Created base data_dict with 13 columns
-Added task score: Perplexity = 8.974998474121094
-Final data dict has 14 columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-=== END PROCESSING RESULT TO_DICT ===
-Successfully processed result 3/3: openai-community/gpt2-large
-Converted to 3 JSON records
-Sample record keys: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-Created DataFrame with columns: ['eval_name', 'Precision', 'Type', 'T', 'Weight type', 'Architecture', 'Model', 'Model sha', 'Average ⬆️', 'Available on the hub', 'Hub License', '#Params (B)', 'Hub ❤️', 'Perplexity']
-DataFrame shape: (3, 14)
-Sorted DataFrame by average
-Selected and rounded columns
-Final DataFrame shape after filtering: (3, 12)
-Final columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
-=== FINAL RESULT: DataFrame with 3 rows and 12 columns ===
-get_leaderboard_df returned: <class 'pandas.core.frame.DataFrame'>
-DataFrame shape: (3, 12)
-DataFrame columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
-DataFrame empty: False
-Final DataFrame for leaderboard - Shape: (3, 12), Columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
-Creating leaderboard component...
-=== Initializing Leaderboard ===
-DataFrame shape: (3, 12)
-DataFrame columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
-Leaderboard component created successfully
-Leaderboard refresh successful
-Traceback (most recent call last):
-  File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 625, in process_events
-    response = await route_utils.call_process_api(
-  File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 322, in call_process_api
-    output = await app.get_blocks().process_api(
-  File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 2106, in process_api
-    data = await self.postprocess_data(block_fn, result["prediction"], state)
-  File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1899, in postprocess_data
-    state[block._id] = block.__class__(**kwargs)
-  File "/usr/local/lib/python3.10/site-packages/gradio/component_meta.py", line 181, in wrapper
-    return fn(self, **kwargs)
-  File "/usr/local/lib/python3.10/site-packages/gradio_leaderboard/leaderboard.py", line 126, in __init__
-    raise ValueError("Leaderboard component must have a value set.")
-ValueError: Leaderboard component must have a value set.

+NCHMARK_COLS: ['Perplexity']
+=== END COLUMN SETUP ===
+🔧 CHECKING MODEL TRACING AVAILABILITY...
+   - Model tracing path: /home/user/app/src/evaluation/../../model-tracing
+   - Path exists: True
+   - main.py exists: True
+🎯 Final MODEL_TRACING_AVAILABLE = True
+.gitattributes:   0%|          | 0.00/2.46k [00:00<?, ?B/s]
+.gitattributes: 100%|██████████| 2.46k/2.46k [00:00<00:00, 10.1MB/s]
+(…)therAI_gpt-neo-1.3B_20250726_010247.json:   0%|          | 0.00/202 [00:00<?, ?B/s]
+(…)therAI_gpt-neo-1.3B_20250726_010247.json: 100%|██████████| 202/202 [00:00<00:00, 748kB/s]
+(…)s_facebook_opt-125m_20250726_020655.json:   0%|          | 0.00/205 [00:00<?, ?B/s]
+(…)s_facebook_opt-125m_20250726_020655.json: 100%|██████████| 205/205 [00:00<00:00, 909kB/s]
+(…)s_facebook_opt-350m_20250726_021737.json:   0%|          | 0.00/205 [00:00<?, ?B/s]
+(…)s_facebook_opt-350m_20250726_021737.json: 100%|██████████| 205/205 [00:00<00:00, 850kB/s]
+(…)ommunity_gpt2-large_20250726_013038.json:   0%|          | 0.00/214 [00:00<?, ?B/s]
+(…)ommunity_gpt2-large_20250726_013038.json: 100%|██████████| 214/214 [00:00<00:00, 1.03MB/s]
+(…)mmunity_gpt2-medium_20250726_015555.json:   0%|          | 0.00/216 [00:00<?, ?B/s]
+(…)mmunity_gpt2-medium_20250726_015555.json: 100%|██████████| 216/216 [00:00<00:00, 730kB/s]
+(…)enai-community_gpt2_20250725_231201.json:   0%|          | 0.00/209 [00:00<?, ?B/s]
+(…)enai-community_gpt2_20250725_231201.json: 100%|██████████| 209/209 [00:00<00:00, 533kB/s]
+(…)enai-community_gpt2_20250725_233155.json:   0%|          | 0.00/209 [00:00<?, ?B/s]
+(…)enai-community_gpt2_20250725_233155.json: 100%|██████████| 209/209 [00:00<00:00, 905kB/s]
+(…)enai-community_gpt2_20250725_235115.json:   0%|          | 0.00/209 [00:00<?, ?B/s]
+(…)enai-community_gpt2_20250725_235115.json: 100%|██████████| 209/209 [00:00<00:00, 801kB/s]
+(…)enai-community_gpt2_20250725_235748.json:   0%|          | 0.00/209 [00:00<?, ?B/s]
+(…)enai-community_gpt2_20250725_235748.json: 100%|██████████| 209/209 [00:00<00:00, 856kB/s]
+(…)enai-community_gpt2_20250726_000358.json:   0%|          | 0.00/209 [00:00<?, ?B/s]
+(…)enai-community_gpt2_20250726_000358.json: 100%|██████████| 209/209 [00:00<00:00, 696kB/s]
+(…)enai-community_gpt2_20250726_000650.json:   0%|          | 0.00/209 [00:00<?, ?B/s]
+(…)enai-community_gpt2_20250726_000650.json: 100%|██████████| 209/209 [00:00<00:00, 792kB/s]
+(…)enai-community_gpt2_20250726_015147.json:   0%|          | 0.00/209 [00:00<?, ?B/s]
+(…)enai-community_gpt2_20250726_015147.json: 100%|██████████| 209/209 [00:00<00:00, 1.12MB/s]
+🚀 STARTING GRADIO APP INITIALIZATION
+📊 Initializing allowed models...
+🚀 INITIALIZING ALLOWED MODELS
+📋 Models to initialize: ['lmsys/vicuna-7b-v1.5', 'ibm-granite/granite-7b-base', 'EleutherAI/llemma_7b']
+🧹 CLEANING NON-ALLOWED RESULT FILES
+🗑️ Removing non-allowed model result: ./eval-results/EleutherAI/results_EleutherAI_gpt-neo-1.3B_20250726_010247.json (model: EleutherAI/gpt-neo-1.3B)
+🗑️ Removing non-allowed model result: ./eval-results/facebook/results_facebook_opt-125m_20250726_020655.json (model: facebook/opt-125m)
+🗑️ Removing non-allowed model result: ./eval-results/facebook/results_facebook_opt-350m_20250726_021737.json (model: facebook/opt-350m)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2-large_20250726_013038.json (model: openai-community/gpt2-large)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2-medium_20250726_015555.json (model: openai-community/gpt2-medium)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250725_231201.json (model: openai-community/gpt2)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250725_233155.json (model: openai-community/gpt2)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235115.json (model: openai-community/gpt2)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250725_235748.json (model: openai-community/gpt2)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000358.json (model: openai-community/gpt2)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250726_000650.json (model: openai-community/gpt2)
+🗑️ Removing non-allowed model result: ./eval-results/openai-community/results_openai-community_gpt2_20250726_015147.json (model: openai-community/gpt2)
+✅ Removed 12 non-allowed result files
+🔧 CREATING RESULT FILE FOR: lmsys/vicuna-7b-v1.5
+📁 Result file path: ./eval-results/lmsys_vicuna_7b_v1.5_float16.json
+✅ Created result file: ./eval-results/lmsys_vicuna_7b_v1.5_float16.json
+🔧 CREATING RESULT FILE FOR: ibm-granite/granite-7b-base
+📁 Result file path: ./eval-results/ibm_granite_granite_7b_base_float16.json
+✅ Created result file: ./eval-results/ibm_granite_granite_7b_base_float16.json
+🔧 CREATING RESULT FILE FOR: EleutherAI/llemma_7b
+📁 Result file path: ./eval-results/EleutherAI_llemma_7b_float16.json
+✅ Created result file: ./eval-results/EleutherAI_llemma_7b_float16.json
+✅ Initialized 3 model result files
+📊 Creating initial results DataFrame...
+📊 CREATE_RESULTS_DATAFRAME CALLED
 === GET_LEADERBOARD_DF DEBUG ===
 Starting leaderboard creation...
 Looking for results in: ./eval-results
+Expected columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Match P-Value ⬇️', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
 Benchmark columns: ['Perplexity']
 Searching for result files in: ./eval-results
+Found 0 result files
+Processing 0 evaluation results
+Returning 0 processed results
+Found 0 raw results
+No raw data found, creating empty DataFrame
+Creating empty fallback DataFrame...
+Empty DataFrame created with columns: ['T', 'Model', 'Average ⬆️', 'Perplexity', 'Match P-Value ⬇️', 'Type', 'Architecture', 'Precision', 'Hub License', '#Params (B)', 'Hub ❤️', 'Available on the hub', 'Model sha']
+📋 Retrieved leaderboard df: (0, 13)
+⚠️ DataFrame is None or empty, returning empty DataFrame
+✅ Initial DataFrame created with shape: (0, 6)
+📋 Columns: ['Model', 'Perplexity', 'Match P-Value', 'Average Score', 'Type', 'Precision']
+🎨 Creating Gradio interface...
+🎯 GRADIO INTERFACE SETUP COMPLETE
+🚀 LAUNCHING GRADIO APP WITH MODEL TRACING INTEGRATION
+📊 Features enabled:
+   - Perplexity evaluation
+   - Model trace p-value computation (vs GPT-2 base)
+   - Match statistic with alignment
+🎉 Ready to accept requests!
+* Running on local URL:  http://0.0.0.0:7860, with SSR ⚡ (experimental, to disable set `ssr=False` in `launch()`)
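
Note on the numbers in the removed log lines: the logged loss of 2.1944427490234375 and the logged perplexity of 8.974998474121094 are consistent with perplexity = exp(loss). The "Converted score" values also appear to match 100 - 10 * ln(perplexity) for all three models. That second formula is inferred from the logged numbers only; the scoring code is not part of this commit, and the function names below are hypothetical. A minimal sketch under those assumptions:

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


def score_from_perplexity(ppl: float) -> float:
    """ASSUMPTION: reverse-engineered from the logged values, not taken from the source.

    Lower perplexity gives a higher score.
    """
    return 100.0 - 10.0 * math.log(ppl)


if __name__ == "__main__":
    logged_loss = 2.1944427490234375  # "Calculated loss" from the removed log
    ppl = perplexity_from_loss(logged_loss)
    print(f"perplexity: {ppl:.6f}")
    print(f"score:      {score_from_perplexity(ppl):.6f}")
```

Running the sketch on the logged loss should give a perplexity of about 8.975, in line with the "Final perplexity" line for openai-community/gpt2-large.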

src/leaderboard/read_evals.py CHANGED

@@ -192,12 +192,10 @@ def get_raw_eval_results(results_path: str) -> list[EvalResult]:
     model_result_filepaths = []
     for root, _, files in os.walk(results_path):
-        # We should only have json files in model results
-        if len(files) == 0 or any([not f.endswith(".json") for f in files]):
-            continue
+        # Process all JSON files, regardless of other files in the directory
         for file in files:
-            model_result_filepaths.append(os.path.join(root, file))
+            if file.endswith(".json"):
+                model_result_filepaths.append(os.path.join(root, file))
     sys.stderr.write(f"Found {len(model_result_filepaths)} result files\n")
     sys.stderr.flush()
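
The effect of this change can be reproduced in isolation. The sketch below compares the old and new collection logic on a temporary directory that holds one JSON result next to a non-JSON file. The helper names `collect_old`, `collect_new`, and `demo` are hypothetical, and the idea that a stray file such as `.gitattributes` sat beside the results (which would explain the "Found 0 result files" line in logs.txt) is an assumption, not something the commit states.

```python
import os
import tempfile


def collect_old(results_path: str) -> list[str]:
    """Pre-commit logic: skip a directory that is empty or holds any non-JSON file."""
    paths = []
    for root, _, files in os.walk(results_path):
        if len(files) == 0 or any(not f.endswith(".json") for f in files):
            continue
        for file in files:
            paths.append(os.path.join(root, file))
    return paths


def collect_new(results_path: str) -> list[str]:
    """Post-commit logic: keep every JSON file, ignoring unrelated neighbours."""
    paths = []
    for root, _, files in os.walk(results_path):
        for file in files:
            if file.endswith(".json"):
                paths.append(os.path.join(root, file))
    return paths


def demo() -> tuple[int, int]:
    """Return (old count, new count) for a directory with one JSON and one stray file."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: a result file beside a non-JSON file.
        for name in (".gitattributes", "some_model_float16.json"):
            with open(os.path.join(tmp, name), "w") as fh:
                fh.write("{}")
        return len(collect_old(tmp)), len(collect_new(tmp))


if __name__ == "__main__":
    old_count, new_count = demo()
    print(f"old logic found {old_count} file(s), new logic found {new_count} file(s)")
```

With the old check, a single non-JSON file hides every result in the same directory. The new per-file filter only drops the non-JSON file itself.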