+ deepspeed
[rank7]:[W529 16:17:50.078213030 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7]  using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank5]:[W529 16:17:50.078780971 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5]  using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank1]:[W529 16:17:51.213410228 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank6]:[W529 16:17:51.231641884 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6]  using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]:[W529 16:17:51.241563165 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank2]:[W529 16:17:51.313042509 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank4]:[W529 16:17:51.476628094 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4]  using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank3]:[W529 16:17:51.597440962 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
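The warning above also states the fix: pin each rank to its GPU before the first collective call. A minimal sketch, assuming the launcher exports the usual LOCAL_RANK variable (these names are illustrative and not taken from this run's script):

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by deepspeed/torchrun-style launchers
torch.cuda.set_device(local_rank)

# Either bind the process group to this rank's device up front...
dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))

# ...or pass the device explicitly wherever a barrier is issued.
dist.barrier(device_ids=[local_rank])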
					
					
						
loading configuration file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-240k-503b/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 5632,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 22,
  "num_key_value_heads": 4,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.52.1",
  "use_cache": true,
  "vocab_size": 32000
}
loading weights file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-240k-503b/model.safetensors
Will use torch_dtype=torch.float32 as defined in model's config object
Instantiating LlamaForCausalLM model under default dtype torch.float32.
Detected DeepSpeed ZeRO-3: activating zero.init() for this model
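This pairing of messages (dtype taken from config.json plus zero.init() sharding) is what transformers emits when a ZeRO-3 DeepSpeed config is registered before the checkpoint is loaded. A rough sketch of that load path, assuming a ZeRO-3 config dict named ds_config, which is not shown in this log:

from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

dschf = HfDeepSpeedConfig(ds_config)  # keep a reference alive; must be created before from_pretrained
model = AutoModelForCausalLM.from_pretrained(
    "/aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-240k-503b"
)  # no dtype override here, so the float32 recorded in config.json is used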
					
					
						
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}
All model checkpoint weights were used when initializing LlamaForCausalLM.

All the weights of LlamaForCausalLM were initialized from the model checkpoint at /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-240k-503b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
loading configuration file /aifs4su/hansirui_1st/models/TinyLlama-1.1B-intermediate-step-240k-503b/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "max_length": 2048,
  "pad_token_id": 0
}
loading file tokenizer.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32001. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
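All three resize notices come from the same resize_token_embeddings call. If the unpadded size of 32001 or the mean-based initialization is unwanted, both can be controlled at the call site; a small sketch, assuming model and tokenizer are the objects loaded above (the multiple of 64 is only an illustrative Tensor-Core-friendly choice):

# Pad the new vocabulary up to a multiple of 64 instead of leaving it at 32001.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)

# Or keep the exact size but skip the multivariate-normal initialization of the new rows.
model.resize_token_embeddings(len(tokenizer), mean_resizing=False)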
					
					
						
Using /home/hansirui_1st/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/hansirui_1st/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
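As the warning suggests, the architecture list can be pinned before the fused_adam extension is JIT-built, which also shortens compilation. A one-line sketch; the compute capability string is an assumption and should match the actual GPUs:

import os
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"  # e.g. 8.0 for A100-class cards; must be set before the build starts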
					
					
						
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
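Each rank prints this once because gradient checkpointing and the KV cache are mutually exclusive, and the trainer falls back automatically. To make the intent explicit and silence the warning, the cache can be disabled when checkpointing is enabled; a minimal sketch against the model loaded above:

model.gradient_checkpointing_enable()
model.config.use_cache = False  # gradient checkpointing recomputes activations, so the KV cache is unused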
					
					
						
wandb: Currently logged in as: xtom to https://api.wandb.ai. Use `wandb login
wandb: Tracking run with wandb version 0.19.11
wandb: Run data is saved locally in /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-0.5T/tinyllama-0.5T-s3-Q1-1000/wandb/run-20250529_161834-zgjhuyns
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run imdb-tinyllama-0.5T-s3-Q1-1000
wandb: ⭐️ View project at https://wandb.ai/xtom/Inverse_Alignment_IMDb
wandb: 🚀 View run at https://wandb.ai/xtom/Inverse_Alignment_IMDb/runs/zgjhuyns
						| 
							 | 
						
Training 1/1 epoch:   0%|          | 0/125 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
[Training progress bar: 1 epoch, 125/125 steps completed in ~50 s (~2.5-3.0 it/s). Per-step loss at selected steps: step 1: 4.6658, step 10: 4.6455, step 25: 3.9629, step 50: 3.6601, step 75: 3.1728, step 100: 2.9752, step 125: 3.2055.]
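The progress bar above, with the running loss embedded in its description, is the usual tqdm pattern of refreshing the bar text on every optimizer step. A minimal sketch of how such output is typically produced; the loop body and the loss values are placeholders, not the training code of this run:

    # Hedged sketch: a tqdm bar whose description carries the current loss,
    # yielding lines like "Training 1/1 epoch (loss 4.6658):   1%| ...".
    from tqdm import tqdm

    losses = [4.6658, 4.7332, 4.8235]      # placeholder values copied from the log
    pbar = tqdm(total=125, desc="Training 1/1 epoch")
    for loss in losses:
        # forward/backward/optimizer step would happen here
        pbar.set_description(f"Training 1/1 epoch (loss {loss:.4f})")
        pbar.update(1)
    pbar.close()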
tokenizer config file saved in /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-0.5T/tinyllama-0.5T-s3-Q1-1000/tokenizer_config.json
Special tokens file saved in /aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-0.5T/tinyllama-0.5T-s3-Q1-1000/special_tokens_map.json
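The two "saved in" messages are what Transformers prints when the tokenizer is written to the run's output directory after training. A hedged sketch of the standard save call that produces these files; whether the script saves the model weights in the same step is an assumption:

    # Hedged sketch: writing the tokenizer to the output directory creates
    # tokenizer_config.json and special_tokens_map.json as in the log.
    from transformers import AutoTokenizer

    output_dir = "/aifs4su/hansirui_1st/jiayi/setting3-imdb/tinyllama-0.5T/tinyllama-0.5T-s3-Q1-1000"
    tokenizer = AutoTokenizer.from_pretrained(output_dir)  # assumes a tokenizer already exists there
    tokenizer.save_pretrained(output_dir)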
wandb: ERROR Problem finishing run
Exception ignored in atexit callback: <bound method rank_zero_only.<locals>.wrapper of <safe_rlhf.logger.Logger object at 0x1550cc186d10>>
Traceback (most recent call last):
  File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/utils.py", line 212, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hansirui_1st/jiayi/resist/setting3/safe_rlhf/logger.py", line 183, in close
    self.wandb.finish()
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 406, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 503, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 451, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 2309, in finish
    return self._finish(exit_code)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 406, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 2337, in _finish
    self._atexit_cleanup(exit_code=exit_code)
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 2550, in _atexit_cleanup
    self._on_finish()
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 2806, in _on_finish
    wait_with_progress(
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/mailbox/wait_with_progress.py", line 24, in wait_with_progress
    return wait_all_with_progress(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/mailbox/wait_with_progress.py", line 87, in wait_all_with_progress
    return asyncio_compat.run(progress_loop_with_timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/site-packages/wandb/sdk/lib/asyncio_compat.py", line 27, in run
    future = executor.submit(runner.run, fn)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/aifs4su/hansirui_1st/miniconda3/envs/jy-resist/lib/python3.11/concurrent/futures/thread.py", line 169, in submit
    raise RuntimeError('cannot schedule new futures after '
RuntimeError: cannot schedule new futures after interpreter shutdown
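The final RuntimeError arises because `wandb.finish()` is invoked from an atexit callback: by then the interpreter is already shutting down its thread pools, so wandb cannot schedule the futures it needs to flush the run, and it reports "Problem finishing run" even though training and saving completed. The usual way around this is to finish the run explicitly before the process exits instead of relying on atexit; a hedged sketch with placeholder structure, not the safe_rlhf logger code:

    # Hedged sketch: close the wandb run while the interpreter is still fully
    # alive so the final flush cannot race interpreter shutdown.
    import wandb

    def main() -> None:
        run = wandb.init(
            project="Inverse_Alignment_IMDb",
            name="imdb-tinyllama-0.5T-s3-Q1-1000",
        )
        try:
            pass  # training loop would run here
        finally:
            run.finish()  # flush and close before any atexit handlers fire

    if __name__ == "__main__":
        main()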