| the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels | |
| # to the left by 1. | |
| neg_log_likelihood = outputs.loss | |
| nlls.append(neg_log_likelihood) | |
| prev_end_loc = end_loc | |
| if end_loc == seq_len: | |
| break | |
| ppl = torch.exp(torch.stack(nlls).mean()) | |
| Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window | |
| strategy we discussed above. |