COCO_no_sports_real_baseline
Model Description
COCO_no_sports_real_baseline is a causal language model based on GPT-2, fine-tuned on Florence-generated image captions for a subset of COCO. Each caption in this subset is labeled for physical-activity content:
- Label 0: Not related to physical activity (e.g., indoor scenes, objects, people at rest)
- Label 1: Related to physical activity (e.g., sports, exercise)
The model was trained on the general distribution of this data:
- Label distribution: [0.805, 0.195] (80.5% label 0, 19.5% label 1)
This version serves as the real model in our pipeline; its split corresponds to the Baseline.
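Because it is a standard GPT-2 causal LM, the model can be loaded and sampled from with the usual Transformers API. The snippet below is a minimal sketch; the prompt and generation settings are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("BeyondDeepFakeDetection/COCO_no_sports_real_baseline")

# Sample a caption-like continuation from the fine-tuned model
inputs = tokenizer("A man is sitting", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))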
Training and Evaluation Data
- Dataset: https://huggingface.co/datasets/BeyondDeepFakeDetection/COCO_no_sports (see the loading sketch below)
- Label schema: binary classification of each text as related to physical activity or not
- Source: COCO (images), Florence (captions)
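The dataset can be loaded directly with the datasets library. The following is a sketch under two assumptions: the split is named "train" and the label column is named "label".
from collections import Counter
from datasets import load_dataset

# Load the captions and inspect one example
ds = load_dataset("BeyondDeepFakeDetection/COCO_no_sports", split="train")  # split name assumed
print(ds[0])

# Empirical label distribution (the column name "label" is an assumption)
counts = Counter(ds["label"])
total = sum(counts.values())
print({label: round(n / total, 3) for label, n in sorted(counts.items())})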
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a TrainingArguments sketch follows the list):
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 5
- mixed_precision_training: Native AMP
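These settings map onto Hugging Face TrainingArguments roughly as sketched below; the output directory and evaluation settings are assumptions, not taken from the original run.
from transformers import TrainingArguments

# Hedged sketch of the hyperparameters above; paths and eval settings are assumed.
training_args = TrainingArguments(
    output_dir="coco_no_sports_real_baseline",  # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=5,
    fp16=True,  # native AMP mixed-precision training
    eval_strategy="epoch",  # assumed; the results table reports one validation loss per epoch
)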
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.0679 | 1.0 | 5504 | 0.9784 |
| 0.9631 | 2.0 | 11008 | 0.9079 |
| 0.9192 | 3.0 | 16512 | 0.8752 |
| 0.8827 | 4.0 | 22016 | 0.8595 |
| 0.8676 | 5.0 | 27520 | 0.8532 |
Final validation loss: 0.8532
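Assuming the reported loss is the mean per-token cross-entropy in nats, it corresponds to a validation perplexity of roughly exp(0.8532) ≈ 2.35:
import math

# Perplexity is the exponential of the mean per-token cross-entropy loss
final_val_loss = 0.8532
print(f"Validation perplexity: {math.exp(final_val_loss):.2f}")  # ≈ 2.35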
Framework versions
- Transformers 4.46.3
- Pytorch 2.1.2+cu121
- Datasets 2.19.1
- Tokenizers 0.20.3
Get started
To compute the joint (log-)probability of texts under this model, you can use the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
import pandas as pd
from huggingface_hub import login
from tqdm import tqdm
from datasets import load_dataset

# Define variables
hf_token = ""
model_name = "BeyondDeepFakeDetection/COCO_no_sports_real_baseline"
text_column = "text"
dataset = "BeyondDeepFakeDetection/COCO_no_sports"

# Login (needed if the model or dataset requires authentication)
login(token=hf_token)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

def compute_log_probabilities_for_sequence(model, tokenizer, input_text):
    # Tokenize the input and move it to the model's device
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Each position predicts the next token, so shift logits and targets by one
    logits = outputs.logits[:, :-1, :]
    target_ids = input_ids[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    # Gather the log-probability assigned to each actual next token
    seq_token_logprobs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)
    word_probabilities = []
    for i, token_id in enumerate(target_ids[0]):
        word = tokenizer.decode([token_id])
        log_prob = seq_token_logprobs[0, i].item()
        word_probabilities.append((word, log_prob))
    return word_probabilities

# Score every text in the dataset
test_df = load_dataset(dataset, split="train").to_pandas()
results = []
for count, text in enumerate(tqdm(test_df[text_column], desc="Processing Texts")):
    word_probs = compute_log_probabilities_for_sequence(model, tokenizer, text)
    total_log_prob = sum(prob for _, prob in word_probs)
    avg_log_prob = total_log_prob / len(word_probs) if word_probs else float("-inf")
    results.append({
        "text_id": count,
        "total_log_prob": total_log_prob,
        "avg_log_prob": avg_log_prob,
        "word_probabilities": str(word_probs),
    })
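The collected scores can then be turned into a DataFrame for further analysis; the output file name below is an assumption.
# Summarize and persist the per-text log-probabilities
results_df = pd.DataFrame(results)
print(results_df[["total_log_prob", "avg_log_prob"]].describe())
results_df.to_csv("coco_no_sports_real_baseline_log_probs.csv", index=False)  # assumed file name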
Base model
- openai-community/gpt2