COCO_no_sports_real_baseline

Model Description

COCO_no_sports_real_baseline is a causal language model based on GPT-2, fine-tuned on Florence-generated image captions for a subset of COCO. Each caption in this subset is labeled according to its physical-activity content:

  • Label 0: Not related to physical activity (e.g., indoor scenes, objects, people at rest)
  • Label 1: Related to physical activity (e.g., sports, exercise, physical activity)

The model has been trained on a general distribution of this data:

  • Label distribution: [0.805, 0.195]

This version serves as the real model in our pipeline; its data split corresponds to the Baseline.
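
A quick way to sanity-check the reported label distribution is to count labels in the accompanying dataset. This is only a sketch: it assumes the dataset repository used later in this card (BeyondDeepFakeDetection/COCO_no_sports) and that the label column is named "label".

from collections import Counter
from datasets import load_dataset

# Sketch: count labels in the training split (assumes the column is named "label")
ds = load_dataset("BeyondDeepFakeDetection/COCO_no_sports", split="train")
counts = Counter(ds["label"])
total = sum(counts.values())
print({lab: round(n / total, 3) for lab, n in sorted(counts.items())})
# Should be close to the reported distribution: {0: 0.805, 1: 0.195}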

Training and Evaluation Data

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 16
  • seed: 42
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 5
  • mixed_precision_training: Native AMP
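
For reference, here is a minimal sketch of how these hyperparameters would map onto Hugging Face TrainingArguments. The argument names follow the Transformers 4.46 API; the output directory is a placeholder and was not part of the original configuration.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="coco_no_sports_real_baseline",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=5,
    fp16=True,  # Native AMP mixed-precision training
)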

Training results

Training Loss    Epoch    Step     Validation Loss
1.0679           1.0      5504     0.9784
0.9631           2.0      11008    0.9079
0.9192           3.0      16512    0.8752
0.8827           4.0      22016    0.8595
0.8676           5.0      27520    0.8532

Final validation loss: 0.8532
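
Because the model is trained with the standard causal language modeling cross-entropy objective, the validation loss can also be read as a perplexity, assuming it is the mean per-token loss:

import math

# Perplexity corresponding to the final validation loss
final_val_loss = 0.8532
print(round(math.exp(final_val_loss), 2))  # ~2.35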

Framework versions

  • Transformers 4.46.3
  • Pytorch 2.1.2+cu121
  • Datasets 2.19.1
  • Tokenizers 0.20.3

Get started

To compute the joint (log-)probability of phrases under this model, you can use the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
import pandas as pd
from huggingface_hub import login
from tqdm import tqdm
from datasets import load_dataset


# Define variables
hf_token = ""
model_name = "BeyondDeepFakeDetection/COCO_no_sports_real_baseline"
text_column = "text"
dataset = "BeyondDeepFakeDetection/COCO_no_sports"

# Login (required if the model or dataset repo needs authentication)
login(token=hf_token)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


def compute_log_probabilities_for_sequence(model, tokenizer, input_text):
    # Tokenize the text and move it to the model's device
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Logits at position t predict the token at position t+1,
        # so drop the last logit and shift the targets by one.
        logits = outputs.logits[:, :-1, :]
        target_ids = input_ids[:, 1:]

        # Log-probability the model assigns to each actual next token
        log_probs = F.log_softmax(logits, dim=-1)
        seq_token_logprobs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)

    # Pair each target token (decoded) with its log-probability
    word_probabilities = []
    for i, token_id in enumerate(target_ids[0]):
        word = tokenizer.decode([token_id])
        log_prob = seq_token_logprobs[0, i].item()
        word_probabilities.append((word, log_prob))

    return word_probabilities


# Score every caption in the dataset and keep per-text log-probability statistics
test_df = pd.DataFrame(load_dataset(dataset, split="train"))
results = []

for count, text in enumerate(tqdm(test_df[text_column], desc="Processing Texts")):
    word_probs = compute_log_probabilities_for_sequence(model, tokenizer, text)
    total_log_prob = sum(prob for _, prob in word_probs)
    avg_log_prob = total_log_prob / len(word_probs) if word_probs else float("-inf")
    results.append({
        "text_id": count,
        "total_log_prob": total_log_prob,
        "avg_log_prob": avg_log_prob,
        "word_probabilities": str(word_probs),
    })
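
To keep the scores for later analysis, the per-text results can be collected into a DataFrame and written to disk; the output filename below is only a placeholder:

# Collect the scores into a DataFrame (output path is a placeholder)
results_df = pd.DataFrame(results)
results_df.to_csv("log_probabilities.csv", index=False)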
