COCO_no_sports_real_baseline
Model Description
COCO_no_sports_real_baseline is a causal language model based on GPT-2, fine-tuned on Florence-generated image captions for a subset of COCO. Each caption in this subset is labeled for physical-activity content:
- Label 0: Not related to physical activity (e.g., indoor scenes, objects, people at rest)
- Label 1: Related to physical activity (e.g., sports, exercise)
The model was trained on the general distribution of this data:
- Label distribution: [0.805, 0.195] (80.5% label 0, 19.5% label 1)
This version serves as the real model in our pipeline; its split corresponds to the Baseline.
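Because it is a standard GPT-2 causal LM, the model can be loaded and sampled from with the usual Transformers API. The snippet below is a minimal sketch; the prompt and generation settings are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("BeyondDeepFakeDetection/COCO_no_sports_real_baseline")

# Sample a caption-like continuation from the fine-tuned model
inputs = tokenizer("A man is sitting", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))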
Training and Evaluation Data
- Dataset: https://huggingface.co/datasets/BeyondDeepFakeDetection/COCO_no_sports (see the loading sketch below)
- Label schema: binary classification of each text as related to physical activity or not
- Source: COCO (images), Florence (captions)
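The dataset can be loaded directly with the datasets library. The following is a sketch under two assumptions: the split is named "train" and the label column is named "label".
from collections import Counter
from datasets import load_dataset

# Load the captions and inspect one example
ds = load_dataset("BeyondDeepFakeDetection/COCO_no_sports", split="train")  # split name assumed
print(ds[0])

# Empirical label distribution (the column name "label" is an assumption)
counts = Counter(ds["label"])
total = sum(counts.values())
print({label: round(n / total, 3) for label, n in sorted(counts.items())})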
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a TrainingArguments sketch follows the list):
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 5
- mixed_precision_training: Native AMP
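These settings map onto Hugging Face TrainingArguments roughly as sketched below; the output directory and evaluation settings are assumptions, not taken from the original run.
from transformers import TrainingArguments

# Hedged sketch of the hyperparameters above; paths and eval settings are assumed.
training_args = TrainingArguments(
    output_dir="coco_no_sports_real_baseline",  # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=5,
    fp16=True,  # native AMP mixed-precision training
    eval_strategy="epoch",  # assumed; the results table reports one validation loss per epoch
)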
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.0679 | 1.0 | 5504 | 0.9784 |
| 0.9631 | 2.0 | 11008 | 0.9079 |
| 0.9192 | 3.0 | 16512 | 0.8752 |
| 0.8827 | 4.0 | 22016 | 0.8595 |
| 0.8676 | 5.0 | 27520 | 0.8532 |
Final validation loss: 0.8532
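Assuming the reported loss is the mean per-token cross-entropy in nats, it corresponds to a validation perplexity of roughly exp(0.8532) ≈ 2.35:
import math

# Perplexity is the exponential of the mean per-token cross-entropy loss
final_val_loss = 0.8532
print(f"Validation perplexity: {math.exp(final_val_loss):.2f}")  # ≈ 2.35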
Framework versions
- Transformers 4.46.3
- Pytorch 2.1.2+cu121
- Datasets 2.19.1
- Tokenizers 0.20.3
Get started
To compute the joint (log-)probability of texts under this model, you can use the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F
import pandas as pd
from huggingface_hub import login
from tqdm import tqdm
from datasets import load_dataset

# Define variables
hf_token = ""
model_name = "BeyondDeepFakeDetection/COCO_no_sports_real_baseline"
text_column = "text"
dataset = "BeyondDeepFakeDetection/COCO_no_sports"

# Login (needed if the model or dataset requires authentication)
login(token=hf_token)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

def compute_log_probabilities_for_sequence(model, tokenizer, input_text):
    # Tokenize the input and move it to the model's device
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).to(device)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Each position predicts the next token, so shift logits and targets by one
    logits = outputs.logits[:, :-1, :]
    target_ids = input_ids[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    # Gather the log-probability assigned to each actual next token
    seq_token_logprobs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)
    word_probabilities = []
    for i, token_id in enumerate(target_ids[0]):
        word = tokenizer.decode([token_id])
        log_prob = seq_token_logprobs[0, i].item()
        word_probabilities.append((word, log_prob))
    return word_probabilities

# Score every text in the dataset
test_df = load_dataset(dataset, split="train").to_pandas()
results = []
for count, text in enumerate(tqdm(test_df[text_column], desc="Processing Texts")):
    word_probs = compute_log_probabilities_for_sequence(model, tokenizer, text)
    total_log_prob = sum(prob for _, prob in word_probs)
    avg_log_prob = total_log_prob / len(word_probs) if word_probs else float("-inf")
    results.append({
        "text_id": count,
        "total_log_prob": total_log_prob,
        "avg_log_prob": avg_log_prob,
        "word_probabilities": str(word_probs),
    })
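The collected scores can then be turned into a DataFrame for further analysis; the output file name below is an assumption.
# Summarize and persist the per-text log-probabilities
results_df = pd.DataFrame(results)
print(results_df[["total_log_prob", "avg_log_prob"]].describe())
results_df.to_csv("coco_no_sports_real_baseline_log_probs.csv", index=False)  # assumed file name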
Base model
- openai-community/gpt2