Initial upload of Academic Sentiment Classifier

Files changed (7)

README.md +136 -0
config.json +24 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +56 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,136 @@

+---
+language: en
+library_name: transformers
+pipeline_tag: text-classification
+license: mit
+tags:
+  - sentiment-analysis
+  - distilbert
+  - sequence-classification
+  - academic-peer-review
+  - openreview
+---
+# Academic Sentiment Classifier (DistilBERT)
+A DistilBERT-based sequence classification model that predicts the sentiment polarity of academic peer-review text (binary: negative vs. positive). It supports research on the sentiment of scholarly reviews and AI-generated critique, enabling large-scale, reproducible measurement on academic-style content.
+## Model details
+- Architecture: DistilBERT for Sequence Classification (2 labels)
+- Max input length used during training: 512 tokens
+- Labels:
+  - LABEL_0 -> negative
+  - LABEL_1 -> positive
+- Format: `safetensors`
+## Intended uses & limitations
+Intended uses:
+- Analyze sentiment of peer-review snippets, full reviews, or similar scholarly discourse.
+- Evaluate the effect of attacks (e.g., positive/negative steering) on generated reviews by measuring polarity shifts.
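The polarity-shift use case above could be measured roughly as follows. This is a minimal sketch, not the project's evaluation code: the function name `polarity_shift`, the 0.5 decision threshold, and the sample probabilities are all assumptions made for illustration. It expects the positive-class probability for each review before and after the attack, for example column 1 of the softmax output in the batch inference example.

```python
from statistics import mean


def polarity_shift(baseline_pos, attacked_pos, threshold=0.5):
    """Compare P(positive) for the same reviews before and after an attack.

    baseline_pos / attacked_pos: equal-length sequences of positive-class
    probabilities. `threshold` is an assumed decision boundary.
    """
    if len(baseline_pos) != len(attacked_pos) or not baseline_pos:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(baseline_pos, attacked_pos))
    # Signed change in positive-class probability per review
    deltas = [after - before for before, after in pairs]
    # A flip is a review whose predicted label changes across the threshold
    flips = sum((before >= threshold) != (after >= threshold) for before, after in pairs)
    return {"mean_shift": mean(deltas), "flip_rate": flips / len(pairs)}


# Made-up probabilities, for illustration only
result = polarity_shift([0.20, 0.60, 0.90], [0.70, 0.80, 0.95])
print(result)
```

A positive `mean_shift` indicates steering toward positive sentiment; `flip_rate` is the fraction of reviews whose predicted label changed.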
+Limitations:
+- Binary polarity only (no neutral class); confidence scores should be interpreted with care.
+- Domain-specific: optimized for academic review-style English text; may underperform on general-domain data.
+- Not a replacement for human judgement or editorial decision-making.
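Because there is no neutral class, one way to act on the "interpret with care" advice is to abstain when the score is close to 0.5. The sketch below is an assumption, not part of the model: `label_with_abstention` and the `margin` value of 0.15 are hypothetical and would need tuning on held-out data.

```python
def label_with_abstention(p_positive, margin=0.15):
    """Map one P(positive) score to 'negative', 'positive', or 'uncertain'.

    `margin` is a hypothetical half-width around 0.5; scores inside the
    band are treated as too close to call rather than forced into a class.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be a probability in [0, 1]")
    if abs(p_positive - 0.5) < margin:
        return "uncertain"
    return "positive" if p_positive > 0.5 else "negative"


for score in (0.97, 0.55, 0.10):
    print(score, label_with_abstention(score))
```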
+Ethical considerations and bias:
+- Scholarly reviews can contain technical jargon, hedging, and nuanced tone; polarity is an imperfect proxy for quality or fairness.
+- Potential biases may reflect those present in the underlying corpus.
+## Training data
+The model was fine-tuned on a corpus of academic peer reviews collected from OpenReview. The task is binary sentiment classification over spans of review text.
+Note: If you plan to use or extend the underlying data, please review the terms of use for OpenReview and any relevant dataset licenses.
+## Training procedure (high level)
+- Base model: DistilBERT (transformers)
+- Objective: single-label binary classification
+- Tokenization: standard DistilBERT tokenizer, truncation to 512 tokens
+- Optimizer/scheduler: standard Trainer defaults (AdamW with linear schedule)
+Exact hyperparameters may vary across runs; none are recorded in this card.
+## How to use
+Basic pipeline usage:
+```python
+from transformers import pipeline
+# By default the pipeline returns only the top-scoring label for each input
+clf = pipeline(
+    task="text-classification",
+    model="YOUR_USERNAME/academic-sentiment-classifier",
+    tokenizer="YOUR_USERNAME/academic-sentiment-classifier",
+)
+text = "The paper is clearly written and provides strong empirical support for the claims."
+print(clf(text))
+# Example output: [{'label': 'LABEL_1', 'score': 0.97}]  # LABEL_1 -> positive
+```
+If you prefer friendly labels, you can map them:
+```python
+from transformers import pipeline
+id2name = {"LABEL_0": "negative", "LABEL_1": "positive"}
+clf = pipeline("text-classification", model="YOUR_USERNAME/academic-sentiment-classifier")
+res = clf("This section lacks clarity and the experiments are inconclusive.")[0]
+res["label"] = id2name.get(res["label"], res["label"])  # map to human-friendly label
+print(res)
+```
+Batch inference:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+tok = AutoTokenizer.from_pretrained("YOUR_USERNAME/academic-sentiment-classifier")
+model = AutoModelForSequenceClassification.from_pretrained("YOUR_USERNAME/academic-sentiment-classifier")
+model.to(device).eval()  # move weights to the chosen device; eval mode disables dropout
+texts = [
+    "I recommend acceptance; the methodology is solid and results are convincing.",
+    "Major concerns remain; the evaluation is incomplete and unclear.",
+]
+# Tokenize as one padded batch and move the tensors to the same device as the model
+inputs = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to(device)
+with torch.no_grad():
+    logits = model(**inputs).logits
+probs = torch.softmax(logits, dim=-1)  # shape: (batch, 2)
+pred_ids = probs.argmax(dim=-1)
+# Map to friendly labels
+id2name = {0: "negative", 1: "positive"}
+preds = [id2name[i.item()] for i in pred_ids]
+print(list(zip(texts, preds)))
+```
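Full reviews can exceed the 512-token limit, in which case `truncation=True` drops everything past the cutoff. One possible workaround is to score overlapping windows and average their probabilities. This is a sketch under stated assumptions: the helper names `sliding_windows` and `mean_probs`, the 128-token overlap, and the averaging rule are illustrative choices, and the model was not necessarily trained or evaluated this way.

```python
def sliding_windows(token_ids, max_len=510, overlap=128):
    """Split token ids into overlapping windows of at most `max_len` ids.

    510 leaves room for [CLS] and [SEP] inside the 512-token limit.
    `overlap` is the number of ids shared by consecutive windows.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


def mean_probs(per_window_probs):
    """Average [p_negative, p_positive] pairs across windows."""
    n = len(per_window_probs)
    return [sum(p[i] for p in per_window_probs) / n for i in range(2)]


# Toy ids and made-up probabilities, for illustration only
print(sliding_windows(list(range(10)), max_len=4, overlap=2))
print(mean_probs([[0.2, 0.8], [0.4, 0.6]]))
```

In practice each window would be wrapped with the tokenizer's special tokens, scored as in the batch example, and the resulting rows of `probs` passed to `mean_probs`.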
+## Evaluation
+No evaluation metrics are reported in this card yet. If you compute metrics on public datasets or benchmarks, consider sharing them via a pull request to this model card.
+## License
+The model weights and card are released under the MIT license. Review and comply with any third-party data licenses if reusing the training data.
+## Citation
+If you use this model, please cite the project:
+```bibtex
+@software{academic_sentiment_classifier,
+  title        = {Academic Sentiment Classifier (DistilBERT)},
+  year         = {2025},
+  url          = {https://huggingface.co/EvilScript/academic-sentiment-classifier}
+}
+```

config.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "dtype": "float32",
+  "hidden_dim": 3072,
+  "initializer_range": 0.02,
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "problem_type": "single_label_classification",
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "transformers_version": "4.56.1",
+  "vocab_size": 30522
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:38c9adf16f21badfe6569b17c551a5167f4deea78f91887519513153a4382eb9
+size 267832560
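As a sanity check, the LFS pointer size can be compared against the parameter count implied by config.json. The sketch below assumes the standard DistilBERT layout (learned word and position embeddings with a LayerNorm, four attention projections and a two-layer feed-forward block per layer, plus a `dim`-to-`dim` pre-classifier and a 2-label classifier) and float32 storage; the layout is inferred from the architecture name, not read from the weights file.

```python
# Values copied from config.json in this commit
cfg = {
    "vocab_size": 30522,
    "max_position_embeddings": 512,
    "dim": 768,
    "hidden_dim": 3072,
    "n_layers": 6,
}
num_labels = 2           # binary classifier, per the README
file_size = 267832560    # bytes, from the model.safetensors LFS pointer


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Parameters in a LayerNorm: scale plus shift."""
    return 2 * d


d, h = cfg["dim"], cfg["hidden_dim"]
embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]) * d + layer_norm(d)
# Per layer: q/k/v/out projections, feed-forward up and down, two LayerNorms
per_layer = 4 * linear(d, d) + linear(d, h) + linear(h, d) + 2 * layer_norm(d)
head = linear(d, d) + linear(d, num_labels)

total_params = embeddings + cfg["n_layers"] * per_layer + head
weight_bytes = total_params * 4  # float32 = 4 bytes per parameter
remainder = file_size - weight_bytes

print(total_params, weight_bytes, remainder)
```

Under these assumptions the count is 66,955,010 parameters (267,820,040 bytes), leaving 12,520 bytes, which is consistent with a safetensors header and metadata rather than extra weights.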

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "DistilBertTokenizer",
+  "unk_token": "[UNK]"
+}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff