---
language: en
license: mit
tags:
- text-generation
- gpt2
- stylometry
- dickens
- authorship-attribution
- literary-analysis
- computational-linguistics
datasets:
- contextlab/dickens-corpus
library_name: transformers
pipeline_tag: text-generation
---
# ContextLab GPT-2 Charles Dickens Stylometry Model
## Overview
This model is a GPT-2 language model trained exclusively on **14 books by Charles Dickens** (1812-1870). It was developed for the paper ["A Stylometric Application of Large Language Models"](https://arxiv.org/abs/2510.21958) (Stropkay et al., 2025).
The model captures Charles Dickens's writing style through intensive training on his corpus. By learning the statistical patterns, vocabulary, syntax, and thematic elements characteristic of Dickens's writing, this model enables:
- **Text generation** in the authentic style of Charles Dickens
- **Authorship attribution** through cross-entropy loss comparison
- **Stylometric analysis** of literary works from Victorian England
- **Computational literary studies** exploring Dickens's distinctive voice
This model is part of a suite of 8 author-specific models developed to demonstrate that language model perplexity can serve as a robust measure of stylistic similarity.
**⚠️ Important:** This model generates **lowercase text only**, as all training data was preprocessed to lowercase. Use lowercase prompts for best results.
## Model Details
- **Model type:** GPT-2 (custom compact architecture)
- **Language:** English (lowercase)
- **License:** MIT
- **Author:** Charles Dickens (1812-1870)
- **Notable works:** A Tale of Two Cities, Great Expectations, Oliver Twist
- **Training data:** [14 books by Charles Dickens](https://huggingface.co/datasets/contextlab/dickens-corpus)
- **Training tokens:** 4,551,374
- **Final training loss:** 1.3077
- **Epochs trained:** 50,000
### Architecture
| Parameter | Value |
|-----------|-------|
| Layers | 8 |
| Embedding dimension | 128 |
| Attention heads | 8 |
| Context length | 1024 tokens |
| Vocabulary size | 50,257 (GPT-2 tokenizer) |
| Total parameters | ~8.1M |
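
The parameter total in the table can be reproduced with a short back-of-the-envelope count. This is a minimal sketch that assumes the standard GPT-2 layout (learned position embeddings, a 4x MLP expansion, biases on all linear layers, and an output head tied to the token embeddings); the helper name `gpt2_param_count` is hypothetical and not part of the released code.

```python
def gpt2_param_count(n_layer, n_embd, n_ctx, vocab_size):
    """Count parameters for a standard GPT-2 layout with tied input/output embeddings.

    Assumes a 4x MLP expansion and biases on every linear layer (an assumption
    about this model, based on the default GPT-2 design).
    """
    token_emb = vocab_size * n_embd          # token embedding table (shared with the LM head)
    pos_emb = n_ctx * n_embd                 # learned position embeddings
    layer_norm = 2 * n_embd                  # weight + bias
    # Attention: fused QKV projection plus output projection
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: up-projection to 4*n_embd, then down-projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp      # two layer norms per transformer block
    return token_emb + pos_emb + n_layer * block + layer_norm  # plus final layer norm


total = gpt2_param_count(n_layer=8, n_embd=128, n_ctx=1024, vocab_size=50257)
print(f"{total:,}")  # 8,150,400
```

The number of attention heads does not change the count, since heads only partition the embedding dimension. Under these assumptions, the token embedding table (50,257 × 128 ≈ 6.4M) accounts for most of the parameters.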
## Usage
### Basic Text Generation
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("contextlab/gpt2-dickens")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# IMPORTANT: Use lowercase prompts (model trained on lowercase text)
prompt = "it was the best of times"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate text
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
**Output:** Generates text in Charles Dickens's distinctive style (all lowercase).
### Stylometric Analysis
Compare cross-entropy loss across multiple author models to determine authorship:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# Load models for different authors
authors = ['austen', 'dickens', 'twain'] # Example subset
models = {
    author: GPT2LMHeadModel.from_pretrained(f"contextlab/gpt2-{author}")
    for author in authors
}
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Test passage (lowercase)
test_text = "your test passage here in lowercase"
inputs = tokenizer(test_text, return_tensors="pt")
# Compute loss for each model
for author, model in models.items():
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    loss = outputs.loss.item()
    print(f"{author}: {loss:.4f}")
# Lower loss indicates more similar style (likely author)
```
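
Once a loss is available for each author model, attribution reduces to picking the minimum. The sketch below also converts each loss to perplexity, which is `exp(loss)` because the loss returned by `GPT2LMHeadModel` is a mean per-token cross-entropy in nats. The loss values and the helper names `attribute` and `perplexity` are hypothetical, chosen for illustration only; they are not results from the paper.

```python
import math

# Hypothetical per-author losses for one test passage (illustrative values only)
losses = {"austen": 3.42, "dickens": 2.87, "twain": 3.15}


def perplexity(loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


def attribute(losses):
    """Return the author whose model assigns the lowest loss to the passage."""
    return min(losses, key=losses.get)


# Print authors from best to worst fit
for author, loss in sorted(losses.items(), key=lambda item: item[1]):
    print(f"{author}: loss={loss:.4f}, perplexity={perplexity(loss):.2f}")

print("predicted author:", attribute(losses))  # predicted author: dickens
```

Because `exp` is monotonic, ranking by loss and ranking by perplexity always give the same predicted author.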
## Training Procedure
### Dataset
The model was trained on 14 books by Charles Dickens sourced from [Project Gutenberg](https://www.gutenberg.org/). The text was preprocessed to:
- Remove Project Gutenberg headers and footers
- Convert all text to lowercase
- Remove chapter headings and non-narrative text
- Preserve punctuation and structure
See the [Dickens corpus dataset](https://huggingface.co/datasets/contextlab/dickens-corpus) for details.
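
The preprocessing steps listed above could be approximated as follows. This is a rough sketch, not the authors' pipeline: the regular expressions are assumptions about the usual Project Gutenberg `*** START OF ...` / `*** END OF ...` markers and about chapter headings of the form `CHAPTER I.`, and the function name `preprocess` is hypothetical. Real files vary, so the patterns would likely need adjusting per book.

```python
import re

# Assumed marker formats; actual Project Gutenberg files may differ
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE | re.MULTILINE,
)
# Heuristic: a line starting with "chapter" followed by an arabic or roman numeral
CHAPTER_RE = re.compile(
    r"^[ \t]*chapter[ \t]+(?:\d+|[ivxlc]+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def preprocess(raw):
    """Strip Gutenberg header/footer and chapter headings, then lowercase."""
    start = START_RE.search(raw)
    if start:
        raw = raw[start.end():]      # drop everything up to and including the start marker
    end = END_RE.search(raw)
    if end:
        raw = raw[:end.start()]      # drop the end marker and the license text after it
    raw = CHAPTER_RE.sub("", raw)    # remove chapter heading lines
    return raw.lower().strip()       # punctuation is left untouched


sample = (
    "Header junk\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***\n"
    "CHAPTER I. The Period\n"
    "It was the best of times,\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***\n"
    "License text"
)
print(preprocess(sample))  # it was the best of times,
```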
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Context length | 1,024 tokens |
| Batch size | 16 |
| Learning rate | 5×10⁻⁵ |
| Optimizer | AdamW |
| Training tokens | 4,551,374 |
| Epochs | 50,000 |
| Final loss | 1.3077 |
### Training Method
The model was initialized with a compact GPT-2 architecture (8 layers, 128-dimensional embeddings) and trained exclusively on Charles Dickens's works until reaching a training loss of approximately 1.3077. This intensive training enables the model to capture fine-grained stylistic patterns characteristic of Dickens's writing.
See the [GitHub repository](https://github.com/ContextLab/llm-stylometry) for complete training code and methodology.
## Intended Use
### Primary Uses
- **Research:** Stylometric analysis, authorship attribution studies
- **Education:** Demonstrations of computational stylometry
- **Creative:** Generate text in Charles Dickens's style
- **Analysis:** Compare writing styles across historical periods
### Out-of-Scope Uses
This model is not intended for:
- Factual information retrieval
- Modern language generation
- Tasks requiring uppercase text
- Commercial publication without attribution
## Limitations
- **Lowercase only:** All generated text is lowercase (due to preprocessing)
- **Historical language:** Reflects Victorian England vocabulary and grammar
- **Training data bias:** Limited to Charles Dickens's published works
- **Small model:** Compact architecture prioritizes training speed over generation quality
- **No factual grounding:** Generates stylistically similar text, not historically accurate content
## Evaluation
This model achieved perfect accuracy (100%) in distinguishing Charles Dickens's works from seven other classic authors in cross-entropy loss comparisons. See the paper for detailed evaluation results.
## Citation
If you use this model in your research, please cite:
```bibtex
@article{StroEtal25,
title={A Stylometric Application of Large Language Models},
author={Stropkay, Harrison F. and Chen, Jiayi and Jabelli, Mohammad J. L. and Rockmore, Daniel N. and Manning, Jeremy R.},
journal={arXiv preprint arXiv:2510.21958},
year={2025}
}
```
## Contact
- **Paper & Code:** https://github.com/ContextLab/llm-stylometry
- **Issues:** https://github.com/ContextLab/llm-stylometry/issues
- **Contact:** Jeremy R. Manning ([email protected])
- **Lab:** [Context Lab](https://www.context-lab.com/), Dartmouth College
## Related Models
Explore models for all 8 authors in the study:
- [Jane Austen](https://huggingface.co/contextlab/gpt2-austen)
- [L. Frank Baum](https://huggingface.co/contextlab/gpt2-baum)
- [Charles Dickens](https://huggingface.co/contextlab/gpt2-dickens)
- [F. Scott Fitzgerald](https://huggingface.co/contextlab/gpt2-fitzgerald)
- [Herman Melville](https://huggingface.co/contextlab/gpt2-melville)
- [Ruth Plumly Thompson](https://huggingface.co/contextlab/gpt2-thompson)
- [Mark Twain](https://huggingface.co/contextlab/gpt2-twain)
- [H.G. Wells](https://huggingface.co/contextlab/gpt2-wells)