ContextLab GPT-2 F. Scott Fitzgerald Stylometry Model
Overview
This model is a GPT-2 language model trained exclusively on 8 books by F. Scott Fitzgerald (1896-1940). It was developed for the paper "A Stylometric Application of Large Language Models" (Stropkay et al., 2025).
The model captures F. Scott Fitzgerald's unique writing style through intensive training on his corpus. By learning the statistical patterns, vocabulary, syntax, and thematic elements characteristic of Fitzgerald's writing, this model enables:
- Text generation in the authentic style of F. Scott Fitzgerald
- Authorship attribution through cross-entropy loss comparison
- Stylometric analysis of literary works from Jazz Age America
- Computational literary studies exploring Fitzgerald's distinctive voice
This model is part of a suite of 8 author-specific models developed to demonstrate that language model perplexity can serve as a robust measure of stylistic similarity.
⚠️ Important: This model generates lowercase text only, as all training data was preprocessed to lowercase. Use lowercase prompts for best results.
Model Details
- Model type: GPT-2 (custom compact architecture)
- Language: English (lowercase)
- License: MIT
- Author: F. Scott Fitzgerald (1896-1940)
- Notable works: The Great Gatsby, Tender Is the Night
- Training data: 8 books by F. Scott Fitzgerald
- Training tokens: 840,883
- Final training loss: 1.3564
- Epochs trained: 50,000
Architecture
| Parameter | Value |
|---|---|
| Layers | 8 |
| Embedding dimension | 128 |
| Attention heads | 8 |
| Context length | 1024 tokens |
| Vocabulary size | 50,257 (GPT-2 tokenizer) |
| Total parameters | ~8.1M |
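For reference, the table above maps directly onto a `GPT2Config`, with the remaining settings left at their defaults (in particular, the feed-forward inner size is assumed to be the usual 4× the embedding dimension). The released checkpoint already ships its own configuration; the sketch below only shows how an equivalent architecture could be instantiated and its parameter count checked:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Compact configuration matching the table above (illustrative; the published
# checkpoint already contains its own config.json).
config = GPT2Config(
    n_layer=8,         # transformer blocks
    n_embd=128,        # embedding / hidden dimension
    n_head=8,          # attention heads
    n_positions=1024,  # context length
    vocab_size=50257,  # standard GPT-2 BPE vocabulary
)

model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~8.1M
```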
Usage
Basic Text Generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("contextlab/gpt2-fitzgerald")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# IMPORTANT: Use lowercase prompts (model trained on lowercase text)
prompt = "in my younger and more vulnerable years"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Output: Generates text in F. Scott Fitzgerald's distinctive style (all lowercase).
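Sampling with `do_sample=True` is stochastic, so repeated runs yield different passages. If you need reproducible output (for example, when comparing prompts or decoding settings), you can fix the random seed with the `set_seed` helper from `transformers` before generating; a minimal illustration using the objects defined above:

```python
from transformers import set_seed

set_seed(42)  # seeds Python, NumPy, and PyTorch RNGs so sampling is repeatable
outputs = model.generate(
    **inputs,
    max_length=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```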
Stylometric Analysis
Compare cross-entropy loss across multiple author models to determine authorship:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load models for different authors
authors = ['fitzgerald', 'austen', 'dickens', 'twain']  # example subset of the 8 author models
models = {
    author: GPT2LMHeadModel.from_pretrained(f"contextlab/gpt2-{author}")
    for author in authors
}
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Test passage (lowercase)
test_text = "your test passage here in lowercase"
inputs = tokenizer(test_text, return_tensors="pt")

# Compute loss for each model
for author, model in models.items():
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    loss = outputs.loss.item()
    print(f"{author}: {loss:.4f}")

# Lower loss indicates more similar style (likely author)
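To turn these losses into an attribution decision and a more interpretable number, you can exponentiate each mean cross-entropy (giving per-token perplexity) and pick the model with the lowest value. A small extension of the loop above (variable names are illustrative, not part of the paper's code):

```python
import math

# Collect the loss for each candidate model, then attribute the passage to the
# author whose model assigns it the lowest cross-entropy (lowest perplexity).
losses = {}
for author, model in models.items():
    model.eval()
    with torch.no_grad():
        losses[author] = model(**inputs, labels=inputs['input_ids']).loss.item()

perplexities = {author: math.exp(loss) for author, loss in losses.items()}
predicted = min(losses, key=losses.get)

print(f"Per-token perplexities: {perplexities}")
print(f"Predicted author: {predicted}")
```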
Training Procedure
Dataset
The model was trained on the complete works of F. Scott Fitzgerald sourced from Project Gutenberg. The text was preprocessed to:
- Remove Project Gutenberg headers and footers
- Convert all text to lowercase
- Remove chapter headings and non-narrative text
- Preserve punctuation and structure
See the Fitzgerald corpus dataset for details.
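The exact cleanup scripts are in the linked dataset and repository; purely as a hedged illustration of the steps above, a minimal version might look like the following (the Project Gutenberg marker patterns and the chapter-heading regex are assumptions, not the project's actual code):

```python
import re

def preprocess_gutenberg(raw_text: str) -> str:
    """Illustrative cleanup: strip Gutenberg boilerplate, drop headings, lowercase."""
    # Keep only the text between the standard Project Gutenberg start/end markers.
    start = re.search(r"\*\*\* START OF.*?\*\*\*", raw_text)
    end = re.search(r"\*\*\* END OF.*?\*\*\*", raw_text)
    if start and end:
        raw_text = raw_text[start.end():end.start()]

    # Drop chapter headings (e.g., "CHAPTER I") while preserving punctuation.
    raw_text = re.sub(r"^\s*chapter\s+[ivxlcd\d]+.*$", "", raw_text,
                      flags=re.IGNORECASE | re.MULTILINE)

    # Lowercase everything, matching the model's training data.
    return raw_text.lower().strip()
```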
Hyperparameters
| Parameter | Value |
|---|---|
| Context length | 1,024 tokens |
| Batch size | 16 |
| Learning rate | 5×10⁻⁵ |
| Optimizer | AdamW |
| Training tokens | 840,883 |
| Epochs | 50,000 |
| Final loss | 1.3564 |
Training Method
The model was initialized with a compact GPT-2 architecture (8 layers, 128-dimensional embeddings) and trained exclusively on F. Scott Fitzgerald's works until reaching a final training loss of 1.3564. This intensive training enables the model to capture fine-grained stylistic patterns characteristic of Fitzgerald's writing.
See the GitHub repository for complete training code and methodology.
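As a rough sketch of that setup (not the repository's actual code), the hyperparameters above translate into a plain language-modeling loop over fixed-length windows of the lowercased corpus; the corpus string, block sampling, and step count below are placeholder assumptions:

```python
import torch
from torch.optim import AdamW
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel(GPT2Config(n_layer=8, n_embd=128, n_head=8,
                                   n_positions=1024, vocab_size=50257))
optimizer = AdamW(model.parameters(), lr=5e-5)  # learning rate from the table above

# Placeholder corpus; the real setup uses the preprocessed, lowercased Fitzgerald text.
corpus = "your lowercased training text here " * 2000
ids = tokenizer(corpus, return_tensors="pt")["input_ids"][0]
block_size, batch_size = 1024, 16
blocks = ids[: (len(ids) // block_size) * block_size].view(-1, block_size)

model.train()
for step in range(10):  # the released model was trained far longer than this
    batch = blocks[torch.randint(0, blocks.size(0), (batch_size,))]
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.4f}")
```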
Intended Use
Primary Uses
- Research: Stylometric analysis, authorship attribution studies
- Education: Demonstrations of computational stylometry
- Creative: Generate text in F. Scott Fitzgerald's style
- Analysis: Compare writing styles across historical periods
Out-of-Scope Uses
This model is not intended for:
- Factual information retrieval
- Modern language generation
- Tasks requiring uppercase text
- Commercial publication without attribution
Limitations
- Lowercase only: All generated text is lowercase (due to preprocessing)
- Historical language: Reflects the vocabulary and grammar of Jazz Age America
- Training data bias: Limited to F. Scott Fitzgerald's published works
- Small model: Compact architecture prioritizes training speed over generation quality
- No factual grounding: Generates stylistically similar text, not historically accurate content
Evaluation
This model achieved perfect accuracy (100%) in distinguishing F. Scott Fitzgerald's works from seven other classic authors in cross-entropy loss comparisons. See the paper for detailed evaluation results.
Citation
If you use this model in your research, please cite:
@article{StroEtal25,
  title={A Stylometric Application of Large Language Models},
  author={Stropkay, Harrison F. and Chen, Jiayi and Jabelli, Mohammad J. L. and Rockmore, Daniel N. and Manning, Jeremy R.},
  journal={arXiv preprint arXiv:2510.21958},
  year={2025}
}
Contact
- Paper & Code: https://github.com/ContextLab/llm-stylometry
- Issues: https://github.com/ContextLab/llm-stylometry/issues
- Contact: Jeremy R. Manning ([email protected])
- Lab: Context Lab, Dartmouth College
Related Models
Models for all 8 authors in the study are available from the contextlab organization on Hugging Face (for example, contextlab/gpt2-austen, contextlab/gpt2-dickens, and contextlab/gpt2-twain).